|  | Register allocation in Subzero | 
|  | ============================== | 
|  |  | 
|  | Introduction | 
|  | ------------ | 
|  |  | 
|  | `Subzero | 
|  | <https://chromium.googlesource.com/native_client/pnacl-subzero/+/master/docs/DESIGN.rst>`_ | 
|  | is a fast code generator that translates architecture-independent `PNaCl bitcode | 
|  | <https://developer.chrome.com/native-client/reference/pnacl-bitcode-abi>`_ into | 
|  | architecture-specific machine code.  PNaCl bitcode is LLVM bitcode that has been | 
|  | simplified (e.g. weird-width primitive types like 57-bit integers are not | 
|  | allowed) and has had architecture-independent optimizations already applied. | 
|  | Subzero aims to generate high-quality code as fast as practical, and as such | 
|  | Subzero needs to make tradeoffs between compilation speed and output code | 
|  | quality. | 
|  |  | 
|  | In Subzero, we have found register allocation to be one of the most important | 
|  | optimizations toward achieving the best code quality, which is in tension with | 
|  | register allocation's reputation for being complex and expensive.  Linear-scan | 
|  | register allocation is a modern favorite for getting fairly good register | 
|  | allocation at relatively low cost.  Subzero uses linear-scan for its core | 
|  | register allocator, with a few internal modifications to improve its | 
|  | effectiveness (specifically: register preference, register overlap, and register | 
|  | aliases).  Subzero also does a few high-level things on top of its core register | 
|  | allocator to improve overall effectiveness (specifically: repeat until | 
|  | convergence, delayed phi lowering, and local live range splitting). | 
|  |  | 
|  | What we describe here are techniques that have worked well for Subzero, in the | 
|  | context of its particular intermediate representation (IR) and compilation | 
|  | strategy.  Some of these techniques may not be applicable to another compiler, | 
|  | depending on its particular IR and compilation strategy.  Some key concepts in | 
|  | Subzero are the following: | 
|  |  | 
|  | - Subzero's ``Variable`` operand is an operand that resides either on the stack | 
|  | or in a physical register.  A Variable can be tagged as *must-have-register* | 
|  | or *must-not-have-register*, but its default is *may-have-register*.  All uses | 
|  | of the Variable map to the same physical register or stack location. | 
|  |  | 
|  | - Basic lowering is done before register allocation.  Lowering is the process of | 
|  | translating PNaCl bitcode instructions into native instructions.  Sometimes a | 
|  | native instruction, like the x86 ``add`` instruction, allows one of its | 
|  | Variable operands to be either in a physical register or on the stack, in | 
|  | which case the lowering is relatively simple.  But if the lowered instruction | 
|  | requires the operand to be in a physical register, we generate code that | 
|  | copies the Variable into a *must-have-register* temporary, and then use that | 
|  | temporary in the lowered instruction. | 
|  |  | 
|  | - Instructions within a basic block appear in a linearized order (as opposed to | 
|  | something like a directed acyclic graph of dependency edges). | 
|  |  | 
|  | - An instruction has 0 or 1 *dest* Variables and an arbitrary (but usually | 
|  | small) number of *source* Variables.  Assuming SSA form, the instruction | 
|  | begins the live range of the dest Variable, and may end the live range of one | 
|  | or more of the source Variables. | 
|  |  | 
|  | Executive summary | 
|  | ----------------- | 
|  |  | 
|  | - Liveness analysis and live range construction are prerequisites for register | 
|  | allocation.  Without careful attention, they can be potentially very | 
|  | expensive, especially when the number of variables and basic blocks gets very | 
|  | large.  Subzero uses some clever data structures to take advantage of the | 
|  | sparsity of the data, resulting in stable performance as function size scales | 
|  | up.  This means that the proportion of compilation time spent on liveness | 
|  | analysis stays roughly the same. | 
|  |  | 
|  | - The core of Subzero's register allocator is taken from `Mössenböck and | 
|  | Pfeiffer's paper <ftp://ftp.ssw.uni-linz.ac.at/pub/Papers/Moe02.PDF>`_ on | 
|  | linear-scan register allocation. | 
|  |  | 
|  | - We enhance the core algorithm with a good automatic preference mechanism when | 
|  | more than one register is available, to try to minimize register shuffling. | 
|  |  | 
|  | - We also enhance it to allow overlapping live ranges to share the same | 
|  | register, when one variable is recognized as a read-only copy of the other. | 
|  |  | 
|  | - We deal with register aliasing in a clean and general fashion.  Register | 
|  | aliasing is when e.g. 16-bit architectural registers share some bits with | 
|  | their 32-bit counterparts, or 64-bit registers are actually pairs of 32-bit | 
|  | registers. | 
|  |  | 
|  | - We improve register allocation by rerunning the algorithm on likely candidates | 
|  | that didn't get registers during the previous iteration, without imposing much | 
|  | additional cost. | 
|  |  | 
|  | - The main register allocation is done before phi lowering, because phi lowering | 
|  | imposes early and unnecessary ordering constraints on the resulting | 
|  | assigments, which create spurious interferences in live ranges. | 
|  |  | 
|  | - Within each basic block, we aggressively split each variable's live range at | 
|  | every use, so that portions of the live range can get registers even if the | 
|  | whole live range can't.  Doing this separately for each basic block avoids | 
|  | merge complications, and keeps liveness analysis and register allocation fast | 
|  | by fitting well into the sparsity optimizations of their data structures. | 
|  |  | 
|  | Liveness analysis | 
|  | ----------------- | 
|  |  | 
|  | With respect to register allocation, the main purpose of liveness analysis is to | 
|  | calculate the live range of each variable.  The live range is represented as a | 
|  | set of instruction number ranges.  Instruction numbers within a basic block must | 
|  | be monotonically increasing, and the instruction ranges of two different basic | 
|  | blocks must not overlap. | 
|  |  | 
|  | Basic liveness | 
|  | ^^^^^^^^^^^^^^ | 
|  |  | 
|  | Liveness analysis is a straightforward dataflow algorithm.  For each basic | 
|  | block, we keep track of the live-in and live-out set, i.e. the set of variables | 
|  | that are live coming into or going out of the basic block.  Processing of a | 
|  | basic block starts by initializing a temporary set as the union of the live-in | 
|  | sets of the basic block's successor blocks.  (This basic block's live-out set is | 
|  | captured as the initial value of the temporary set.)  Then each instruction of | 
|  | the basic block is processed in reverse order.  All the source variables of the | 
|  | instruction are marked as live, by adding them to the temporary set, and the | 
|  | dest variable of the instruction (if any) is marked as not-live, by removing it | 
|  | from the temporary set.  When we finish processing all of the block's | 
|  | instructions, we add/union the temporary set into the basic block's live-in set. | 
|  | If this changes the basic block's live-in set, then we mark all of this basic | 
|  | block's predecessor blocks to be reprocessed.  Then we repeat for other basic | 
|  | blocks until convergence, i.e. no more basic blocks are marked to be | 
|  | reprocessed.  If basic blocks are processed in reverse topological order, then | 
|  | the number of times each basic block need to be reprocessed is generally its | 
|  | loop nest depth. | 
|  |  | 
|  | The result of this pass is the live-in and live-out set for each basic block. | 
|  |  | 
|  | With so many set operations, choice of data structure is crucial for | 
|  | performance.  We tried a few options, and found that a simple dense bit vector | 
|  | works best.  This keeps the per-instruction cost very low.  However, we found | 
|  | that when the function gets very large, merging and copying bit vectors at basic | 
|  | block boundaries dominates the cost.  This is due to the number of variables | 
|  | (hence the bit vector size) and the number of basic blocks getting large. | 
|  |  | 
|  | A simple enhancement brought this under control in Subzero.  It turns out that | 
|  | the vast majority of variables are referenced, and therefore live, only in a | 
|  | single basic block.  This is largely due to the SSA form of PNaCl bitcode.  To | 
|  | take advantage of this, we partition variables by single-block versus | 
|  | multi-block liveness.  Multi-block variables get lower-numbered bit vector | 
|  | indexes, and single-block variables get higher-number indexes.  Single-block bit | 
|  | vector indexes are reused across different basic blocks.  As such, the size of | 
|  | live-in and live-out bit vectors is limited to the number of multi-block | 
|  | variables, and the temporary set's size can be limited to that plus the largest | 
|  | number of single-block variables across all basic blocks. | 
|  |  | 
|  | For the purpose of live range construction, we also need to track definitions | 
|  | (LiveBegin) and last-uses (LiveEnd) of variables used within instructions of the | 
|  | basic block.  These are easy to detect while processing the instructions; data | 
|  | structure choices are described below. | 
|  |  | 
|  | Live range construction | 
|  | ^^^^^^^^^^^^^^^^^^^^^^^ | 
|  |  | 
|  | After the live-in and live-out sets are calculated, we construct each variable's | 
|  | live range (as an ordered list of instruction ranges, described above).  We do | 
|  | this by considering the live-in and live-out sets, combined with LiveBegin and | 
|  | LiveEnd information.  This is done separately for each basic block. | 
|  |  | 
|  | As before, we need to take advantage of sparsity of variable uses across basic | 
|  | blocks, to avoid overly copying/merging data structures.  The following is what | 
|  | worked well for Subzero (after trying several other options). | 
|  |  | 
|  | The basic liveness pass, described above, keeps track of when a variable's live | 
|  | range begins or ends within the block.  LiveBegin and LiveEnd are unordered | 
|  | vectors where each element is a pair of the variable and the instruction number, | 
|  | representing that the particular variable's live range begins or ends at the | 
|  | particular instruction.  When the liveness pass finds a variable whose live | 
|  | range begins or ends, it appends and entry to LiveBegin or LiveEnd. | 
|  |  | 
|  | During live range construction, the LiveBegin and LiveEnd vectors are sorted by | 
|  | variable number.  Then we iterate across both vectors in parallel.  If a | 
|  | variable appears in both LiveBegin and LiveEnd, then its live range is entirely | 
|  | within this block.  If it appears in only LiveBegin, then its live range starts | 
|  | here and extends through the end of the block.  If it appears in only LiveEnd, | 
|  | then its live range starts at the beginning of the block and ends here.  (Note | 
|  | that this only covers the live range within this block, and this process is | 
|  | repeated across all blocks.) | 
|  |  | 
|  | It is also possible that a variable is live within this block but its live range | 
|  | does not begin or end in this block.  These variables are identified simply by | 
|  | taking the intersection of the live-in and live-out sets. | 
|  |  | 
|  | As a result of these data structures, performance of liveness analysis and live | 
|  | range construction tend to be stable across small, medium, and large functions, | 
|  | as measured by a fairly consistent proportion of total compilation time spent on | 
|  | the liveness passes. | 
|  |  | 
|  | Linear-scan register allocation | 
|  | ------------------------------- | 
|  |  | 
|  | The basis of Subzero's register allocator is the allocator described by | 
|  | Hanspeter Mössenböck and Michael Pfeiffer in `Linear Scan Register Allocation in | 
|  | the Context of SSA Form and Register Constraints | 
|  | <ftp://ftp.ssw.uni-linz.ac.at/pub/Papers/Moe02.PDF>`_.  It allows live ranges to | 
|  | be a union of intervals rather than a single conservative interval, and it | 
|  | allows pre-coloring of variables with specific physical registers. | 
|  |  | 
|  | The paper suggests an approach of aggressively coalescing variables across Phi | 
|  | instructions (i.e., trying to force Phi source and dest variables to have the | 
|  | same register assignment), but we omit that in favor of the more natural | 
|  | preference mechanism described below. | 
|  |  | 
|  | We found the paper quite remarkable in that a straightforward implementation of | 
|  | its pseudo-code led to an entirely correct register allocator.  The only thing | 
|  | we found in the specification that was even close to a mistake is that it was | 
|  | too aggressive in evicting inactive ranges in the "else" clause of the | 
|  | AssignMemLoc routine.  An inactive range only needs to be evicted if it actually | 
|  | overlaps the current range being considered, whereas the paper evicts it | 
|  | unconditionally.  (Search for "original paper" in Subzero's register allocator | 
|  | source code.) | 
|  |  | 
|  | Register preference | 
|  | ------------------- | 
|  |  | 
|  | The linear-scan algorithm from the paper talks about choosing an available | 
|  | register, but isn't specific on how to choose among several available registers. | 
|  | The simplest approach is to just choose the first available register, e.g. the | 
|  | lowest-numbered register.  Often a better choice is possible. | 
|  |  | 
|  | Specifically, if variable ``V`` is assigned in an instruction ``V=f(S1,S2,...)`` | 
|  | with source variables ``S1,S2,...``, and that instruction ends the live range of | 
|  | one of those source variables ``Sn``, and ``Sn`` was assigned a register, then | 
|  | ``Sn``'s register is usually a good choice for ``V``.  This is especially true | 
|  | when the instruction is a simple assignment, because an assignment where the | 
|  | dest and source variables end up with the same register can be trivially elided, | 
|  | reducing the amount of register-shuffling code. | 
|  |  | 
|  | This requires being able to find and inspect a variable's defining instruction, | 
|  | which is not an assumption in the original paper but is easily tracked in | 
|  | practice. | 
|  |  | 
|  | Register overlap | 
|  | ---------------- | 
|  |  | 
|  | Because Subzero does register allocation after basic lowering, the lowering has | 
|  | to be prepared for the possibility of any given program variable not getting a | 
|  | physical register.  It does this by introducing *must-have-register* temporary | 
|  | variables, and copies the program variable into the temporary to ensure that | 
|  | register requirements in the target instruction are met. | 
|  |  | 
|  | In many cases, the program variable and temporary variable have overlapping live | 
|  | ranges, and would be forced to have different registers even if the temporary | 
|  | variable is effectively a read-only copy of the program variable.  We recognize | 
|  | this when the program variable has no definitions within the temporary | 
|  | variable's live range, and the temporary variable has no definitions within the | 
|  | program variable's live range with the exception of the copy assignment. | 
|  |  | 
|  | This analysis is done as part of register preference detection. | 
|  |  | 
|  | The main impact on the linear-scan implementation is that instead of | 
|  | setting/resetting a boolean flag to indicate whether a register is free or in | 
|  | use, we increment/decrement a number-of-uses counter. | 
|  |  | 
|  | Register aliases | 
|  | ---------------- | 
|  |  | 
|  | Sometimes registers of different register classes partially overlap.  For | 
|  | example, in x86, registers ``al`` and ``ah`` alias ``ax`` (though they don't | 
|  | alias each other), and all three alias ``eax`` and ``rax``.  And in ARM, | 
|  | registers ``s0`` and ``s1`` (which are single-precision floating-point | 
|  | registers) alias ``d0`` (double-precision floating-point), and registers ``d0`` | 
|  | and ``d1`` alias ``q0`` (128-bit vector register).  The linear-scan paper | 
|  | doesn't address this issue; it assumes that all registers are independent.  A | 
|  | simple solution is to essentially avoid aliasing by disallowing a subset of the | 
|  | registers, but there is obviously a reduction in code quality when e.g. half of | 
|  | the registers are taken away. | 
|  |  | 
|  | Subzero handles this more elegantly.  For each register, we keep a bitmask | 
|  | indicating which registers alias/conflict with it.  For example, in x86, | 
|  | ``ah``'s alias set is ``ah``, ``ax``, ``eax``, and ``rax`` (but not ``al``), and | 
|  | in ARM, ``s1``'s alias set is ``s1``, ``d0``, and ``q0``.  Whenever we want to | 
|  | mark a register as being used or released, we do the same for all of its | 
|  | aliases. | 
|  |  | 
|  | Before implementing this, we were a bit apprehensive about the compile-time | 
|  | performance impact.  Instead of setting one bit in a bit vector or decrementing | 
|  | one counter, this generally needs to be done in a loop that iterates over all | 
|  | aliases.  Fortunately, this seemed to make very little difference in | 
|  | performance, as the bulk of the cost ends up being in live range overlap | 
|  | computations, which are not affected by register aliasing. | 
|  |  | 
|  | Repeat until convergence | 
|  | ------------------------ | 
|  |  | 
|  | Sometimes the linear-scan algorithm makes a register assignment only to later | 
|  | revoke it in favor of a higher-priority variable, but it turns out that a | 
|  | different initial register choice would not have been revoked.  For relatively | 
|  | low compile-time cost, we can give those variables another chance. | 
|  |  | 
|  | During register allocation, we keep track of the revoked variables and then do | 
|  | another round of register allocation targeted only to that set.  We repeat until | 
|  | no new register assignments are made, which is usually just a handful of | 
|  | successively cheaper iterations. | 
|  |  | 
|  | Another approach would be to repeat register allocation for *all* variables that | 
|  | haven't had a register assigned (rather than variables that got a register that | 
|  | was subsequently revoked), but our experience is that this greatly increases | 
|  | compile-time cost, with little or no code quality gain. | 
|  |  | 
|  | Delayed Phi lowering | 
|  | -------------------- | 
|  |  | 
|  | The linear-scan algorithm works for phi instructions as well as regular | 
|  | instructions, but it is tempting to lower phi instructions into non-SSA | 
|  | assignments before register allocation, so that register allocation need only | 
|  | happen once. | 
|  |  | 
|  | Unfortunately, simple phi lowering imposes an arbitrary ordering on the | 
|  | resulting assignments that can cause artificial overlap/interference between | 
|  | lowered assignments, and can lead to worse register allocation decisions.  As a | 
|  | simple example, consider these two phi instructions which are semantically | 
|  | unordered:: | 
|  |  | 
|  | A = phi(B) // B's live range ends here | 
|  | C = phi(D) // D's live range ends here | 
|  |  | 
|  | A straightforward lowering might yield:: | 
|  |  | 
|  | A = B // end of B's live range | 
|  | C = D // end of D's live range | 
|  |  | 
|  | The potential problem here is that A and D's live ranges overlap, and so they | 
|  | are prevented from having the same register.  Swapping the order of lowered | 
|  | assignments fixes that (but then B and C would overlap), but we can't really | 
|  | know which is better until after register allocation. | 
|  |  | 
|  | Subzero deals with this by doing the main register allocation before phi | 
|  | lowering, followed by phi lowering, and finally a special register allocation | 
|  | pass limited to the new lowered assignments. | 
|  |  | 
|  | Phi lowering considers the phi operands separately for each predecessor edge, | 
|  | and starts by finding a topological sort of the Phi instructions, such that | 
|  | assignments can be executed in that order without violating dependencies on | 
|  | registers or stack locations.  If a topological sort is not possible due to a | 
|  | cycle, the cycle is broken by introducing a temporary, e.g. ``A=B;B=C;C=A`` can | 
|  | become ``T=A;A=B;B=C;C=T``.  The topological order is tuned to favor freeing up | 
|  | registers early to reduce register pressure. | 
|  |  | 
|  | It then lowers the linearized assignments into machine instructions (which may | 
|  | require extra physical registers e.g. to copy from one stack location to | 
|  | another), and finally runs the register allocator limited to those instructions. | 
|  |  | 
|  | In rare cases, the register allocator may fail on some *must-have-register* | 
|  | variable, because register pressure is too high to satisfy requirements arising | 
|  | from cycle-breaking temporaries and registers required for stack-to-stack | 
|  | copies.  In this case, we have to find a register with no active uses within the | 
|  | variable's live range, and actively spill/restore that register around the live | 
|  | range.  This makes the code quality suffer and may be slow to implement | 
|  | depending on compiler data structures, but in practice we find the situation to | 
|  | be vanishingly rare and so not really worth optimizing. | 
|  |  | 
|  | Local live range splitting | 
|  | -------------------------- | 
|  |  | 
|  | The basic linear-scan algorithm has an "all-or-nothing" policy: a variable gets | 
|  | a register for its entire live range, or not at all.  This is unfortunate when a | 
|  | variable has many uses close together, but ultimately a large enough live range | 
|  | to prevent register assignment.  Another bad example is on x86 where all vector | 
|  | and floating-point registers (the ``xmm`` registers) are killed by call | 
|  | instructions, per the x86 call ABI, so such variables are completely prevented | 
|  | from having a register when their live ranges contain a call instruction. | 
|  |  | 
|  | The general solution here is some policy for splitting live ranges.  A variable | 
|  | can be split into multiple copies and each can be register-allocated separately. | 
|  | The complication comes in finding a sane policy for where and when to split | 
|  | variables such that complexity doesn't explode, and how to join the different | 
|  | values at merge points. | 
|  |  | 
|  | Subzero implements aggressive block-local splitting of variables.  Each basic | 
|  | block is handled separately and independently.  Within the block, we maintain a | 
|  | table ``T`` that maps each variable ``V`` to its split version ``T[V]``, with | 
|  | every variable ``V``'s initial state set (implicitly) as ``T[V]=V``.  For each | 
|  | instruction in the block, and for each *may-have-register* variable ``V`` in the | 
|  | instruction, we do the following: | 
|  |  | 
|  | - Replace all uses of ``V`` in the instruction with ``T[V]``. | 
|  |  | 
|  | - Create a new split variable ``V'``. | 
|  |  | 
|  | - Add a new assignment ``V'=T[V]``, placing it adjacent to (either immediately | 
|  | before or immediately after) the current instruction. | 
|  |  | 
|  | - Update ``T[V]`` to be ``V'``. | 
|  |  | 
|  | This leads to a chain of copies of ``V`` through the block, linked by assignment | 
|  | instructions.  The live ranges of these copies are usually much smaller, and | 
|  | more likely to get registers.  In fact, because of the preference mechanism | 
|  | described above, they are likely to get the same register whenever possible. | 
|  |  | 
|  | One obvious question comes up: won't this proliferation of new variables cause | 
|  | an explosion in the running time of liveness analysis and register allocation? | 
|  | As it turns out, not really. | 
|  |  | 
|  | First, for register allocation, the cost tends to be dominated by live range | 
|  | overlap computations, whose cost is roughly proportional to the size of the live | 
|  | range.  All the new variable copies' live ranges sum up to the original | 
|  | variable's live range, so the cost isn't vastly greater. | 
|  |  | 
|  | Second, for liveness analysis, the cost is dominated by merging bit vectors | 
|  | corresponding to the set of variables that have multi-block liveness.  All the | 
|  | new copies are guaranteed to be single-block, so the main additional cost is | 
|  | that of processing the new assignments. | 
|  |  | 
|  | There's one other key issue here.  The original variable and all of its copies | 
|  | need to be "linked", in the sense that all of these variables that get a stack | 
|  | slot (because they didn't get a register) are guaranteed to have the same stack | 
|  | slot.  This way, we can avoid generating any code related to ``V'=V`` when | 
|  | neither gets a register.  In addition, we can elide instructions that write a | 
|  | value to a stack slot that is linked to another stack slot, because that is | 
|  | guaranteed to be just rewriting the same value to the stack. |