| ================== |
| Vectorization Plan |
| ================== |
| |
| .. contents:: |
| :local: |
| |
| Abstract |
| ======== |
| The vectorization transformation can be rather complicated, involving several |
| potential alternatives, especially for outer-loops [1]_ but also possibly for |
| innermost loops. These alternatives may have significant performance impact, |
| both positive and negative. A cost model is therefore employed to identify the |
| best alternative, including the alternative of avoiding any transformation |
| altogether. |
| |
| The Vectorization Plan is an explicit model for describing vectorization |
| candidates. It serves for both optimizing candidates including estimating their |
| cost reliably, and for performing their final translation into IR. This |
| facilitates dealing with multiple vectorization candidates. |
| |
| High-level Design |
| ================= |
| |
| Vectorization Workflow |
| ---------------------- |
| VPlan-based vectorization involves three major steps, taking a "scenario-based |
| approach" to vectorization planning: |
| |
| 1. Legal Step: check if a loop can be legally vectorized; encode constraints and |
| artifacts if so. |
| 2. Plan Step: |
| |
| a. Build initial VPlans following the constraints and decisions taken by |
| Legal Step 1, and compute their cost. |
| b. Apply optimizations to the VPlans, possibly forking additional VPlans. |
| Prune sub-optimal VPlans having relatively high cost. |
| 3. Execute Step: materialize the best VPlan. Note that this is the only step |
| that modifies the IR. |
| |
| Design Guidelines |
| ----------------- |
| In what follows, the term "input IR" refers to code that is fed into the |
| vectorizer whereas the term "output IR" refers to code that is generated by the |
| vectorizer. The output IR contains code that has been vectorized or "widened" |
| according to a loop Vectorization Factor (VF), and/or loop unroll-and-jammed |
| according to an Unroll Factor (UF). |
| The design of VPlan follows several high-level guidelines: |
| |
| 1. Analysis-like: building and manipulating VPlans must not modify the input IR. |
| In particular, if the best option is not to vectorize at all, the |
| vectorization process terminates before reaching Step 3, and compilation |
| should proceed as if VPlans had not been built. |
| |
| 2. Align Cost & Execute: each VPlan must support both estimating the cost and |
| generating the output IR code, such that the cost estimation evaluates the |
| to-be-generated code reliably. |
| |
| 3. Support vectorizing additional constructs: |
| |
| a. Outer-loop vectorization. In particular, VPlan must be able to model the |
| control-flow of the output IR which may include multiple basic-blocks and |
| nested loops. |
| b. SLP vectorization. |
| c. Combinations of the above, including nested vectorization: vectorizing |
| both an inner loop and an outer-loop at the same time (each with its own |
| VF and UF), mixed vectorization: vectorizing a loop with SLP patterns |
| inside [4]_, (re)vectorizing input IR containing vector code. |
| d. Function vectorization [2]_. |
| |
| 4. Support multiple candidates efficiently. In particular, similar candidates |
| related to a range of possible VF's and UF's must be represented efficiently. |
| Potential versioning needs to be supported efficiently. |
| |
| 5. Support vectorizing idioms, such as interleaved groups of strided loads or |
| stores. This is achieved by modeling a sequence of output instructions using |
| a "Recipe", which is responsible for computing its cost and generating its |
| code. |
| |
| 6. Encapsulate Single-Entry Single-Exit regions (SESE). During vectorization |
| such regions may need to be, for example, predicated and linearized, or |
| replicated VF*UF times to handle scalarized and predicated instructions. |
| Innerloops are also modelled as SESE regions. |
| |
| 7. Support instruction-level analysis and transformation, as part of Planning |
| Step 2.b: During vectorization instructions may need to be traversed, moved, |
| replaced by other instructions or be created. For example, vector idiom |
| detection and formation involves searching for and optimizing instruction |
| patterns. |
| |
| Definitions |
| =========== |
| The low-level design of VPlan comprises of the following classes. |
| |
| :LoopVectorizationPlanner: |
| A LoopVectorizationPlanner is designed to handle the vectorization of a loop |
| or a loop nest. It can construct, optimize and discard one or more VPlans, |
| each VPlan modelling a distinct way to vectorize the loop or the loop nest. |
| Once the best VPlan is determined, including the best VF and UF, this VPlan |
| drives the generation of output IR. |
| |
| :VPlan: |
| A model of a vectorized candidate for a given input IR loop or loop nest. This |
| candidate is represented using a Hierarchical CFG. VPlan supports estimating |
| the cost and driving the generation of the output IR code it represents. |
| |
| :Hierarchical CFG: |
| A control-flow graph whose nodes are basic-blocks or Hierarchical CFG's. The |
| Hierarchical CFG data structure is similar to the Tile Tree [5]_, where |
| cross-Tile edges are lifted to connect Tiles instead of the original |
| basic-blocks as in Sharir [6]_, promoting the Tile encapsulation. The terms |
| Region and Block are used rather than Tile [5]_ to avoid confusion with loop |
| tiling. |
| |
| :VPBlockBase: |
| The building block of the Hierarchical CFG. A pure-virtual base-class of |
| VPBasicBlock and VPRegionBlock, see below. VPBlockBase models the hierarchical |
| control-flow relations with other VPBlocks. Note that in contrast to the IR |
| BasicBlock, a VPBlockBase models its control-flow successors and predecessors |
| directly, rather than through a Terminator branch or through predecessor |
| branches that "use" the VPBlockBase. |
| |
| :VPBasicBlock: |
| VPBasicBlock is a subclass of VPBlockBase, and serves as the leaves of the |
| Hierarchical CFG. It represents a sequence of output IR instructions that will |
| appear consecutively in an output IR basic-block. The instructions of this |
| basic-block originate from one or more VPBasicBlocks. VPBasicBlock holds a |
| sequence of zero or more VPRecipes that model the cost and generation of the |
| output IR instructions. |
| |
| :VPRegionBlock: |
| VPRegionBlock is a subclass of VPBlockBase. It models a collection of |
| VPBasicBlocks and VPRegionBlocks which form a SESE subgraph of the output IR |
| CFG. A VPRegionBlock may indicate that its contents are to be replicated a |
| constant number of times when output IR is generated, effectively representing |
| a loop with constant trip-count that will be completely unrolled. This is used |
| to support scalarized and predicated instructions with a single model for |
| multiple candidate VF's and UF's. |
| |
| :VPRecipeBase: |
| A pure-virtual base class modeling a sequence of one or more output IR |
| instructions, possibly based on one or more input IR instructions. These |
| input IR instructions are referred to as "Ingredients" of the Recipe. A Recipe |
| may specify how its ingredients are to be transformed to produce the output IR |
| instructions; e.g., cloned once, replicated multiple times or widened |
| according to selected VF. |
| |
| :VPValue: |
| The base of VPlan's def-use relations class hierarchy. When instantiated, it |
| models a constant or a live-in Value in VPlan. It has users, which are of type |
| VPUser, but no operands. |
| |
| :VPUser: |
| A VPValue representing a general vertex in the def-use graph of VPlan. It has |
| operands which are of type VPValue. When instantiated, it represents a |
| live-out Instruction that exists outside VPlan. VPUser is similar in some |
| aspects to LLVM's User class. |
| |
| :VPInstruction: |
| A VPInstruction is both a VPRecipe and a VPUser. It models a single |
| VPlan-level instruction to be generated if the VPlan is executed, including |
| its opcode and possibly additional characteristics. It is the basis for |
| writing instruction-level analyses and optimizations in VPlan as creating, |
| replacing or moving VPInstructions record both def-use and scheduling |
| decisions. VPInstructions also extend LLVM IR's opcodes with idiomatic |
| operations that enrich the Vectorizer's semantics. |
| |
| :VPTransformState: |
| Stores information used for generating output IR, passed from |
| LoopVectorizationPlanner to its selected VPlan for execution, and used to pass |
| additional information down to VPBlocks and VPRecipes. |
| |
| The Planning Process and VPlan Roadmap |
| ====================================== |
| |
| Transforming the Loop Vectorizer to use VPlan follows a staged approach. First, |
| VPlan is used to record the final vectorization decisions, and to execute them: |
| the Hierarchical CFG models the planned control-flow, and Recipes capture |
| decisions taken inside basic-blocks. Next, VPlan will be used also as the basis |
| for taking these decisions, effectively turning them into a series of |
| VPlan-to-VPlan algorithms. Finally, VPlan will support the planning process |
| itself including cost-based analyses for making these decisions, to fully |
| support compositional and iterative decision making. |
| |
| Some decisions are local to an instruction in the loop, such as whether to widen |
| it into a vector instruction or replicate it, keeping the generated instructions |
| in place. Other decisions, however, involve moving instructions, replacing them |
| with other instructions, and/or introducing new instructions. For example, a |
| cast may sink past a later instruction and be widened to handle first-order |
| recurrence; an interleave group of strided gathers or scatters may effectively |
| move to one place where they are replaced with shuffles and a common wide vector |
| load or store; new instructions may be introduced to compute masks, shuffle the |
| elements of vectors, and pack scalar values into vectors or vice-versa. |
| |
| In order for VPlan to support making instruction-level decisions and analyses, |
| it needs to model the relevant instructions along with their def/use relations. |
| This too follows a staged approach: first, the new instructions that compute |
| masks are modeled as VPInstructions, along with their induced def/use subgraph. |
| This effectively models masks in VPlan, facilitating VPlan-based predication. |
| Next, the logic embedded within each Recipe for generating its instructions at |
| VPlan execution time, will instead take part in the planning process by modeling |
| them as VPInstructions. Finally, only logic that applies to instructions as a |
| group will remain in Recipes, such as interleave groups and potentially other |
| idiom groups having synergistic cost. |
| |
| Related LLVM components |
| ----------------------- |
| 1. SLP Vectorizer: one can compare the VPlan model with LLVM's existing SLP |
| tree, where TSLP [3]_ adds Plan Step 2.b. |
| |
| 2. RegionInfo: one can compare VPlan's H-CFG with the Region Analysis as used by |
| Polly [7]_. |
| |
| 3. Loop Vectorizer: the Vectorization Plan aims to upgrade the infrastructure of |
| the Loop Vectorizer and extend it to handle outer loops [8]_, [9]_. |
| |
| References |
| ---------- |
| .. [1] "Outer-loop vectorization: revisited for short SIMD architectures", Dorit |
| Nuzman and Ayal Zaks, PACT 2008. |
| |
| .. [2] "Proposal for function vectorization and loop vectorization with function |
| calls", Xinmin Tian, [`cfe-dev |
| <http://lists.llvm.org/pipermail/cfe-dev/2016-March/047732.html>`_]., |
| March 2, 2016. |
| See also `review <https://reviews.llvm.org/D22792>`_. |
| |
| .. [3] "Throttling Automatic Vectorization: When Less is More", Vasileios |
| Porpodas and Tim Jones, PACT 2015 and LLVM Developers' Meeting 2015. |
| |
| .. [4] "Exploiting mixed SIMD parallelism by reducing data reorganization |
| overhead", Hao Zhou and Jingling Xue, CGO 2016. |
| |
| .. [5] "Register Allocation via Hierarchical Graph Coloring", David Callahan and |
| Brian Koblenz, PLDI 1991 |
| |
| .. [6] "Structural analysis: A new approach to flow analysis in optimizing |
| compilers", M. Sharir, Journal of Computer Languages, Jan. 1980 |
| |
| .. [7] "Enabling Polyhedral Optimizations in LLVM", Tobias Grosser, Diploma |
| thesis, 2011. |
| |
| .. [8] "Introducing VPlan to the Loop Vectorizer", Gil Rapaport and Ayal Zaks, |
| European LLVM Developers' Meeting 2017. |
| |
| .. [9] "Extending LoopVectorizer: OpenMP4.5 SIMD and Outer Loop |
| Auto-Vectorization", Intel Vectorizer Team, LLVM Developers' Meeting 2016. |