| llvm-mca - LLVM Machine Code Analyzer |
| ===================================== |
| |
| SYNOPSIS |
| -------- |
| |
| :program:`llvm-mca` [*options*] [input] |
| |
| DESCRIPTION |
| ----------- |
| |
| :program:`llvm-mca` is a performance analysis tool that uses information |
| available in LLVM (e.g. scheduling models) to statically measure the performance |
| of machine code on a specific CPU. |
| |
| Performance is measured in terms of throughput as well as processor resource |
| consumption. The tool currently works for processors with an out-of-order |
| backend, for which there is a scheduling model available in LLVM. |
| |
| The main goal of this tool is not just to predict the performance of the code |
| when run on the target, but also to help with diagnosing potential performance |
| issues. |
| |
| Given an assembly code sequence, llvm-mca estimates the Instructions Per Cycle |
| (IPC), as well as hardware resource pressure. The analysis and reporting style |
| were inspired by the IACA tool from Intel. |
| |
| :program:`llvm-mca` allows the usage of special code comments to mark regions of |
| the assembly code to be analyzed. A comment starting with the substring |
| ``LLVM-MCA-BEGIN`` marks the beginning of a code region. A comment starting with |
| the substring ``LLVM-MCA-END`` marks the end of a code region. For example: |
| |
| .. code-block:: none |
| |
| # LLVM-MCA-BEGIN My Code Region |
| ... |
| # LLVM-MCA-END |
| |
| Multiple regions can be specified provided that they do not overlap. A code |
| region can have an optional description. If no user-defined region is specified, |
| then :program:`llvm-mca` assumes a default region which contains every |
| instruction in the input file. Every region is analyzed in isolation, and the |
| final performance report is the union of all the reports generated for every |
| code region. |
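
The region rules above can be illustrated with a short Python sketch. It is not how :program:`llvm-mca` parses its input; the helper name ``split_regions`` is hypothetical, and the sketch assumes that comments start with ``#`` and that an unterminated region is an error.

```python
def split_regions(asm_text):
    """Split assembly text into (description, instructions) regions.

    Hypothetical helper that only mirrors the marker rules described in
    the text. Assumes '#' starts a comment.
    """
    regions = []
    current = None       # (description, lines) of the open region, if any
    saw_marker = False
    all_lines = []
    for raw in asm_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            comment = line[1:].strip()
            if comment.startswith("LLVM-MCA-BEGIN"):
                if current is not None:
                    raise ValueError("regions must not overlap")
                saw_marker = True
                description = comment[len("LLVM-MCA-BEGIN"):].strip()
                current = (description, [])
            elif comment.startswith("LLVM-MCA-END"):
                if current is None:
                    raise ValueError("LLVM-MCA-END without a matching BEGIN")
                regions.append(current)
                current = None
            continue
        all_lines.append(line)
        if current is not None:
            current[1].append(line)
    if current is not None:
        # Assumption of this sketch: an unterminated region is rejected.
        raise ValueError("unterminated region")
    if not saw_marker:
        # No user-defined region: one default region with every instruction.
        return [("", all_lines)]
    return regions
```

Instructions outside any marked region are dropped once at least one region is defined, while input without markers yields a single default region.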
| |
| Inline assembly directives may be used from source code to annotate the |
| assembly text: |
| |
| .. code-block:: c++ |
| |
| int foo(int a, int b) { |
| __asm volatile("# LLVM-MCA-BEGIN foo"); |
| a += 42; |
| __asm volatile("# LLVM-MCA-END"); |
| a *= b; |
| return a; |
| } |
| |
| So for example, you can compile code with clang, output assembly, and pipe it |
| directly into llvm-mca for analysis: |
| |
| .. code-block:: bash |
| |
| $ clang foo.c -O2 -target x86_64-unknown-unknown -S -o - | llvm-mca -mcpu=btver2 |
| |
| Or for Intel syntax: |
| |
| .. code-block:: bash |
| |
| $ clang foo.c -O2 -target x86_64-unknown-unknown -mllvm -x86-asm-syntax=intel -S -o - | llvm-mca -mcpu=btver2 |
| |
| OPTIONS |
| ------- |
| |
| If ``input`` is "``-``" or omitted, :program:`llvm-mca` reads from standard |
| input. Otherwise, it will read from the specified filename. |
| |
| If the :option:`-o` option is omitted, then :program:`llvm-mca` will send its output |
| to standard output if the input is from standard input. If the :option:`-o` |
| option specifies "``-``", then the output will also be sent to standard output. |
| |
| |
| .. option:: -help |
| |
| Print a summary of command line options. |
| |
| .. option:: -mtriple=<target triple> |
| |
| Specify a target triple string. |
| |
| .. option:: -march=<arch> |
| |
| Specify the architecture for which to analyze the code. It defaults to the |
| host default target. |
| |
| .. option:: -mcpu=<cpuname> |
| |
| Specify the processor for which to analyze the code. By default, the CPU name |
| is autodetected from the host. |
| |
| .. option:: -output-asm-variant=<variant id> |
| |
| Specify the output assembly variant for the report generated by the tool. |
| On x86, possible values are [0, 1]. A value of 0 selects the AT&T assembly |
| format for the code printed by the tool in the analysis report; a value of 1 |
| selects the Intel format. |
| |
| .. option:: -dispatch=<width> |
| |
| Specify a different dispatch width for the processor. The dispatch width |
| defaults to the ``IssueWidth`` field of the processor scheduling model. If |
| width is zero, then the default dispatch width is used. |
| |
| .. option:: -register-file-size=<size> |
| |
| Specify the size of the register file. When specified, this flag limits how |
| many physical registers are available for register renaming purposes. A value |
| of zero for this flag means "unlimited number of physical registers". |
| |
| .. option:: -iterations=<number of iterations> |
| |
| Specify the number of iterations to run. If this flag is set to 0, then the |
| tool sets the number of iterations to a default value (i.e. 100). |
| |
| .. option:: -noalias=<bool> |
| |
| If set, the tool assumes that loads and stores don't alias. This is the |
| default behavior. |
| |
| .. option:: -lqueue=<load queue size> |
| |
| Specify the size of the load queue in the load/store unit emulated by the tool. |
| By default, the tool assumes an unbounded number of entries in the load queue. |
| A value of zero for this flag is ignored, and the default load queue size is |
| used instead. |
| |
| .. option:: -squeue=<store queue size> |
| |
| Specify the size of the store queue in the load/store unit emulated by the |
| tool. By default, the tool assumes an unbounded number of entries in the store |
| queue. A value of zero for this flag is ignored, and the default store queue |
| size is used instead. |
| |
| .. option:: -timeline |
| |
| Enable the timeline view. |
| |
| .. option:: -timeline-max-iterations=<iterations> |
| |
| Limit the number of iterations to print in the timeline view. By default, the |
| timeline view prints information for up to 10 iterations. |
| |
| .. option:: -timeline-max-cycles=<cycles> |
| |
| Limit the number of cycles in the timeline view. By default, the number of |
| cycles is set to 80. |
| |
| .. option:: -resource-pressure |
| |
| Enable the resource pressure view. This is enabled by default. |
| |
| .. option:: -register-file-stats |
| |
| Enable register file usage statistics. |
| |
| .. option:: -dispatch-stats |
| |
| Enable extra dispatch statistics. This view collects and analyzes instruction |
| dispatch events, as well as static/dynamic dispatch stall events. This view |
| is disabled by default. |
| |
| .. option:: -scheduler-stats |
| |
| Enable extra scheduler statistics. This view collects and analyzes instruction |
| issue events. This view is disabled by default. |
| |
| .. option:: -retire-stats |
| |
| Enable extra retire control unit statistics. This view is disabled by default. |
| |
| .. option:: -instruction-info |
| |
| Enable the instruction info view. This is enabled by default. |
| |
| .. option:: -all-stats |
| |
| Print all hardware statistics. This enables extra statistics related to the |
| dispatch logic, the hardware schedulers, the register file(s), and the retire |
| control unit. This option is disabled by default. |
| |
| .. option:: -all-views |
| |
| Enable all the views. |
| |
| .. option:: -instruction-tables |
| |
| Print resource pressure information based on the static information |
| available from the processor model. This differs from the resource pressure |
| view because it doesn't require that the code is simulated. It instead prints |
| the theoretical uniform distribution of resource pressure for every |
| instruction in sequence. |
| |
| |
| EXIT STATUS |
| ----------- |
| |
| :program:`llvm-mca` returns 0 on success. Otherwise, an error message is printed |
| to standard error, and the tool returns 1. |
| |
| HOW LLVM-MCA WORKS |
| ------------------ |
| |
| :program:`llvm-mca` takes assembly code as input. The assembly code is parsed |
| into a sequence of MCInst with the help of the existing LLVM target assembly |
| parsers. The parsed sequence of MCInst is then analyzed by a ``Pipeline`` module |
| to generate a performance report. |
| |
| The Pipeline module simulates the execution of the machine code sequence in a |
| loop of iterations (default is 100). During this process, the pipeline collects |
| a number of execution related statistics. At the end of this process, the |
| pipeline generates and prints a report from the collected statistics. |
| |
| Here is an example of a performance report generated by the tool for a |
| dot-product of two packed float vectors of four elements. The analysis is |
| conducted for target x86, cpu btver2. The report can be reproduced with the |
| following command, using the example located at |
| ``test/tools/llvm-mca/X86/BtVer2/dot-product.s``: |
| |
| .. code-block:: bash |
| |
| $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 dot-product.s |
| |
| .. code-block:: none |
| |
| Iterations: 300 |
| Instructions: 900 |
| Total Cycles: 610 |
| Dispatch Width: 2 |
| IPC: 1.48 |
| Block RThroughput: 2.0 |
| |
| |
| Instruction Info: |
| [1]: #uOps |
| [2]: Latency |
| [3]: RThroughput |
| [4]: MayLoad |
| [5]: MayStore |
| [6]: HasSideEffects (U) |
| |
| [1] [2] [3] [4] [5] [6] Instructions: |
| 1 2 1.00 vmulps %xmm0, %xmm1, %xmm2 |
| 1 3 1.00 vhaddps %xmm2, %xmm2, %xmm3 |
| 1 3 1.00 vhaddps %xmm3, %xmm3, %xmm4 |
| |
| |
| Resources: |
| [0] - JALU0 |
| [1] - JALU1 |
| [2] - JDiv |
| [3] - JFPA |
| [4] - JFPM |
| [5] - JFPU0 |
| [6] - JFPU1 |
| [7] - JLAGU |
| [8] - JMul |
| [9] - JSAGU |
| [10] - JSTC |
| [11] - JVALU0 |
| [12] - JVALU1 |
| [13] - JVIMUL |
| |
| |
| Resource pressure per iteration: |
| [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] |
| - - - 2.00 1.00 2.00 1.00 - - - - - - - |
| |
| Resource pressure by instruction: |
| [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] Instructions: |
| - - - - 1.00 - 1.00 - - - - - - - vmulps %xmm0, %xmm1, %xmm2 |
| - - - 1.00 - 1.00 - - - - - - - - vhaddps %xmm2, %xmm2, %xmm3 |
| - - - 1.00 - 1.00 - - - - - - - - vhaddps %xmm3, %xmm3, %xmm4 |
| |
| According to this report, the dot-product kernel has been executed 300 times, |
| for a total of 900 dynamically executed instructions. |
| |
| The report is structured in three main sections. The first section collects a |
| few performance numbers; the goal of this section is to give a very quick |
| overview of the performance throughput. In this example, the two important |
| performance indicators are **IPC** and **Block RThroughput** (Block Reciprocal |
| Throughput). |
| |
| IPC is computed by dividing the total number of simulated instructions by the total |
| number of cycles. A delta between Dispatch Width and IPC is an indicator of a |
| performance issue. In the absence of loop-carried data dependencies, the |
| observed IPC tends to a theoretical maximum which can be computed by dividing |
| the number of instructions of a single iteration by the *Block RThroughput*. |
| |
| IPC is bounded from above by the dispatch width. That is because the dispatch |
| width limits the maximum size of a dispatch group. IPC is also limited by the |
| amount of hardware parallelism. The availability of hardware resources affects |
| the resource pressure distribution, and it limits the number of instructions |
| that can be executed in parallel every cycle. A delta between Dispatch |
| Width and the theoretical maximum IPC is an indicator of a performance |
| bottleneck caused by the lack of hardware resources. In general, the lower the |
| Block RThroughput, the better. |
| |
| In this example, ``Instructions per iteration/Block RThroughput`` is 1.50. Since |
| there are no loop-carried dependencies, the observed IPC is expected to approach |
| 1.50 when the number of iterations tends to infinity. The delta between the |
| Dispatch Width (2.00), and the theoretical maximum IPC (1.50) is an indicator of |
| a performance bottleneck caused by the lack of hardware resources, and the |
| *Resource pressure view* can help to identify the problematic resource usage. |
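
As a worked check of the numbers above, the following Python sketch recomputes the observed IPC and the theoretical maximum IPC from the values in the report summary. The function names are illustrative only.

```python
def ipc(instructions, cycles):
    """Observed IPC: simulated instructions divided by simulated cycles."""
    return instructions / cycles


def max_ipc(instructions_per_iteration, block_rthroughput):
    """Theoretical maximum IPC in the absence of loop-carried dependencies."""
    return instructions_per_iteration / block_rthroughput


# Values taken from the report summary: 900 instructions, 610 cycles,
# 3 instructions per iteration, Block RThroughput of 2.0.
observed = ipc(900, 610)
ceiling = max_ipc(3, 2.0)

print(f"observed IPC: {observed:.2f}")             # prints "observed IPC: 1.48"
print(f"theoretical maximum IPC: {ceiling:.2f}")   # prints "theoretical maximum IPC: 1.50"
```

The observed value stays slightly below the ceiling because the simulation covers a finite number of iterations.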
| |
| The second section of the report shows the latency and reciprocal |
| throughput of every instruction in the sequence. That section also reports |
| extra information related to the number of micro opcodes, and opcode properties |
| (i.e., 'MayLoad', 'MayStore', and 'HasSideEffects'). |
| |
| The third section is the *Resource pressure view*. This view reports |
| the average number of resource cycles consumed every iteration by instructions |
| for every processor resource unit available on the target. Information is |
| structured in two tables. The first table reports the number of resource cycles |
| spent on average every iteration. The second table correlates the resource |
| cycles to the machine instruction in the sequence. For example, every iteration |
| of the instruction vmulps always executes on resource unit [6] |
| (JFPU1 - floating point pipeline #1), consuming an average of 1 resource cycle |
| per iteration. Note that on AMD Jaguar, vector floating-point multiply can |
| only be issued to pipeline JFPU1, while horizontal floating-point additions can |
| only be issued to pipeline JFPU0. |
| |
| The resource pressure view helps with identifying bottlenecks caused by high |
| usage of specific hardware resources. Situations with resource pressure mainly |
| concentrated on a few resources should, in general, be avoided. Ideally, |
| pressure should be uniformly distributed between multiple resources. |
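
The relationship between the two tables can be sketched in Python: summing the per-instruction rows of the second table reproduces the per-iteration row of the first. The data below is copied from the report; the helper names are hypothetical.

```python
RESOURCES = ["JALU0", "JALU1", "JDiv", "JFPA", "JFPM", "JFPU0", "JFPU1",
             "JLAGU", "JMul", "JSAGU", "JSTC", "JVALU0", "JVALU1", "JVIMUL"]

# Resource cycles per instruction, copied from the second table.
PRESSURE_BY_INSTRUCTION = {
    "vmulps %xmm0, %xmm1, %xmm2": {"JFPM": 1.0, "JFPU1": 1.0},
    "vhaddps %xmm2, %xmm2, %xmm3": {"JFPA": 1.0, "JFPU0": 1.0},
    "vhaddps %xmm3, %xmm3, %xmm4": {"JFPA": 1.0, "JFPU0": 1.0},
}


def pressure_per_iteration(by_instruction):
    """Sum the per-instruction resource cycles for every resource unit."""
    totals = {name: 0.0 for name in RESOURCES}
    for usage in by_instruction.values():
        for name, cycles in usage.items():
            totals[name] += cycles
    return totals


def busiest(totals):
    """Return the resource units with the highest pressure."""
    peak = max(totals.values())
    return sorted(name for name, cycles in totals.items() if cycles == peak)


totals = pressure_per_iteration(PRESSURE_BY_INSTRUCTION)
print(busiest(totals))  # prints "['JFPA', 'JFPU0']"
```

Here the two horizontal adds concentrate pressure on JFPA and JFPU0, which is the kind of imbalance the view is meant to expose.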
| |
| Timeline View |
| ^^^^^^^^^^^^^ |
| The timeline view produces a detailed report of each instruction's state |
| transitions through an instruction pipeline. This view is enabled by the |
| command line option ``-timeline``. As instructions transition through the |
| various stages of the pipeline, their states are depicted in the view report. |
| These states are represented by the following characters: |
| |
| * D : Instruction dispatched. |
| * e : Instruction executing. |
| * E : Instruction executed. |
| * R : Instruction retired. |
| * = : Instruction already dispatched, waiting to be executed. |
| * \- : Instruction executed, waiting to be retired. |
| |
| Below is the timeline view for a subset of the dot-product example located in |
| ``test/tools/llvm-mca/X86/BtVer2/dot-product.s`` and processed by |
| :program:`llvm-mca` using the following command: |
| |
| .. code-block:: bash |
| |
| $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=3 -timeline dot-product.s |
| |
| .. code-block:: none |
| |
| Timeline view: |
|                     012345 |
| Index     0123456789 |
| |
| [0,0]     DeeER.    .    .   vmulps %xmm0, %xmm1, %xmm2 |
| [0,1]     D==eeeER  .    .   vhaddps %xmm2, %xmm2, %xmm3 |
| [0,2]     .D====eeeER    .   vhaddps %xmm3, %xmm3, %xmm4 |
| [1,0]     .DeeE-----R    .   vmulps %xmm0, %xmm1, %xmm2 |
| [1,1]     . D=eeeE---R   .   vhaddps %xmm2, %xmm2, %xmm3 |
| [1,2]     . D====eeeER   .   vhaddps %xmm3, %xmm3, %xmm4 |
| [2,0]     .  DeeE-----R  .   vmulps %xmm0, %xmm1, %xmm2 |
| [2,1]     .  D====eeeER  .   vhaddps %xmm2, %xmm2, %xmm3 |
| [2,2]     .   D======eeeER   vhaddps %xmm3, %xmm3, %xmm4 |
| |
| |
| Average Wait times (based on the timeline view): |
| [0]: Executions |
| [1]: Average time spent waiting in a scheduler's queue |
| [2]: Average time spent waiting in a scheduler's queue while ready |
| [3]: Average time elapsed from WB until retire stage |
| |
| [0] [1] [2] [3] |
| 0. 3 1.0 1.0 3.3 vmulps %xmm0, %xmm1, %xmm2 |
| 1. 3 3.3 0.7 1.0 vhaddps %xmm2, %xmm2, %xmm3 |
| 2. 3 5.7 0.0 0.0 vhaddps %xmm3, %xmm3, %xmm4 |
| |
| The timeline view is interesting because it shows instruction state changes |
| during execution. It also gives an idea of how the tool processes instructions |
| executed on the target, and how their timing information might be calculated. |
| |
| The timeline view is structured in two tables. The first table shows |
| instructions changing state over time (measured in cycles); the second table |
| (named *Average Wait times*) reports useful timing statistics, which should |
| help diagnose performance bottlenecks caused by long data dependencies and |
| sub-optimal usage of hardware resources. |
| |
| An instruction in the timeline view is identified by a pair of indices, where |
| the first index identifies an iteration, and the second index is the |
| instruction index (i.e., where it appears in the code sequence). Since this |
| example was generated using 3 iterations (``-iterations=3``), the iteration |
| indices range from 0 to 2, inclusive. |
| |
| Excluding the first and last columns, the remaining columns are in cycles. |
| Cycles are numbered sequentially starting from 0. |
| |
| From the example output above, we know the following: |
| |
| * Instruction [1,0] was dispatched at cycle 1. |
| * Instruction [1,0] started executing at cycle 2. |
| * Instruction [1,0] reached the write back stage at cycle 4. |
| * Instruction [1,0] was retired at cycle 10. |
| |
| Instruction [1,0] (i.e., vmulps from iteration #1) does not have to wait in the |
| scheduler's queue for the operands to become available. By the time vmulps is |
| dispatched, operands are already available, and pipeline JFPU1 is ready to |
| serve another instruction. So the instruction can be immediately issued on the |
| JFPU1 pipeline. That is demonstrated by the fact that the instruction only |
| spent 1cy in the scheduler's queue. |
| |
| There is a gap of 5 cycles between the write-back stage and the retire event. |
| That is because instructions must retire in program order, so [1,0] has to wait |
| for [0,2] to be retired first (i.e., it has to wait until cycle 10). |
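
The reading of row [1,0] can be reproduced with a minimal Python sketch. It assumes the row string contains only the cycle columns, so that the character at index 0 corresponds to cycle 0; ``decode_row`` is a hypothetical helper, not part of the tool.

```python
def decode_row(row):
    """Return the cycle of each main event in one timeline row.

    `row` holds only the cycle columns: index 0 of the string is cycle 0.
    """
    return {
        "dispatched": row.index("D"),
        "issued": row.index("e"),      # first cycle of execution
        "executed": row.index("E"),    # write-back stage reached
        "retired": row.index("R"),
    }


# Cycle columns of instruction [1,0] from the timeline view.
events = decode_row(".DeeE-----R    .")
print(events)
# prints "{'dispatched': 1, 'issued': 2, 'executed': 4, 'retired': 10}"

# Cycles spent between write-back and retire (the '-' characters).
gap = events["retired"] - events["executed"] - 1
print(gap)  # prints "5"
```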
| |
| In the example, all instructions are in a RAW (Read After Write) dependency |
| chain. Register %xmm2 written by vmulps is immediately used by the first |
| vhaddps, and register %xmm3 written by the first vhaddps is used by the second |
| vhaddps. Long data dependencies negatively impact the ILP (Instruction Level |
| Parallelism). |
| |
| In the dot-product example, there are anti-dependencies introduced by |
| instructions from different iterations. However, those dependencies can be |
| removed at register renaming stage (at the cost of allocating register aliases, |
| and therefore consuming physical registers). |
| |
| Table *Average Wait times* helps diagnose performance issues that are caused by |
| the presence of long latency instructions and potentially long data dependencies |
| which may limit the ILP. Note that :program:`llvm-mca`, by default, assumes at |
| least 1cy between the dispatch event and the issue event. |
| |
| When the performance is limited by data dependencies and/or long latency |
| instructions, the number of cycles spent while in the *ready* state is expected |
| to be very small when compared with the total number of cycles spent in the |
| scheduler's queue. The difference between the two counters is a good indicator |
| of how large an impact data dependencies had on the execution of the |
| instructions. When performance is mostly limited by the lack of hardware |
| resources, the delta between the two counters is small. However, the number of |
| cycles spent in the queue tends to be larger (i.e., more than 1-3cy), |
| especially when compared to other low latency instructions. |
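
Columns [1] and [3] of the *Average Wait times* table can be recomputed from the timeline rows, as in the Python sketch below. It assumes that the time in the scheduler's queue is counted from the dispatch cycle up to the first execution cycle, which is consistent with the 1cy minimum mentioned above. Column [2] cannot be derived from the timeline characters alone and is left out.

```python
def wait_cycles(row):
    """Return (cycles in the scheduler's queue, cycles from WB until retire)."""
    in_queue = row.index("e") - row.index("D")
    before_retire = row.index("R") - row.index("E") - 1
    return in_queue, before_retire


def averages(rows):
    """Average both counters over all iterations of one instruction."""
    waits = [wait_cycles(row) for row in rows]
    count = len(waits)
    return (sum(w[0] for w in waits) / count,
            sum(w[1] for w in waits) / count)


# Cycle columns copied from the timeline view (trailing padding removed).
VMULPS = ["DeeER", ".DeeE-----R", ".  DeeE-----R"]
SECOND_VHADDPS = [".D====eeeER", ". D====eeeER", ".   D======eeeER"]

for name, rows in (("vmulps", VMULPS), ("second vhaddps", SECOND_VHADDPS)):
    queue, retire = averages(rows)
    print(f"{name}: [1]={queue:.1f} [3]={retire:.1f}")
# prints "vmulps: [1]=1.0 [3]=3.3"
# prints "second vhaddps: [1]=5.7 [3]=0.0"
```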
| |
| Extra Statistics to Further Diagnose Performance Issues |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| The ``-all-stats`` command line option enables extra statistics and performance |
| counters for the dispatch logic, the reorder buffer, the retire control unit, |
| and the register file. |
| |
| Below is an example of ``-all-stats`` output generated by MCA for the |
| dot-product example discussed in the previous sections. |
| |
| .. code-block:: none |
| |
| Dynamic Dispatch Stall Cycles: |
| RAT - Register unavailable: 0 |
| RCU - Retire tokens unavailable: 0 |
| SCHEDQ - Scheduler full: 272 |
| LQ - Load queue full: 0 |
| SQ - Store queue full: 0 |
| GROUP - Static restrictions on the dispatch group: 0 |
| |
| |
| Dispatch Logic - number of cycles where we saw N instructions dispatched: |
| [# dispatched], [# cycles] |
| 0, 24 (3.9%) |
| 1, 272 (44.6%) |
| 2, 314 (51.5%) |
| |
| |
| Schedulers - number of cycles where we saw N instructions issued: |
| [# issued], [# cycles] |
| 0, 7 (1.1%) |
| 1, 306 (50.2%) |
| 2, 297 (48.7%) |
| |
| |
| Scheduler's queue usage: |
| JALU01, 0/20 |
| JFPU01, 18/18 |
| JLSAGU, 0/12 |
| |
| |
| Retire Control Unit - number of cycles where we saw N instructions retired: |
| [# retired], [# cycles] |
| 0, 109 (17.9%) |
| 1, 102 (16.7%) |
| 2, 399 (65.4%) |
| |
| |
| Register File statistics: |
| Total number of mappings created: 900 |
| Max number of mappings used: 35 |
| |
| * Register File #1 -- JFpuPRF: |
| Number of physical registers: 72 |
| Total number of mappings created: 900 |
| Max number of mappings used: 35 |
| |
| * Register File #2 -- JIntegerPRF: |
| Number of physical registers: 64 |
| Total number of mappings created: 0 |
| Max number of mappings used: 0 |
| |
| If we look at the *Dynamic Dispatch Stall Cycles* table, we see the counter for |
| SCHEDQ reports 272 cycles. This counter is incremented every time the dispatch |
| logic is unable to dispatch a group of two instructions because the scheduler's |
| queue is full. |
| |
| Looking at the *Dispatch Logic* table, we see that the pipeline was only able |
| to dispatch two instructions 51.5% of the time. The dispatch group was limited |
| to one instruction 44.6% of the cycles, which corresponds to 272 cycles. The |
| dispatch statistics are displayed by either using the command option |
| ``-all-stats`` or ``-dispatch-stats``. |
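
The histogram figures are internally consistent, which the following Python sketch checks: the cycle counts add up to the 610 simulated cycles, and the group sizes weighted by their cycle counts add up to the 900 dispatched instructions. The helper name is illustrative.

```python
TOTAL_CYCLES = 610

# Dispatch group size -> number of cycles, copied from the Dispatch Logic table.
DISPATCH_HISTOGRAM = {0: 24, 1: 272, 2: 314}


def summarize(histogram, total_cycles):
    """Return the share of cycles per group size and the instruction total."""
    if sum(histogram.values()) != total_cycles:
        raise ValueError("histogram does not cover every simulated cycle")
    shares = {size: round(100 * cycles / total_cycles, 1)
              for size, cycles in histogram.items()}
    instructions = sum(size * cycles for size, cycles in histogram.items())
    return shares, instructions


shares, instructions = summarize(DISPATCH_HISTOGRAM, TOTAL_CYCLES)
print(shares)        # prints "{0: 3.9, 1: 44.6, 2: 51.5}"
print(instructions)  # prints "900"
```

The same check applies to the *Schedulers* and *Retire Control Unit* histograms, whose rows also sum to 610 cycles and 900 instructions.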
| |
| The next table, *Schedulers*, presents a histogram of the number of |
| instructions issued per cycle. In this case, of the 610 simulated cycles, a |
| single instruction was issued in 306 cycles (50.2%), two instructions were |
| issued in 297 cycles (48.7%), and there were 7 cycles where no instructions |
| were issued. |
| |
| The *Scheduler's queue usage* table shows the maximum number of buffer |
| entries (i.e., scheduler queue entries) used at runtime. Resource JFPU01 |
| reached its maximum (18 of 18 queue entries). Note that AMD Jaguar implements |
| three schedulers: |
| |
| * JALU01 - A scheduler for ALU instructions. |
| * JFPU01 - A scheduler for floating point operations. |
| * JLSAGU - A scheduler for address generation. |
| |
| The dot-product is a kernel of three floating point instructions (a vector |
| multiply followed by two horizontal adds). That explains why only the floating |
| point scheduler appears to be used. |
| |
| A full scheduler queue is either caused by data dependency chains or by a |
| sub-optimal usage of hardware resources. Sometimes, resource pressure can be |
| mitigated by rewriting the kernel using different instructions that consume |
| different scheduler resources. Schedulers with a small queue are less resilient |
| to bottlenecks caused by the presence of long data dependencies. |
| The scheduler statistics are displayed by |
| using the command option ``-all-stats`` or ``-scheduler-stats``. |
| |
| The next table, *Retire Control Unit*, presents a histogram of the number of |
| instructions retired per cycle. In this case, of the 610 simulated cycles, two |
| instructions were retired during the same cycle 399 times (65.4%) and there |
| were 109 cycles where no instructions were retired. The retire statistics are |
| displayed by using the command option ``-all-stats`` or ``-retire-stats``. |
| |
| The last table presented is *Register File statistics*. Each physical register |
| file (PRF) used by the pipeline is presented in this table. In the case of AMD |
| Jaguar, there are two register files, one for floating-point registers |
| (JFpuPRF) and one for integer registers (JIntegerPRF). The table shows that of |
| the 900 instructions processed, there were 900 mappings created. Since this |
| dot-product example utilized only floating point registers, the JFpuPRF was |
| responsible for creating the 900 mappings. However, we see that the pipeline |
| only used a maximum of 35 of 72 available register slots at any given time. We |
| can conclude that the floating point PRF was the only register file used for |
| the example, and that it was never resource constrained. The register file |
| statistics are displayed by using the command option ``-all-stats`` or |
| ``-register-file-stats``. |
| |
| In this example, we can conclude that the IPC is mostly limited by data |
| dependencies, and not by resource pressure. |
| |
| Instruction Flow |
| ^^^^^^^^^^^^^^^^ |
| This section describes the instruction flow through MCA's default out-of-order |
| pipeline, as well as the functional units involved in the process. |
| |
| The default pipeline implements the following sequence of stages used to |
| process instructions. |
| |
| * Dispatch (Instruction is dispatched to the schedulers). |
| * Issue (Instruction is issued to the processor pipelines). |
| * Write Back (Instruction is executed, and results are written back). |
| * Retire (Instruction is retired; writes are architecturally committed). |
| |
| The default pipeline only models the out-of-order portion of a processor. |
| Therefore, the instruction fetch and decode stages are not modeled. Performance |
| bottlenecks in the frontend are not diagnosed. MCA assumes that instructions |
| have all been decoded and placed into a queue. Also, MCA does not model branch |
| prediction. |
| |
| Instruction Dispatch |
| """""""""""""""""""" |
| During the dispatch stage, instructions are picked in program order from a |
| queue of already decoded instructions, and dispatched in groups to the |
| simulated hardware schedulers. |
| |
| The size of a dispatch group depends on the availability of the simulated |
| hardware resources. The processor dispatch width defaults to the value |
| of the ``IssueWidth`` in LLVM's scheduling model. |
| |
| An instruction can be dispatched if: |
| |
| * The size of the dispatch group is smaller than processor's dispatch width. |
| * There are enough entries in the reorder buffer. |
| * There are enough physical registers to do register renaming. |
| * The schedulers are not full. |
| |
| Scheduling models can optionally specify which register files are available on |
| the processor. MCA uses that information to initialize register file |
| descriptors. Users can limit the number of physical registers that are |
| globally available for register renaming by using the command option |
| ``-register-file-size``. A value of zero for this option means *unbounded*. |
| By knowing how many registers are available for renaming, MCA can predict |
| dispatch stalls caused by the lack of registers. |
| |
| The number of reorder buffer entries consumed by an instruction depends on the |
| number of micro-opcodes specified by the target scheduling model. MCA's |
| reorder buffer's purpose is to track the progress of instructions that are |
| "in-flight," and to retire instructions in program order. The number of |
| entries in the reorder buffer defaults to the ``MicroOpBufferSize`` provided by |
| the target scheduling model. |
| |
| Instructions that are dispatched to the schedulers consume scheduler buffer |
| entries. :program:`llvm-mca` queries the scheduling model to determine the set |
| of buffered resources consumed by an instruction. Buffered resources are |
| treated like scheduler resources. |
| |
| Instruction Issue |
| """"""""""""""""" |
| Each processor scheduler implements a buffer of instructions. An instruction |
| has to wait in the scheduler's buffer until input register operands become |
| available. Only at that point does the instruction become eligible for |
| execution, and it may then be issued (potentially out-of-order). |
| Instruction latencies are computed by :program:`llvm-mca` with the help of the |
| scheduling model. |
| |
| :program:`llvm-mca`'s scheduler is designed to simulate multiple processor |
| schedulers. The scheduler is responsible for tracking data dependencies, and |
| dynamically selecting which processor resources are consumed by instructions. |
| It delegates the management of processor resource units and resource groups to a |
| resource manager. The resource manager is responsible for selecting resource |
| units that are consumed by instructions. For example, if an instruction |
| consumes 1cy of a resource group, the resource manager selects one of the |
| available units from the group; by default, the resource manager uses a |
| round-robin selector to guarantee that resource usage is uniformly distributed |
| between all units of a group. |
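
A round-robin selector over the units of a resource group might look like the Python sketch below. It is a simplified illustration rather than the tool's resource manager; the class name and the use of JALU0 and JALU1 as the units of one group are assumptions.

```python
class RoundRobinGroup:
    """Pick units of a resource group in rotation (illustrative only)."""

    def __init__(self, units):
        self.units = list(units)
        self.next_index = 0

    def select(self, busy=()):
        """Return the next free unit in rotation, or None if all are busy."""
        for offset in range(len(self.units)):
            index = (self.next_index + offset) % len(self.units)
            unit = self.units[index]
            if unit not in busy:
                # Resume the rotation after the unit that was just picked.
                self.next_index = (index + 1) % len(self.units)
                return unit
        return None


group = RoundRobinGroup(["JALU0", "JALU1"])
picks = [group.select() for _ in range(4)]
print(picks)  # prints "['JALU0', 'JALU1', 'JALU0', 'JALU1']"
```

Over many selections each unit is chosen equally often, which is what keeps the reported resource pressure evenly spread across a group.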
| |
| :program:`llvm-mca`'s scheduler implements three instruction queues: |
| |
| * WaitQueue: a queue of instructions whose operands are not ready. |
| * ReadyQueue: a queue of instructions ready to execute. |
| * IssuedQueue: a queue of instructions executing. |
| |
| Depending on the operand availability, instructions that are dispatched to the |
| scheduler are either placed into the WaitQueue or into the ReadyQueue. |
| |
| Every cycle, the scheduler checks if instructions can be moved from the |
| WaitQueue to the ReadyQueue, and if instructions from the ReadyQueue can be |
| issued to the underlying pipelines. The algorithm prioritizes older instructions |
| over younger instructions. |
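
The interaction between the three queues can be sketched as follows. This Python model is a simplification under stated assumptions: instructions are identified by their program-order index (smaller means older), ``operands_ready`` is a caller-supplied predicate, and ``issue_width`` is a hypothetical per-cycle issue limit standing in for pipeline availability.

```python
class Scheduler:
    """Toy model of the WaitQueue, ReadyQueue and IssuedQueue."""

    def __init__(self, issue_width):
        self.issue_width = issue_width  # assumption: cap on issues per cycle
        self.wait_queue = []            # operands not ready
        self.ready_queue = []           # ready to execute
        self.issued_queue = []          # executing

    def dispatch(self, inst, operands_ready):
        """Place a dispatched instruction according to operand availability."""
        if operands_ready(inst):
            self.ready_queue.append(inst)
        else:
            self.wait_queue.append(inst)

    def cycle(self, operands_ready):
        """Run one cycle and return the instructions issued in it."""
        # Promote instructions whose operands became available.
        still_waiting = []
        for inst in self.wait_queue:
            if operands_ready(inst):
                self.ready_queue.append(inst)
            else:
                still_waiting.append(inst)
        self.wait_queue = still_waiting
        # Issue the oldest ready instructions first.
        self.ready_queue.sort()
        issued = self.ready_queue[:self.issue_width]
        self.ready_queue = self.ready_queue[self.issue_width:]
        self.issued_queue.extend(issued)
        return issued


ready = {2}
sched = Scheduler(issue_width=1)
for inst in (0, 1, 2):
    sched.dispatch(inst, lambda i: i in ready)
ready.update({0, 1})                            # operands of 0 and 1 arrive
first = sched.cycle(lambda i: i in ready)
second = sched.cycle(lambda i: i in ready)
print(first, second)  # prints "[0] [1]"
```

Instruction 2 became ready first, but the older instructions 0 and 1 are issued ahead of it once their operands are available.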
| |
| Write-Back and Retire Stage |
| """"""""""""""""""""""""""" |
| Issued instructions are moved from the ReadyQueue to the IssuedQueue. There, |
| instructions wait until they reach the write-back stage. At that point, they |
| get removed from the queue and the retire control unit is notified. |
| |
| When an instruction is executed, the retire control unit flags it as |
| "ready to retire." |
| |
| Instructions are retired in program order. The register file is notified of |
| the retirement so that it can free the physical registers that were allocated |
| for the instruction during the register renaming stage. |
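
In-order retirement can be illustrated with a small Python sketch that derives retire cycles from write-back cycles. Two things are assumptions of the sketch rather than documented behavior: an instruction retires no earlier than the cycle after its write-back, and at most two instructions retire per cycle (the largest count in the retire histogram of the dot-product example).

```python
def retire_cycles(executed_at, retire_width=2):
    """Return the retire cycle of each instruction, given in program order.

    executed_at[i] is the cycle in which instruction i reaches write-back.
    retire_width is an assumed limit on instructions retired per cycle.
    """
    result = []
    cycle, used = 0, 0   # current retire cycle and slots already used in it
    for done in executed_at:
        # Retire after write-back, and never before an older instruction.
        earliest = max(done + 1, cycle)
        if earliest > cycle:
            cycle, used = earliest, 0
        if used == retire_width:
            cycle, used = cycle + 1, 0
        result.append(cycle)
        used += 1
    return result


# Write-back cycles of the nine instructions in the timeline view example.
WRITE_BACK = [3, 6, 9, 4, 7, 10, 6, 11, 14]
print(retire_cycles(WRITE_BACK))
# prints "[4, 7, 10, 10, 11, 11, 12, 12, 15]"
```

Under these assumptions the result matches the ``R`` columns of the timeline view, including instruction [1,0], which reaches write-back at cycle 4 but retires at cycle 10.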
| |
| Load/Store Unit and Memory Consistency Model |
| """""""""""""""""""""""""""""""""""""""""""" |
| To simulate an out-of-order execution of memory operations, :program:`llvm-mca` |
| utilizes a simulated load/store unit (LSUnit) to simulate the speculative |
| execution of loads and stores. |
| |
| Each load (or store) consumes an entry in the load (or store) queue. Users can |
| specify flags ``-lqueue`` and ``-squeue`` to limit the number of entries in the |
| load and store queues respectively. The queues are unbounded by default. |
| |
| The LSUnit implements a relaxed consistency model for memory loads and stores. |
| The rules are: |
| |
| 1. A younger load is allowed to pass an older load only if there are no |
| intervening stores or barriers between the two loads. |
| 2. A younger load is allowed to pass an older store provided that the load does |
| not alias with the store. |
| 3. A younger store is not allowed to pass an older store. |
| 4. A younger store is not allowed to pass an older load. |
| |
| By default, the LSUnit optimistically assumes that loads do not alias with |
| store operations (``-noalias=true``). Under this assumption, younger loads are |
| always allowed to pass older stores. Essentially, the LSUnit does not attempt |
| to run any alias analysis to predict when loads and stores do not alias with |
| each other. |
| |
| Note that, in the case of write-combining memory, rule 3 could be relaxed to |
| allow reordering of non-aliasing store operations. That being said, at the |
| moment, there is no way to further relax the memory model (``-noalias`` is the |
| only option). Essentially, there is no option to specify a different memory |
| type (e.g., write-back, write-combining, write-through, etc.) and consequently |
| to weaken, or strengthen, the memory model. |
| |
| Other limitations are: |
| |
| * The LSUnit does not know when store-to-load forwarding may occur. |
| * The LSUnit does not know anything about cache hierarchy and memory types. |
| * The LSUnit does not know how to identify serializing operations and memory |
| fences. |
| |
| The LSUnit does not attempt to predict if a load or store hits or misses the L1 |
| cache. It only knows if an instruction "MayLoad" and/or "MayStore." For |
| loads, the scheduling model provides an "optimistic" load-to-use latency (which |
| usually matches the load-to-use latency for when there is a hit in the L1D). |
| |
| :program:`llvm-mca` does not know about serializing operations or |
| memory-barrier-like instructions. The LSUnit conservatively assumes that an |
| instruction which has both "MayLoad" and unmodeled side effects behaves like a |
| "soft" load-barrier. That is, it serializes loads without forcing a flush of |
| the load queue. Similarly, instructions that "MayStore" and have unmodeled side |
| effects are treated like store barriers. A full memory barrier is a "MayLoad" |
| and "MayStore" instruction with unmodeled side effects. This is inaccurate, but |
| it is the best that we can do at the moment with the current information |
| available in LLVM. |
| |
| A load/store barrier consumes one entry of the load/store queue. A load/store |
| barrier enforces ordering of loads/stores. A younger load cannot pass a load |
| barrier. Also, a younger store cannot pass a store barrier. A younger load |
| has to wait for the memory/load barrier to execute. A load/store barrier is |
| "executed" when it becomes the oldest entry in the load/store queue(s). That |
| also means, by construction, all of the older loads/stores have been executed. |
| |
| In conclusion, the full set of load/store consistency rules is: |
| |
| #. A store may not pass a previous store. |
| #. A store may not pass a previous load (regardless of ``-noalias``). |
| #. A store has to wait until an older store barrier is fully executed. |
| #. A load may pass a previous load. |
| #. A load may not pass a previous store unless ``-noalias`` is set. |
| #. A load has to wait until an older load barrier is fully executed. |
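
The six rules can be encoded as a pairwise check between a younger memory operation and an older one that is still pending. The Python sketch below is an illustration only: ``MemOp`` and ``may_pass`` are hypothetical names, barriers are modeled as flags on an operation, and the sketch does not model the queues themselves.

```python
from dataclasses import dataclass


@dataclass
class MemOp:
    """A memory operation as seen by this sketch (hypothetical model)."""
    is_load: bool = False
    is_store: bool = False
    is_load_barrier: bool = False
    is_store_barrier: bool = False


def may_pass(younger, older, noalias=True):
    """True if `younger` may execute before the still pending `older`."""
    if younger.is_store:
        # Rules 1-3: a store never passes an older store, an older load,
        # or an older store barrier.
        if older.is_store or older.is_load or older.is_store_barrier:
            return False
    if younger.is_load:
        # Rule 6: a load waits for an older load barrier.
        if older.is_load_barrier:
            return False
        # Rule 5: a load passes an older store only under -noalias.
        if older.is_store and not noalias:
            return False
        # Rule 4: a load may pass an older load.
    return True


load = MemOp(is_load=True)
store = MemOp(is_store=True)
load_barrier = MemOp(is_load=True, is_load_barrier=True)

print(may_pass(load, store))                 # prints "True"
print(may_pass(load, store, noalias=False))  # prints "False"
print(may_pass(store, load))                 # prints "False"
print(may_pass(load, load_barrier))          # prints "False"
```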