=====================================
Performance Tips for Frontend Authors
=====================================

.. contents::
   :local:
   :depth: 2

Abstract
========

The intended audience of this document is developers of language frontends
targeting LLVM IR. This document is home to a collection of tips on how to
generate IR that optimizes well.

IR Best Practices
=================

As with any optimizer, LLVM has its strengths and weaknesses. In some cases,
surprisingly small changes in the source IR can have a large effect on the
generated code.

Beyond the specific items on the list below, it's worth noting that the most
mature frontend for LLVM is Clang. As a result, the further your IR gets from
what Clang might emit, the less likely it is to be effectively optimized. It
can often be useful to write a quick C program with the semantics you're trying
to model and see what decisions Clang's IRGen makes about what IR to emit.
Studying Clang's CodeGen directory can also be a good source of ideas. Note
that Clang and LLVM are explicitly version-locked, so you'll need to make sure
you're using a Clang built from the same revision or release as the LLVM
library you're using. As always, it's *strongly* recommended that you track
tip-of-tree development, particularly during bring-up of a new project.

The Basics
^^^^^^^^^^

#. Make sure that your Modules contain both a data layout specification and
   target triple. Without these pieces, none of the target-specific
   optimizations will be enabled. This can have a major effect on the
   generated code quality.

#. For each function or global emitted, use the most private linkage type
   possible (private, internal or linkonce_odr preferably). Doing so will
   make LLVM's inter-procedural optimizations much more effective.

#. Avoid high in-degree basic blocks (e.g. basic blocks with dozens or hundreds
   of predecessors). Among other issues, the register allocator is known to
   perform badly when confronted with such structures. The only exception to
   this guidance is that a unified return block with high in-degree is fine.
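
As a minimal sketch, a module with these pieces might begin as follows. The
datalayout string shown is illustrative for x86-64 Linux; use the string
appropriate to your actual target.

.. code-block:: llvm

   ; Both a datalayout and a triple enable target-specific optimization.
   target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
   target triple = "x86_64-unknown-linux-gnu"

   ; internal linkage lets inter-procedural optimization reason about
   ; (and possibly eliminate) this helper.
   define internal i32 @helper(i32 %x) {
   entry:
     %r = add nsw i32 %x, 1
     ret i32 %r
   }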

Use of allocas
^^^^^^^^^^^^^^

An alloca instruction can be used to represent a function scoped stack slot,
but can also represent dynamic frame expansion. When representing function
scoped variables or locations, placing alloca instructions at the beginning of
the entry block should be preferred. In particular, place them before any
call instructions. Call instructions might get inlined and replaced with
multiple basic blocks. The end result is that a following alloca instruction
would no longer be in the entry basic block afterward.

The SROA (Scalar Replacement Of Aggregates) and Mem2Reg passes only attempt
to eliminate alloca instructions that are in the entry basic block. Given
that SSA is the canonical form expected by much of the optimizer, if allocas
can not be eliminated by Mem2Reg or SROA, the optimizer is likely to be less
effective than it could be.
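
The preferred placement might be sketched as follows (the ``@init`` helper is
hypothetical):

.. code-block:: llvm

   declare void @init(i32*)

   define i32 @f() {
   entry:
     ; Allocas for function-scoped variables go at the top of the entry
     ; block, before any call which might later be inlined into multiple
     ; basic blocks.
     %x = alloca i32
     call void @init(i32* %x)
     %v = load i32, i32* %x
     ret i32 %v
   }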

Avoid loads and stores of large aggregate type
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

LLVM currently does not optimize loads and stores of large :ref:`aggregate
types <t_aggregate>` (i.e. structs and arrays) well. As an alternative,
consider loading individual fields from memory.

Aggregates that are smaller than the largest (performant) load or store
instruction supported by the targeted hardware are well supported. These can
be an effective way to represent collections of small packed fields.
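
For a hypothetical two-field struct, field-by-field access might be sketched
as:

.. code-block:: llvm

   %pair = type { i32, i32 }

   define i32 @sum(%pair* %p) {
   entry:
     ; Prefer loading the fields individually ...
     %ap = getelementptr inbounds %pair, %pair* %p, i32 0, i32 0
     %bp = getelementptr inbounds %pair, %pair* %p, i32 0, i32 1
     %a = load i32, i32* %ap
     %b = load i32, i32* %bp
     %r = add i32 %a, %b
     ret i32 %r
   }

   ; ... rather than loading the whole aggregate:
   ;   %v = load %pair, %pair* %p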

Prefer zext over sext when legal
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

On some architectures (X86_64 is one), sign extension can involve an extra
instruction whereas zero extension can be folded into a load. LLVM will try to
replace a sext with a zext when it can be proven safe, but if you have
information in your source language about the range of an integer value, it
can be profitable to use a zext rather than a sext.

Alternatively, you can :ref:`specify the range of the value using metadata
<range-metadata>` and LLVM can do the sext to zext conversion for you.
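
Both options might be sketched as follows (the range bound is arbitrary):

.. code-block:: llvm

   ; If the frontend knows %i is non-negative, emit a zext directly ...
   %wide = zext i32 %i to i64

   ; ... or attach range metadata to the producing load so LLVM can do
   ; the sext to zext conversion itself.
   %i2 = load i32, i32* %p, !range !0
   !0 = !{i32 0, i32 2048}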

Zext GEP indices to machine register width
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Internally, LLVM often promotes the width of GEP indices to machine register
width. When it does so, it will default to using sign extension (sext)
operations for safety. If your source language provides information about
the range of the index, you may wish to manually extend indices to machine
register width using a zext instruction.
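
On a 64-bit target this might be sketched as:

.. code-block:: llvm

   ; Extending the index explicitly with zext avoids the conservative
   ; sext LLVM would otherwise insert during promotion.
   %idx64 = zext i32 %idx to i64
   %addr = getelementptr inbounds float, float* %base, i64 %idx64
   %v = load float, float* %addr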

When to specify alignment
^^^^^^^^^^^^^^^^^^^^^^^^^

LLVM will always generate correct code if you don’t specify alignment, but may
generate inefficient code. For example, if you are targeting MIPS (or older
ARM ISAs) then the hardware does not handle unaligned loads and stores, and
so you will enter a trap-and-emulate path if you do a load or store with
lower-than-natural alignment. To avoid this, LLVM will emit a slower
sequence of loads, shifts and masks (or load-right + load-left on MIPS) for
all cases where the load / store does not have a sufficiently high alignment
in the IR.

The alignment is used to guarantee the alignment on allocas and globals,
though in most cases this is unnecessary (most targets have a sufficiently
high default alignment that they’ll be fine). It is also used to provide a
contract to the back end saying ‘either this load/store has this alignment, or
it is undefined behavior’. This means that the back end is free to emit
instructions that rely on that alignment (and mid-level optimizers are free to
perform transforms that require that alignment). For x86, it doesn’t make
much difference, as almost all instructions are alignment-independent. For
MIPS, it can make a big difference.

Note that if your loads and stores are atomic, the backend will be unable to
lower an under-aligned access into a sequence of natively aligned accesses.
As a result, alignment is mandatory for atomic loads and stores.
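
The alignment contract is expressed directly on the access:

.. code-block:: llvm

   ; Promise the backend that %p is at least 4-byte aligned; it may emit
   ; instructions which rely on this.
   %v = load i32, i32* %p, align 4

   ; An align 1 access is still correct, but may lower to a slower
   ; sequence on targets without unaligned loads.
   %u = load i32, i32* %q, align 1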

Other Things to Consider
^^^^^^^^^^^^^^^^^^^^^^^^

#. Use ptrtoint/inttoptr sparingly (they interfere with pointer aliasing
   analysis); prefer GEPs instead.

#. Prefer globals over inttoptr of a constant address - this gives you
   dereferenceability information. In MCJIT, use getSymbolAddress to provide
   the actual address.

#. Be wary of ordered and atomic memory operations. They are hard to optimize
   and may not be well optimized by the current optimizer. Depending on your
   source language, you may consider using fences instead.

#. If calling a function which is known to throw an exception (unwind), use
   an invoke with a normal destination which contains an unreachable
   instruction. This form conveys to the optimizer that the call returns
   abnormally. For an invoke which neither returns normally nor requires
   unwind code in the current function, you can use a noreturn call
   instruction if desired. This is generally not required because the
   optimizer will convert an invoke with an unreachable unwind destination
   to a call instruction.

#. Use profile metadata to indicate statically known cold paths, even if
   dynamic profiling information is not available. This can make a large
   difference in code placement and thus the performance of tight loops.

#. When generating code for loops, try to avoid terminating the header block
   of the loop earlier than necessary. If the terminator of the loop header
   block is a loop exiting conditional branch, the effectiveness of LICM will
   be limited for loads not in the header. (This is due to the fact that LLVM
   may not know such a load is safe to speculatively execute and thus can't
   lift an otherwise loop invariant load unless it can prove the exiting
   condition is not taken.) It can be profitable, in some cases, to emit such
   instructions into the header even if they are not used along a rarely
   executed path that exits the loop. This guidance specifically does not
   apply if the condition which terminates the loop header is itself
   invariant, or can be easily discharged by inspecting the loop index
   variables.

#. In hot loops, consider duplicating instructions from small basic blocks
   which end in highly predictable terminators into their successor blocks.
   If a hot successor block contains instructions which can be vectorized
   with the duplicated ones, this can provide a noticeable throughput
   improvement. Note that this is not always profitable and does involve a
   potentially large increase in code size.

#. When checking a value against a constant, emit the check using a consistent
   comparison type. The GVN pass *will* optimize redundant equalities even if
   the type of comparison is inverted, but GVN only runs late in the pipeline.
   As a result, you may miss the opportunity to run other important
   optimizations. Improvements to EarlyCSE to remove this issue are tracked
   in Bug 23333.

#. Avoid using arithmetic intrinsics unless you are *required* by your source
   language specification to emit a particular code sequence. The optimizer
   is quite good at reasoning about general control flow and arithmetic, but
   it is not anywhere near as strong at reasoning about the various
   intrinsics. If profitable for code generation purposes, the optimizer
   will likely form the intrinsics itself late in the optimization pipeline.
   It is *very* rarely profitable to emit these directly in the language
   frontend. This item explicitly includes the use of the :ref:`overflow
   intrinsics <int_overflow>`.

#. Avoid using the :ref:`assume intrinsic <int_assume>` until you've
   established that a) there's no other way to express the given fact and b)
   that fact is critical for optimization purposes. Assumes are a great
   prototyping mechanism, but they can have negative effects on both compile
   time and optimization effectiveness. The former is fixable with enough
   effort, but the latter is fairly fundamental to their designed purpose.
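
Two of the patterns above can be sketched in IR; the function names are
hypothetical:

.. code-block:: llvm

   declare void @throws()
   declare i32 @__gxx_personality_v0(...)

   define void @wrapper() personality i8* bitcast (i32 (...)* @__gxx_personality_v0 to i8*) {
   entry:
     ; @throws is known to always unwind; the unreachable normal
     ; destination conveys that this call never returns normally.
     invoke void @throws()
             to label %normal unwind label %lpad
   normal:
     unreachable
   lpad:
     %lp = landingpad { i8*, i32 }
             cleanup
     resume { i8*, i32 } %lp
   }

   ; A statically known cold path, annotated with profile metadata:
   ;   br i1 %err, label %cold, label %hot, !prof !0
   ;   !0 = !{!"branch_weights", i32 1, i32 2000}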


Describing Language Specific Properties
=======================================

When translating a source language to LLVM, finding ways to express concepts
and guarantees available in your source language which are not natively
provided by LLVM IR will greatly improve LLVM's ability to optimize your code.
As an example, C/C++'s ability to mark every add as "no signed wrap (nsw)" goes
a long way to assisting the optimizer in reasoning about loop induction
variables and thus generating more optimal code for loops.

The LLVM LangRef includes a number of mechanisms for annotating the IR with
additional semantic information. It is *strongly* recommended that you become
highly familiar with this document. The list below is intended to highlight a
couple of items of particular interest, but is by no means exhaustive.

Restricted Operation Semantics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

#. Add nsw/nuw flags as appropriate. Reasoning about overflow is
   generally hard for an optimizer so providing these facts from the frontend
   can be very impactful.

#. Use fast-math flags on floating point operations if legal. If you don't
   need strict IEEE floating point semantics, there are a number of additional
   optimizations that can be performed. This can be highly impactful for
   floating point intensive computations.
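
These flags appear directly on the operations, e.g.:

.. code-block:: llvm

   ; The frontend knows the addition cannot overflow signed, the
   ; increment cannot wrap unsigned, and strict IEEE semantics are not
   ; required for the multiply.
   %sum  = add nsw i32 %a, %b
   %next = add nuw i64 %i, 1
   %prod = fmul fast float %x, %y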

Describing Aliasing Properties
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

#. Add noalias/align/dereferenceable/nonnull to function arguments and return
   values as appropriate.

#. Use pointer aliasing metadata, especially tbaa metadata, to communicate
   otherwise-non-deducible pointer aliasing facts.

#. Use inbounds on GEPs. This can help to disambiguate some aliasing queries.
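
A sketch of argument attributes and inbounds GEPs used together:

.. code-block:: llvm

   ; The attributes promise: the pointers don't alias each other, are
   ; non-null, 4-byte aligned, and point to at least 8 valid bytes.
   define i32 @sum2(i32* noalias nonnull align 4 dereferenceable(8) %a,
                    i32* noalias nonnull align 4 dereferenceable(8) %b) {
   entry:
     ; inbounds helps alias analysis disambiguate these accesses.
     %a1 = getelementptr inbounds i32, i32* %a, i64 1
     %x = load i32, i32* %a1
     %y = load i32, i32* %b
     %r = add i32 %x, %y
     ret i32 %r
   }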


Modeling Memory Effects
^^^^^^^^^^^^^^^^^^^^^^^

#. Mark functions as readnone/readonly/argmemonly or noreturn/nounwind when
   known. The optimizer will try to infer these flags, but may not always be
   able to. Manual annotations are particularly important for external
   functions that the optimizer can not analyze.

#. Use the lifetime.start/lifetime.end and invariant.start/invariant.end
   intrinsics where possible. Common profitable uses are for stack-like data
   structures (thus allowing dead store elimination) and for describing
   the lifetimes of allocas (thus allowing smaller stack sizes).

#. Mark invariant locations using !invariant.load and TBAA's constant flags.
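
These annotations might be sketched as follows; the exact attribute and
intrinsic spellings have evolved across LLVM versions, so the forms below are
illustrative, and ``@hash`` / ``@fill`` are hypothetical:

.. code-block:: llvm

   ; An external function the optimizer cannot analyze: annotate what is
   ; known about its memory behavior.
   declare i32 @hash(i8* nocapture, i64) readonly nounwind

   declare void @llvm.lifetime.start.p0i8(i64, i8* nocapture)
   declare void @llvm.lifetime.end.p0i8(i64, i8* nocapture)
   declare void @fill(i8*)

   define void @use_buffer() {
   entry:
     %buf = alloca [64 x i8]
     %p = bitcast [64 x i8]* %buf to i8*
     ; Delimiting the buffer's lifetime lets stack coloring reuse the
     ; slot and enables dead store elimination outside the live range.
     call void @llvm.lifetime.start.p0i8(i64 64, i8* %p)
     call void @fill(i8* %p)
     call void @llvm.lifetime.end.p0i8(i64 64, i8* %p)
     ret void
   }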

Pass Ordering
^^^^^^^^^^^^^

One of the most common mistakes made by new language frontend projects is to
use the existing -O2 or -O3 pass pipelines as is. These pass pipelines make a
good starting point for an optimizing compiler for any language, but they have
been carefully tuned for C and C++, not your target language. You will almost
certainly need to use a custom pass order to achieve optimal performance. A
couple of specific suggestions:

#. For languages with numerous rarely executed guard conditions (e.g. null
   checks, type checks, range checks) consider adding an extra execution or
   two of LoopUnswitch and LICM to your pass order. The standard pass order,
   which is tuned for C and C++ applications, may not be sufficient to remove
   all dischargeable checks from loops.

#. If your language uses range checks, consider using the IRCE pass. It is
   not currently part of the standard pass order.

#. A useful sanity check is to run your optimized IR back through the -O2
   pipeline again. If you see noticeable improvement in the resulting IR,
   you likely need to adjust your pass order.
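
The round-trip sanity check might be sketched with ``opt`` (the file names
are hypothetical):

.. code-block:: console

   $ opt -O2 -S input.ll -o once.ll
   $ opt -O2 -S once.ll -o twice.ll
   $ diff once.ll twice.ll   # substantial differences suggest a pass-order gap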


I Still Can't Find What I'm Looking For
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you didn't find what you were looking for above, consider proposing a piece
of metadata which provides the optimization hint you need. Such extensions are
relatively common and are generally well received by the community. You will
need to ensure that your proposal is sufficiently general so that it benefits
others if you wish to contribute it upstream.

You should also consider describing the problem you're facing on `llvm-dev
<http://lists.llvm.org/mailman/listinfo/llvm-dev>`_ and asking for advice.
It's entirely possible someone has encountered your problem before and can
give good advice. If there are multiple interested parties, that also
increases the chances that a metadata extension would be well received by the
community as a whole.

Adding to this document
=======================

If you run across a case that you feel deserves to be covered here, please send
a patch to `llvm-commits
<http://lists.llvm.org/mailman/listinfo/llvm-commits>`_ for review.

If you have questions on these items, please direct them to `llvm-dev
<http://lists.llvm.org/mailman/listinfo/llvm-dev>`_. The more relevant
context you are able to give to your question, the more likely it is to be
answered.