John Bauman | 66b8ab2 | 2014-05-06 15:57:45 -0400 | [diff] [blame] | 1 | Wed Jun 25 15:13:51 CDT 2003
|
| 2 |
|
| 3 | First-level instrumentation
|
| 4 | ---------------------------
|
| 5 |
|
| 6 | We use opt to do Bytecode-to-bytecode instrumentation. Look at
|
| 7 | back-edges and insert llvm_first_trigger() function call which takes
|
| 8 | no arguments and no return value. This instrumentation is designed to
|
| 9 | be easy to remove, for instance by writing a NOP over the function
|
| 10 | call instruction.
|
| 11 |
|
| 12 | Keep count of every call to llvm_first_trigger(), and maintain
|
| 13 | counters in a map indexed by return address. If the trigger count
|
| 14 | exceeds a threshold, we identify a hot loop and perform second-level
|
| 15 | instrumentation on the hot loop region (the instructions between the
|
| 16 | target of the back-edge and the branch that causes the back-edge). We
|
| 17 | do not move code across basic-block boundaries.
|
| 18 |
|
| 19 |
|
| 20 | Second-level instrumentation
|
| 21 | ---------------------------
|
| 22 |
|
| 23 | We remove the first-level instrumentation by overwriting the CALL to
|
| 24 | llvm_first_trigger() with a NOP.
|
| 25 |
|
| 26 | The reoptimizer maintains a map between machine-code basic blocks and
|
| 27 | LLVM BasicBlock*s. We only keep track of paths that start at the
|
| 28 | first machine-code basic block of the hot loop region.
|
| 29 |
|
| 30 | How do we keep track of which edges to instrument, and which edges are
|
| 31 | exits from the hot region? 3 step process.
|
| 32 |
|
| 33 | 1) Do a DFS from the first machine-code basic block of the hot loop
|
| 34 | region and mark reachable edges.
|
| 35 |
|
| 36 | 2) Do a DFS from the last machine-code basic block of the hot loop
|
| 37 | region IGNORING back edges, and mark the edges which are reachable in
|
| 38 | 1) and also in 2) (i.e., must be reachable from both the start BB and
|
| 39 | the end BB of the hot region).
|
| 40 |
|
| 41 | 3) Mark BBs which end in edges that exit the hot region; we need to
|
| 42 | instrument these differently.
|
| 43 |
|
| 44 | Assume that there is 1 free register. On SPARC we use %g1, which LLC
|
| 45 | has agreed not to use. Shift a 1 into it at the beginning. At every
|
| 46 | edge which corresponds to a conditional branch, we shift 0 for not
|
| 47 | taken and 1 for taken into a register. This uniquely numbers the paths
|
| 48 | through the hot region. Silently fail if we need more than 64 bits.
|
| 49 |
|
| 50 | At the end BB we call countPath and increment the counter based on %g1
|
| 51 | and the return address of the countPath call. We keep track of the
|
| 52 | number of iterations and the number of paths. We only run this
|
| 53 | version 30 or 40 times.
|
| 54 |
|
| 55 | Find the BBs that total 90% or more of execution, and aggregate them
|
| 56 | together to form our trace. But we do not allow more than 5 paths; if
|
| 57 | we have more than 5 we take the ones that are executed the most. We
|
| 58 | verify our assumption that we picked a hot back-edge in first-level
|
| 59 | instrumentation, by making sure that the number of times we took an
|
| 60 | exit edge from the hot trace is less than 10% of the number of
|
| 61 | iterations.
|
| 62 |
|
| 63 | LLC has been taught to recognize llvm_first_trigger() calls and NOT
|
| 64 | generate saves and restores of caller-saved registers around these
|
| 65 | calls.
|
| 66 |
|
| 67 |
|
| 68 | Phase behavior
|
| 69 | --------------
|
| 70 |
|
| 71 | We turn off llvm_first_trigger() calls with NOPs, but this would hide
|
| 72 | phase behavior from us (when some funcs/traces stop being hot and
|
| 73 | others become hot.)
|
| 74 |
|
| 75 | We have a SIGALRM timer that counts time for us. Every time we get a
|
| 76 | SIGALRM we look at our priority queue of locations where we have
|
| 77 | removed llvm_first_trigger() calls. Each location is inserted along
|
| 78 | with a time when we will next turn instrumentation back on for that
|
| 79 | call site. If the time has arrived for a particular call site, we pop
|
| 80 | that off the prio. queue and turn instrumentation back on for that
|
| 81 | call site.
|
| 82 |
|
| 83 |
|
| 84 | Generating traces
|
| 85 | -----------------
|
| 86 |
|
| 87 | When we finally generate an optimized trace we first copy the code
|
| 88 | into the trace cache. This leaves us with 3 copies of the code: the
|
| 89 | original code, the instrumented code, and the optimized trace. The
|
| 90 | optimized trace does not have instrumentation. The original code and
|
| 91 | the instrumented code are modified to have a branch to the trace
|
| 92 | cache, where the optimized traces are kept.
|
| 93 |
|
| 94 | We copy the code from the original to the instrumentation version
|
| 95 | by tracing the LLVM-to-Machine code basic block map and then copying
|
| 96 | each machine code basic block we think is in the hot region into the
|
| 97 | trace cache. Then we instrument that code. The process is similar for
|
| 98 | generating the final optimized trace; we copy the same basic blocks
|
| 99 | because we might need to put in fixup code for exit BBs.
|
| 100 |
|
| 101 | LLVM basic blocks are not typically used in the Reoptimizer except
|
| 102 | for the mapping information.
|
| 103 |
|
| 104 | We are restricted to using single instructions to branch between the
|
| 105 | original code, trace, and instrumented code. So we have to keep the
|
| 106 | code copies in memory near the original code (they can't be far enough
|
| 107 | away that a single pc-relative branch would not work.) Malloc() or
|
| 108 | data region space is too far away. this impacts the design of the
|
| 109 | trace cache.
|
| 110 |
|
| 111 | We use a dummy function that is full of a bunch of for loops which we
|
| 112 | overwrite with trace-cache code. The trace manager keeps track of
|
| 113 | whether or not we have enough space in the trace cache, etc.
|
| 114 |
|
| 115 | The trace insertion routine takes an original start address, a vector
|
| 116 | of machine instructions representing the trace, index of branches and
|
| 117 | their corresponding absolute targets, and index of calls and their
|
| 118 | corresponding absolute targets.
|
| 119 |
|
| 120 | The trace insertion routine is responsible for inserting branches from
|
| 121 | the beginning of the original code to the beginning of the optimized
|
| 122 | trace. This is because at some point the trace cache may run out of
|
| 123 | space and it may have to evict a trace, at which point the branch to
|
| 124 | the trace would also have to be removed. It uses a round-robin
|
| 125 | replacement policy; we have found that this is almost as good as LRU
|
| 126 | and better than random (especially because of problems fitting the new
|
| 127 | trace in.)
|
| 128 |
|
| 129 | We cannot deal with discontiguous trace cache areas. The trace cache
|
| 130 | is supposed to be cache-line-aligned, but it is not page-aligned.
|
| 131 |
|
| 132 | We generate instrumentation traces and optimized traces into separate
|
| 133 | trace caches. We keep the instrumented code around because you don't
|
| 134 | want to delete a trace when you still might have to return to it
|
| 135 | (i.e., return from a llvm_first_trigger() or countPath() call.)
|
| 136 |
|
| 137 |
|