Ben Clayton | ac07ed8 | 2019-03-26 14:17:41 +0000

# Reactor Debug Info Generation

## Introduction

Reactor generates executable code at run time using Just-In-Time (JIT) compilation. It can be used to JIT high-performance functions specialized for runtime
configurations, or even to build a compiler.

To debug executable code at a higher level than disassembly, source code files are required.

Reactor has two potential sources of source code:

1. The C++ source code of the program that calls into Reactor.
2. External source files read by the program and passed to Reactor.

Case (2) is preferable for implementing a compiler, but it is not currently
implemented.

Reactor implements case (1), which lets GDB single-step by source line and
inspect variables.

## Supported Platforms

Currently:

* Debug info generation is only supported on Linux with the LLVM 7
  backend.
* GDB is the only supported debugger.
* The program must itself be compiled with debug info.

## Enabling

Debug info generation is enabled with the `REACTOR_EMIT_DEBUG_INFO` CMake flag
(disabled by default).

## Implementation details

### Source Location

All Reactor functions begin with a call to `RR_DEBUG_INFO_UPDATE_LOC()`, which calls into `rr::DebugInfo::EmitLocation()`.

`rr::DebugInfo::EmitLocation()` calls `rr::DebugInfo::getCallerBacktrace()`,
which in turn uses [`libbacktrace`](https://github.com/ianlancetaylor/libbacktrace)
to unwind the stack and find the file, function and line of the caller.

This information is passed to `llvm::IRBuilder<>::SetCurrentDebugLocation`
to emit source line information for the next LLVM instructions to be built.

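As a rough illustration of this flow, the sketch below models the "set the location, then build instructions" pattern without LLVM or `libbacktrace`. `SourceLocation`, `ToyInstruction` and `ToyBuilder` are hypothetical names invented for this example, not Reactor or LLVM types. The sketch only assumes that a location, once set, is attached to every instruction built until the location changes.

```C++
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the file, function and line that the backtrace
// reports for the caller.
struct SourceLocation
{
    std::string file;
    std::string function;
    int line;
};

// One emitted instruction, tagged with the location that was current when
// it was built.
struct ToyInstruction
{
    std::string text;
    SourceLocation location;
};

// Toy builder (an assumption for illustration): it only mimics the
// "set location, then build" pattern and generates no real code.
class ToyBuilder
{
public:
    void setCurrentDebugLocation(SourceLocation loc) { current = std::move(loc); }

    // Records the instruction together with the current location.
    void emit(std::string text)
    {
        instructions.push_back(ToyInstruction{ std::move(text), current });
    }

    const std::vector<ToyInstruction> &emitted() const { return instructions; }

private:
    SourceLocation current{};
    std::vector<ToyInstruction> instructions;
};
```

In this model, calling `setCurrentDebugLocation({"foo.cpp", "main", 10})` followed by two `emit()` calls tags both instructions with line 10. Instructions emitted after a later location update carry the new line.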
### Variables

There are three aspects to generating variable debug information:

#### 1. Variable names

Constructing a Reactor `LValue`, for example:

```C++
rr::Int a = 1;
```

emits an LLVM `alloca` instruction to allocate the variable's storage, and
another instruction to initialize it to the constant `1`. While fluent, none of the
Reactor calls see the name of the C++ local variable "`a`", and the LLVM `alloca`
value gets a meaningless numerical name.

There are two potential ways that Reactor can obtain the variable name:

1. Use the running executable's own debug information to examine the local
   declaration and extract the local variable's name.
2. Use the backtrace information to parse the name from the source file.

While (1) is arguably a cleaner and more robust solution, (2) is
easier to implement and can work for the majority of use cases.

(2) is the solution currently implemented.

`rr::DebugInfo::getOrParseFileTokens()` scans a source file line by line, and
uses a regular expression to look for patterns of `<type> <name>`. Matching is not
precise, but is adequate to find locals constructed with and without assignment.

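The line-by-line scan can be sketched with `std::regex`. This is a minimal illustration under stated assumptions, not the pattern Reactor uses: `parseVariableNames` and its regular expression are hypothetical, and the pattern only recognizes a declaration whose name is followed by `=` or `;`.

```C++
#include <map>
#include <regex>
#include <sstream>
#include <string>

// Scans `source` line by line and returns a map from 1-based line number to
// the identifier that looks like the <name> in a `<type> <name>` pattern.
// The pattern is deliberately loose: it would also match `return x;`.
std::map<int, std::string> parseVariableNames(const std::string &source)
{
    // <type>: identifier, optionally namespace-qualified and templated.
    // <name>: identifier followed by '=' (assignment) or ';' (no assignment).
    static const std::regex declaration(
        R"re(^\s*([A-Za-z_][A-Za-z0-9_:]*(?:<[^>]*>)?)\s+([A-Za-z_][A-Za-z0-9_]*)\s*[=;])re");

    std::map<int, std::string> names;
    std::istringstream stream(source);
    std::string line;
    int lineNumber = 0;
    while (std::getline(stream, line))
    {
        ++lineNumber;
        std::smatch match;
        if (std::regex_search(line, match, declaration))
        {
            names[lineNumber] = match[2].str();
        }
    }
    return names;
}
```

Keying the result by line number matches how the names are consumed: the backtrace supplies a file and line, and the lookup returns whichever name was found on that line.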
#### 2. Variable binding

Given that we can find a variable name for a given source line, we need a way of
binding the LLVM values to the name.

Given our trivial example:

```C++
rr::Int a = 1;
```

The `rr::Int` constructor calls `RR_DEBUG_INFO_EMIT_VAR()`, passing the storage
value as its single argument. `RR_DEBUG_INFO_EMIT_VAR()` performs the backtrace
to find the source file and line, and uses the token information produced by
`rr::DebugInfo::getOrParseFileTokens()` to identify the variable name.

However, things get more complicated when multiple variables are constructed
on the same line.

Take for example:

```C++
rr::Int a = rr::Int(1) + rr::Int(2);
```

Here we have three calls to the `rr::Int` constructor, each calling down
to `RR_DEBUG_INFO_EMIT_VAR()`.

To disambiguate which of these should be bound to the variable name "`a`",
`rr::DebugInfo::EmitVariable()` buffers the binding into
`scope.pending` and the last binding for a given line is used by
`DebugInfo::emitPending()`. For variable construction and assignment, C++
guarantees that the LHS is the last value to be constructed.

This solution is not perfect. Multi-line expressions, multiple assignments on a
single line, and macro obfuscation can all break variable bindings. However,
most typical cases work.

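The "last binding for a line wins" rule can be modelled in a few lines. The sketch below is an assumption-laden toy, not Reactor's data structure: `PendingBindings` is a hypothetical class, values are represented by plain integer ids, and the pending bindings are keyed by line number purely for illustration.

```C++
#include <map>
#include <string>

// Hypothetical model of a scope's pending bindings, keyed by source line.
class PendingBindings
{
public:
    // Called once per constructed value. A later call for the same line
    // overwrites the earlier one, so the last value constructed wins.
    void emitVariable(int line, int valueId) { pending[line] = valueId; }

    // Resolves the buffered bindings against the variable names parsed from
    // the source file, returning a map from variable name to value id.
    // Lines with no known variable name are dropped.
    std::map<std::string, int> emitPending(const std::map<int, std::string> &namesByLine)
    {
        std::map<std::string, int> bound;
        for (const auto &entry : pending)
        {
            auto it = namesByLine.find(entry.first);
            if (it != namesByLine.end())
            {
                bound[it->second] = entry.second;
            }
        }
        pending.clear();
        return bound;
    }

private:
    std::map<int, int> pending;
};
```

For a line that constructs three values in sequence, only the third stays in the buffer and is bound to the name. That corresponds to the LHS being the last value constructed.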
#### 3. Variable scope

`rr::DebugInfo` maintains a stack of `llvm::DIScope`s and `llvm::DILocation`s
that mirrors the current backtrace of the function being called.

A synthetic call stack is produced by chaining `llvm::DILocation`s with
`InlinedAt`s.

For example, at the declaration of `i`:

```C++
void B()
{
    rr::Int i; // <- here
}

void A()
{
    B();
}

int main(int argc, const char* argv[])
{
    A();
}
```

The `DIScope` hierarchy would be:

```C++
                              DIFile: "foo.cpp"
rr::DebugInfo::diScope[0].di: ↳ DISubprogram: "main"
rr::DebugInfo::diScope[1].di: ↳ DISubprogram: "A"
rr::DebugInfo::diScope[2].di: ↳ DISubprogram: "B"
```

The `DILocation` hierarchy would be:

```C++
rr::DebugInfo::diRootLocation:      DILocation(DISubprogram: "ReactorFunction")
rr::DebugInfo::diScope[0].location: ↳ DILocation(DISubprogram: "main")
rr::DebugInfo::diScope[1].location:   ↳ DILocation(DISubprogram: "A")
rr::DebugInfo::diScope[2].location:     ↳ DILocation(DISubprogram: "B")
```

Where '↳' represents an `InlinedAt`.

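The chaining idea can be shown with a small linked structure. This is a toy model only: `ToyLocation`, `pushFrame` and `callStack` are hypothetical names, and the sketch assumes nothing about LLVM beyond each location holding a link to the location it was inlined at.

```C++
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a debug location: a scope name plus an optional
// link to the call-site location it is "inlined at".
struct ToyLocation
{
    std::string scope;
    std::shared_ptr<const ToyLocation> inlinedAt;
};

// Creates a new innermost location whose InlinedAt link points at `caller`.
std::shared_ptr<const ToyLocation> pushFrame(std::shared_ptr<const ToyLocation> caller,
                                             std::string scope)
{
    return std::make_shared<ToyLocation>(ToyLocation{ std::move(scope), std::move(caller) });
}

// Walks the InlinedAt links from the innermost location outwards, producing
// the synthetic call stack that a debugger could display.
std::vector<std::string> callStack(const std::shared_ptr<const ToyLocation> &innermost)
{
    std::vector<std::string> frames;
    for (const ToyLocation *loc = innermost.get(); loc != nullptr; loc = loc->inlinedAt.get())
    {
        frames.push_back(loc->scope);
    }
    return frames;
}
```

Building the chain root first ("ReactorFunction", then "main", "A" and "B") and walking it from the location in `B` yields the frames innermost first, ending at the root.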
`rr::DebugInfo::diScope` is updated by `rr::DebugInfo::syncScope()`.

`llvm::DIScope`s typically do not nest - there is usually a separate
`llvm::DISubprogram` for each function in the callstack. All local variables
within a function will typically share the same scope, regardless of whether
they are declared within a sub-block.

Loops and jumps within a function add complexity. Consider:

```C++
void B()
{
    rr::Int i = 0;
}

void A()
{
    for (int i = 0; i < 3; i++)
    {
        rr::Int x = 0;
    }
    B();
}

int main(int argc, const char* argv[])
{
    A();
}
```

In this particular example Reactor will not be aware of the `for` loop, and will
attempt to create three variables called "`x`" in the same function scope for `A()`.
Duplicate symbols in the same `llvm::DIScope` result in undefined behavior.

To solve this, `rr::DebugInfo::syncScope()` observes when a function jumps
backwards, and forks the current `llvm::DILexicalBlock` for the function. This
results in a number of `llvm::DILexicalBlock` chains, each declaring variables
that shadow the previous block.

At the declaration of `i` in `B()`, the `DIScope` hierarchy would be:

```C++
                              DIFile: "foo.cpp"
rr::DebugInfo::diScope[0].di: ↳ DISubprogram: "main"
                              ↳ DISubprogram: "A"
                              | ↳ DILexicalBlock: "A".1
rr::DebugInfo::diScope[1].di: |   ↳ DILexicalBlock: "A".2
rr::DebugInfo::diScope[2].di: ↳ DISubprogram: "B"
```

The `DILocation` hierarchy would be:

```C++
rr::DebugInfo::diRootLocation:      DILocation(DISubprogram: "ReactorFunction")
rr::DebugInfo::diScope[0].location: ↳ DILocation(DISubprogram: "main")
rr::DebugInfo::diScope[1].location:   ↳ DILocation(DILexicalBlock: "A".2)
rr::DebugInfo::diScope[2].location:     ↳ DILocation(DISubprogram: "B")
```

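The effect of forking a block on a backwards jump can be sketched as follows. This is a hypothetical model, not the logic of `rr::DebugInfo::syncScope()`: `ToyFunctionScope` is an invented class, and it assumes that a backwards jump means moving to a strictly earlier source line. How a repeated visit to the same line should be treated is outside what this sketch covers.

```C++
#include <map>
#include <set>
#include <string>

// Hypothetical model of the scope tracking for a single function.
class ToyFunctionScope
{
public:
    // Called with the source line of each Reactor call made in the function.
    // Moving to an earlier line is treated as a backwards jump and forks a
    // new block, so later declarations shadow earlier ones.
    void syncLine(int line)
    {
        if (line < lastLine)
        {
            ++block;
        }
        lastLine = line;
    }

    // Declares `name` in the current block. Returns false if the name is
    // already declared there, which is the clash that forking avoids.
    bool declare(const std::string &name)
    {
        return variables[block].insert(name).second;
    }

    int currentBlock() const { return block; }

private:
    int lastLine = 0;
    int block = 0;
    std::map<int, std::set<std::string>> variables;
};
```

With a loop body that spans two source lines, each new iteration returns to the earlier line and forks a fresh block, so every iteration declares its variables without clashing. Declaring the same name twice without an intervening backwards jump is reported as a duplicate.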
### Debugger integration

Once the debug information has been generated, it needs to be handed to the
debugger.

Reactor uses [`llvm::JITEventListener::createGDBRegistrationListener()`](http://llvm.org/doxygen/classllvm_1_1JITEventListener.html#a004abbb5a0d48ac376dfbe3e3c97c306)
to inform GDB of the JIT'd program and its debugging information.
More information [can be found here](https://llvm.org/docs/DebuggingJITedCode.html).

LLDB should be able to support this same mechanism, but at the time of writing
this does not appear to work.