| ====================================== |
| Kaleidoscope: Adding Debug Information |
| ====================================== |
| |
| .. contents:: |
| :local: |
| |
| Chapter 9 Introduction |
| ====================== |
| |
| Welcome to Chapter 9 of the "`Implementing a language with |
| LLVM <index.html>`_" tutorial. In chapters 1 through 8, we've built a |
| decent little programming language with functions and variables. |
| What happens if something goes wrong, though? How do you debug your |
| program? |
| |
| Source level debugging uses formatted data that helps a debugger |
| translate from binary and the state of the machine back to the |
| source that the programmer wrote. In LLVM we generally use a format |
| called `DWARF <http://dwarfstd.org>`_. DWARF is a compact encoding |
| that represents types, source locations, and variable locations. |
| |
| The short summary of this chapter is that we'll go through the |
| various things you have to add to a programming language to |
| support debug info, and how you translate that into DWARF. |
| |
| Caveat: For now we can't debug via the JIT, so we'll need to compile |
| our program down to something small and standalone. As part of this |
| we'll make a few modifications to how the language runs and how |
| programs are compiled. This means that we'll have a source file |
| with a simple program written in Kaleidoscope rather than the |
| interactive JIT. To reduce the number of changes necessary, we also |
| accept the limitation of only one "top level" command at a time. |
| |
| Here's the sample program we'll be compiling: |
| |
| .. code-block:: python |
| |
| def fib(x) |
| if x < 3 then |
| 1 |
| else |
| fib(x-1)+fib(x-2); |
| |
| fib(10) |
| |
| |
| Why is this a hard problem? |
| =========================== |
| |
| Debug information is a hard problem for a few different reasons, mostly |
| centered around optimized code. First, optimization makes keeping source |
| locations more difficult. In LLVM IR we keep the original source location |
| for each IR level instruction on the instruction. Optimization passes |
| should keep the source locations for newly created instructions, but merged |
| instructions only get to keep a single location, which can cause jumping |
| around when stepping through optimized programs. Secondly, optimization |
| can transform variables so that they are optimized out, share memory |
| with other variables, or become difficult to track. For the purposes of this |
| tutorial we're going to avoid optimization (as you'll see with one of the |
| next sets of patches). |
| |
| Ahead-of-Time Compilation Mode |
| ============================== |
| |
| To highlight only the aspects of adding debug information to a source |
| language, without needing to worry about the complexities of JIT debugging, |
| we're going to make a few changes to Kaleidoscope to support compiling |
| the IR emitted by the front end into a simple standalone program that |
| you can execute, debug, and see results from. |
| |
| First we make our anonymous function that contains our top level |
| statement be our "main": |
| |
| .. code-block:: udiff |
| |
| - auto Proto = llvm::make_unique<PrototypeAST>("", std::vector<std::string>()); |
| + auto Proto = llvm::make_unique<PrototypeAST>("main", std::vector<std::string>()); |
| |
| just with the simple change of giving it a name. |
| |
| Then we're going to remove the command line code wherever it exists: |
| |
| .. code-block:: udiff |
| |
| @@ -1129,7 +1129,6 @@ static void HandleTopLevelExpression() { |
| /// top ::= definition | external | expression | ';' |
| static void MainLoop() { |
| while (1) { |
| - fprintf(stderr, "ready> "); |
| switch (CurTok) { |
| case tok_eof: |
| return; |
| @@ -1184,7 +1183,6 @@ int main() { |
| BinopPrecedence['*'] = 40; // highest. |
| |
| // Prime the first token. |
| - fprintf(stderr, "ready> "); |
| getNextToken(); |
| |
| Lastly we're going to disable all of the optimization passes and the JIT so |
| that the only thing that happens after we're done parsing and generating |
| code is that the LLVM IR goes to standard error: |
| |
| .. code-block:: udiff |
| |
| @@ -1108,17 +1108,8 @@ static void HandleExtern() { |
| static void HandleTopLevelExpression() { |
| // Evaluate a top-level expression into an anonymous function. |
| if (auto FnAST = ParseTopLevelExpr()) { |
| - if (auto *FnIR = FnAST->codegen()) { |
| - // We're just doing this to make sure it executes. |
| - TheExecutionEngine->finalizeObject(); |
| - // JIT the function, returning a function pointer. |
| - void *FPtr = TheExecutionEngine->getPointerToFunction(FnIR); |
| - |
| - // Cast it to the right type (takes no arguments, returns a double) so we |
| - // can call it as a native function. |
| - double (*FP)() = (double (*)())(intptr_t)FPtr; |
| - // Ignore the return value for this. |
| - (void)FP; |
| + if (!FnAST->codegen()) { |
| + fprintf(stderr, "Error generating code for top level expr"); |
| } |
| } else { |
| // Skip token for error recovery. |
| @@ -1439,11 +1459,11 @@ int main() { |
| // target lays out data structures. |
| TheModule->setDataLayout(TheExecutionEngine->getDataLayout()); |
| OurFPM.add(new DataLayoutPass()); |
| +#if 0 |
| OurFPM.add(createBasicAliasAnalysisPass()); |
| // Promote allocas to registers. |
| OurFPM.add(createPromoteMemoryToRegisterPass()); |
| @@ -1218,7 +1210,7 @@ int main() { |
| OurFPM.add(createGVNPass()); |
| // Simplify the control flow graph (deleting unreachable blocks, etc). |
| OurFPM.add(createCFGSimplificationPass()); |
| - |
| + #endif |
| OurFPM.doInitialization(); |
| |
| // Set the global so the code gen can use this. |
| |
| This relatively small set of changes gets us to the point that we can compile |
| our piece of Kaleidoscope language down to an executable program via this |
| command line: |
| |
| .. code-block:: bash |
| |
| Kaleidoscope-Ch9 < fib.ks |& clang -x ir - |
| |
| which gives an a.out/a.exe in the current working directory. |
| |
| Compile Unit |
| ============ |
| |
| The top level container for a section of code in DWARF is a compile unit. |
| This contains the type and function data for an individual translation unit |
| (read: one file of source code). So the first thing we need to do is |
| construct one for our fib.ks file. |
| |
| DWARF Emission Setup |
| ==================== |
| |
| Similar to the ``IRBuilder`` class we have a |
| `DIBuilder <http://llvm.org/doxygen/classllvm_1_1DIBuilder.html>`_ class |
| that helps in constructing debug metadata for an LLVM IR file. Its |
| relationship to debug metadata is much like that of ``IRBuilder`` to |
| LLVM IR: close to 1:1, but with nicer names. |
| Using it does require that you be more familiar with DWARF terminology than |
| you needed to be with ``IRBuilder`` and ``Instruction`` names, but if you |
| read through the general documentation on the |
| `Metadata Format <http://llvm.org/docs/SourceLevelDebugging.html>`_ it |
| should be a little more clear. We'll be using this class to construct all |
| of our IR level descriptions. Construction for it takes a module so we |
| need to construct it shortly after we construct our module. We've left it |
| as a global static variable to make it a bit easier to use. |
| |
| Next we're going to create a small container to cache some of our frequent |
| data. The first will be our compile unit, but we'll also write a bit of |
| code for our one type since we won't have to worry about multiple typed |
| expressions: |
| |
| .. code-block:: c++ |
| |
| static DIBuilder *DBuilder; |
| |
| struct DebugInfo { |
| DICompileUnit *TheCU; |
| DIType *DblTy; |
| |
| DIType *getDoubleTy(); |
| } KSDbgInfo; |
| |
| DIType *DebugInfo::getDoubleTy() { |
| if (DblTy) |
| return DblTy; |
| |
| DblTy = DBuilder->createBasicType("double", 64, dwarf::DW_ATE_float); |
| return DblTy; |
| } |
| |
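| The caching in ``getDoubleTy`` can be sketched on its own, without LLVM. |
| In this minimal example ``TypeDesc`` and ``TypeCache`` are hypothetical |
| stand-ins invented for illustration (they are not LLVM classes); the only |
| point is that the descriptor is created on first use and reused afterwards: |
| |

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for a debug type descriptor (not an LLVM class).
struct TypeDesc {
  std::string Name;
  unsigned SizeInBits;
};

// Hypothetical cache that mirrors the shape of DebugInfo::getDoubleTy():
// build the descriptor once, then hand back the same pointer every time.
class TypeCache {
  std::unique_ptr<TypeDesc> DblTy; // starts out null
  int CreateCount = 0;             // counts how often we really built it

public:
  const TypeDesc *getDoubleTy() {
    if (!DblTy) {
      DblTy = std::make_unique<TypeDesc>(TypeDesc{"double", 64});
      ++CreateCount;
    }
    assert(DblTy && "descriptor must exist after the first request");
    return DblTy.get();
  }

  int createCount() const { return CreateCount; }
};
```

| |
| Calling ``getDoubleTy()`` twice on one ``TypeCache`` returns the same pointer |
| and builds the descriptor only once. The tutorial's version relies on |
| ``KSDbgInfo`` being a global, so its ``DblTy`` pointer starts out null. |
| |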
| And then later on in ``main`` when we're constructing our module: |
| |
| .. code-block:: c++ |
| |
| DBuilder = new DIBuilder(*TheModule); |
| |
| KSDbgInfo.TheCU = DBuilder->createCompileUnit( |
| dwarf::DW_LANG_C, DBuilder->createFile("fib.ks", "."), |
| "Kaleidoscope Compiler", 0, "", 0); |
| |
| There are a couple of things to note here. First, while we're producing a |
| compile unit for a language called Kaleidoscope we used the language |
| constant for C. This is because a debugger wouldn't necessarily understand |
| the calling conventions or default ABI for a language it doesn't recognize |
| and we follow the C ABI in our LLVM code generation so it's the closest |
| thing to accurate. This ensures we can actually call functions from the |
| debugger and have them execute. Secondly, you'll see the "fib.ks" in the |
| call to ``createCompileUnit``. This is a default hard coded value since |
| we're using shell redirection to put our source into the Kaleidoscope |
| compiler. In a usual front end you'd have an input file name and it would |
| go there. |
| |
| One last thing as part of emitting debug information via DIBuilder is that |
| we need to "finalize" the debug information. The reasons are part of the |
| underlying API for DIBuilder, but make sure you do this near the end of |
| main: |
| |
| .. code-block:: c++ |
| |
| DBuilder->finalize(); |
| |
| before you dump out the module. |
| |
| Functions |
| ========= |
| |
| Now that we have our ``Compile Unit``, we can add |
| function definitions to the debug info. So in ``PrototypeAST::codegen()`` we |
| add a few lines of code to describe a context for our subprogram, in this |
| case the "File", and the actual definition of the function itself. |
| |
| So the context: |
| |
| .. code-block:: c++ |
| |
| DIFile *Unit = DBuilder->createFile(KSDbgInfo.TheCU->getFilename(), |
| KSDbgInfo.TheCU->getDirectory()); |
| |
| giving us a DIFile by asking the ``Compile Unit`` we created above for the |
| directory and filename where we are currently. Then, for now, we use |
| source locations of 0 (since our AST doesn't currently have source location |
| information) and construct our function definition: |
| |
| .. code-block:: c++ |
| |
| DIScope *FContext = Unit; |
| unsigned LineNo = 0; |
| unsigned ScopeLine = 0; |
| DISubprogram *SP = DBuilder->createFunction( |
| FContext, P.getName(), StringRef(), Unit, LineNo, |
| CreateFunctionType(TheFunction->arg_size(), Unit), |
| false /* internal linkage */, true /* definition */, ScopeLine, |
| DINode::FlagPrototyped, false); |
| TheFunction->setSubprogram(SP); |
| |
| and we now have a DISubprogram that contains a reference to all of our |
| metadata for the function. |
| |
| Source Locations |
| ================ |
| |
| The most important thing for debug information is accurate source location: |
| this makes it possible to map the running program back to your source code. |
| We have a problem though: Kaleidoscope really doesn't have any source |
| location information in the lexer or parser, so we'll need to add it. |
| |
| .. code-block:: c++ |
| |
| struct SourceLocation { |
| int Line; |
| int Col; |
| }; |
| static SourceLocation CurLoc; |
| static SourceLocation LexLoc = {1, 0}; |
| |
| static int advance() { |
| int LastChar = getchar(); |
| |
| if (LastChar == '\n' || LastChar == '\r') { |
| LexLoc.Line++; |
| LexLoc.Col = 0; |
| } else |
| LexLoc.Col++; |
| return LastChar; |
| } |
| |
| In this set of code we've added some functionality to keep track of the |
| line and column of the "source file". As we lex every token we set our |
| current "lexical location" to the line and column for the beginning |
| of the token. We do this by replacing all of the previous calls to |
| ``getchar()`` with our new ``advance()`` that keeps track of the information, |
| and then we have added a source location to all of our AST classes: |
| |
| .. code-block:: c++ |
| |
| class ExprAST { |
| SourceLocation Loc; |
| |
| public: |
| ExprAST(SourceLocation Loc = CurLoc) : Loc(Loc) {} |
| virtual ~ExprAST() {} |
| virtual Value* codegen() = 0; |
| int getLine() const { return Loc.Line; } |
| int getCol() const { return Loc.Col; } |
| virtual raw_ostream &dump(raw_ostream &out, int ind) { |
| return out << ':' << getLine() << ':' << getCol() << '\n'; |
| } |
| }; |
| |
| that we pass down through when we create a new expression: |
| |
| .. code-block:: c++ |
| |
| LHS = llvm::make_unique<BinaryExprAST>(BinLoc, BinOp, std::move(LHS), |
| std::move(RHS)); |
| |
| giving us locations for each of our expressions and variables. |
| |
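| The line and column bookkeeping can be tried out in isolation. The following |
| is a minimal sketch, not the tutorial's lexer: ``LocationTracker`` and |
| ``locationAfter`` are hypothetical names, the input comes from a string |
| instead of ``getchar()``, and, unlike ``advance()`` above, a ``\r\n`` pair is |
| counted as a single line break (an assumption of this sketch): |
| |

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <utility>

struct SourceLocation {
  int Line;
  int Col;
};

// Hypothetical helper: the same bookkeeping as advance(), but reading from
// a string so it can be exercised without stdin.
class LocationTracker {
  std::string Input;
  std::size_t Pos = 0;
  bool PrevWasCR = false;
  SourceLocation Loc{1, 0}; // same starting point as LexLoc

public:
  explicit LocationTracker(std::string Text) : Input(std::move(Text)) {}

  int advance() {
    if (Pos >= Input.size())
      return EOF;
    int C = static_cast<unsigned char>(Input[Pos++]);

    // Assumption of this sketch: the '\n' of a "\r\n" pair was already
    // counted when we saw the '\r', so it does not start another line.
    bool SecondHalfOfCRLF = (C == '\n' && PrevWasCR);
    PrevWasCR = (C == '\r');
    if (SecondHalfOfCRLF)
      return C;

    if (C == '\n' || C == '\r') {
      ++Loc.Line;
      Loc.Col = 0;
    } else {
      ++Loc.Col;
    }
    return C;
  }

  SourceLocation location() const { return Loc; }
};

// Returns the location reached after consuming Count characters of Text.
SourceLocation locationAfter(const std::string &Text, std::size_t Count) {
  assert(Count <= Text.size() && "cannot consume more than the input holds");
  LocationTracker Tracker(Text);
  for (std::size_t I = 0; I < Count; ++I)
    Tracker.advance();
  return Tracker.location();
}
```

| |
| For example, consuming the first line of the sample program including its |
| newline leaves the tracker at line 2, column 0. |
| |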
| To make sure that every instruction gets proper source location information, |
| we have to tell ``Builder`` whenever we're at a new source location. |
| We use a small helper function for this: |
| |
| .. code-block:: c++ |
| |
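| void DebugInfo::emitLocation(ExprAST *AST) { |
| // A null AST clears the current location (used for the function prologue). |
| if (!AST) |
| return Builder.SetCurrentDebugLocation(DebugLoc()); |
| DIScope *Scope; |
| if (LexicalBlocks.empty()) |
| Scope = TheCU; |
| else |
| Scope = LexicalBlocks.back(); |
| Builder.SetCurrentDebugLocation( |
| DebugLoc::get(AST->getLine(), AST->getCol(), Scope)); |
| } |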
| |
| This tells the main ``IRBuilder`` both where we are and what scope |
| we're in. The scope can either be at compile-unit level or be the nearest |
| enclosing lexical block, such as the current function. |
| To represent this we create a stack of scopes: |
| |
| .. code-block:: c++ |
| |
| std::vector<DIScope *> LexicalBlocks; |
| |
| and push the scope (function) to the top of the stack when we start |
| generating the code for each function: |
| |
| .. code-block:: c++ |
| |
| KSDbgInfo.LexicalBlocks.push_back(SP); |
| |
| Also, we must not forget to pop the scope back off of the scope stack at the |
| end of the code generation for the function: |
| |
| .. code-block:: c++ |
| |
| // Pop off the lexical block for the function since we added it |
| // unconditionally. |
| KSDbgInfo.LexicalBlocks.pop_back(); |
| |
| Then we make sure to emit the location every time we start to generate code |
| for a new AST object: |
| |
| .. code-block:: c++ |
| |
| KSDbgInfo.emitLocation(this); |
| |
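| The push and pop discipline for the scope stack can be sketched without LLVM. |
| In this hypothetical example ``Scope``, ``DebugInfoSketch``, and |
| ``ScopeGuard`` are names invented for illustration; the tutorial itself pushes |
| and pops by hand. The guard is one possible way to make sure the pop is |
| never forgotten, even on an early return: |
| |

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for DIScope (not an LLVM class).
struct Scope {
  std::string Name;
};

// Hypothetical stand-in for the tutorial's DebugInfo struct.
struct DebugInfoSketch {
  Scope CompileUnit{"fib.ks"};
  std::vector<Scope *> LexicalBlocks;

  // Same rule as emitLocation(): fall back to the compile unit when no
  // lexical block is active, otherwise use the innermost one.
  Scope *currentScope() {
    return LexicalBlocks.empty() ? &CompileUnit : LexicalBlocks.back();
  }
};

// Pushes a scope on construction and pops it on destruction.
class ScopeGuard {
  DebugInfoSketch &Info;

public:
  ScopeGuard(DebugInfoSketch &Info, Scope *S) : Info(Info) {
    Info.LexicalBlocks.push_back(S);
  }
  ~ScopeGuard() {
    assert(!Info.LexicalBlocks.empty() && "scope stack underflow");
    Info.LexicalBlocks.pop_back();
  }
  ScopeGuard(const ScopeGuard &) = delete;
  ScopeGuard &operator=(const ScopeGuard &) = delete;
};

// Simulates generating code for one function and reports which scope was
// active while "inside" it.
std::string scopeNameInsideFunction(DebugInfoSketch &Info, Scope &Fn) {
  ScopeGuard Guard(Info, &Fn);
  return Info.currentScope()->Name;
}
```

| |
| Outside any function ``currentScope()`` yields the compile unit; inside |
| ``scopeNameInsideFunction`` it yields the function's scope, and the stack is |
| empty again once the call returns. |
| |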
| Variables |
| ========= |
| |
| Now that we have functions, we need to be able to print out the variables |
| we have in scope. Let's get our function arguments set up so we can get |
| decent backtraces and see how our functions are being called. It isn't |
| a lot of code, and we generally handle it when we're creating the |
| argument allocas in ``FunctionAST::codegen``. |
| |
| .. code-block:: c++ |
| |
| // Record the function arguments in the NamedValues map. |
| NamedValues.clear(); |
| unsigned ArgIdx = 0; |
| for (auto &Arg : TheFunction->args()) { |
| // Create an alloca for this variable. |
| AllocaInst *Alloca = CreateEntryBlockAlloca(TheFunction, Arg.getName()); |
| |
| // Create a debug descriptor for the variable. |
| DILocalVariable *D = DBuilder->createParameterVariable( |
| SP, Arg.getName(), ++ArgIdx, Unit, LineNo, KSDbgInfo.getDoubleTy(), |
| true); |
| |
| DBuilder->insertDeclare(Alloca, D, DBuilder->createExpression(), |
| DebugLoc::get(LineNo, 0, SP), |
| Builder.GetInsertBlock()); |
| |
| // Store the initial value into the alloca. |
| Builder.CreateStore(&Arg, Alloca); |
| |
| // Add arguments to variable symbol table. |
| NamedValues[Arg.getName()] = Alloca; |
| } |
| |
| |
| Here we're first creating the variable, giving it the scope (``SP``), |
| the name, source location, type, and since it's an argument, the argument |
| index. Next, we create an ``llvm.dbg.declare`` call to indicate at the IR |
| level that we've got a variable in an alloca (and it gives a starting |
| location for the variable), and we set a source location for the |
| beginning of the scope on the declare. |
| |
| One interesting thing to note at this point is that various debuggers make |
| assumptions based on how code and debug information was generated for them |
| in the past. In this case we need to do a little bit of a hack to avoid |
| generating line information for the function prologue so that the debugger |
| knows to skip over those instructions when setting a breakpoint. So in |
| ``FunctionAST::codegen`` we add some more lines: |
| |
| .. code-block:: c++ |
| |
| // Unset the location for the prologue emission (leading instructions with no |
| // location in a function are considered part of the prologue and the debugger |
| // will run past them when breaking on a function) |
| KSDbgInfo.emitLocation(nullptr); |
| |
| and then emit a new location when we actually start generating code for the |
| body of the function: |
| |
| .. code-block:: c++ |
| |
| KSDbgInfo.emitLocation(Body.get()); |
| |
| With this we have enough debug information to set breakpoints in functions, |
| print out argument variables, and call functions. Not too bad for just a |
| few simple lines of code! |
| |
| Full Code Listing |
| ================= |
| |
| Here is the complete code listing for our running example, enhanced with |
| debug information. To build this example, use: |
| |
| .. code-block:: bash |
| |
| # Compile |
| clang++ -g toy.cpp `llvm-config --cxxflags --ldflags --system-libs --libs core mcjit native` -O3 -o toy |
| # Run |
| ./toy |
| |
| Here is the code: |
| |
| .. literalinclude:: ../../examples/Kaleidoscope/Chapter9/toy.cpp |
| :language: c++ |
| |
| `Next: Conclusion and other useful LLVM tidbits <LangImpl10.html>`_ |
| |