
 # Reactor Debug Info Generation

 ## Introduction

 Reactor produces just-in-time (JIT) compiled executable code. It can be used to
 JIT high-performance functions specialized for runtime configurations, or even
 to build a compiler.

 Debugging this code at a higher level than disassembly requires source code files.

 Reactor has two potential sources of source code:

 1. The C++ source code of the program that calls into Reactor.
 2. External source files read by the program and passed to Reactor.

 Case (2) would be preferable for implementing a compiler, but it is not
 currently implemented.

 Reactor implements case (1), which lets GDB single-step by source line and
 inspect variables.

 ## Supported Platforms

 Currently:

 * Debug info generation is only supported on Linux with the LLVM 7
   backend.
 * GDB is the only supported debugger.
 * The program must itself be compiled with debug info.

 ## Enabling

 Debug info generation is enabled with the `REACTOR_EMIT_DEBUG_INFO` CMake flag
 (disabled by default).

 ## Implementation details

 ### Source Location

 All Reactor functions begin with a call to `RR_DEBUG_INFO_UPDATE_LOC()`, which calls into `rr::DebugInfo::EmitLocation()`.

 `rr::DebugInfo::EmitLocation()` calls `rr::DebugInfo::getCallerBacktrace()`,
 which in turn uses [`libbacktrace`](https://github.com/ianlancetaylor/libbacktrace)
 to unwind the stack and find the file, function and line of the caller.

 This information is passed to `llvm::IRBuilder<>::SetCurrentDebugLocation`
 to emit source line information for the next LLVM instructions to be built.
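As a rough illustration of what "the file, function and line of the caller" means, the sketch below captures a call site without unwinding the stack. It is not how Reactor does it: it swaps `libbacktrace` for the `__builtin_FILE()`, `__builtin_FUNCTION()` and `__builtin_LINE()` compiler builtins (GCC and recent Clang), and `CallerLocation` and `captureCaller()` are hypothetical names.

```cpp
#include <string>

// Hypothetical record of a call site, similar in spirit to what the
// backtrace yields: file, function and line of the caller.
struct CallerLocation
{
    std::string file;
    std::string function;
    int line;
};

// Default arguments are evaluated at the call site, so the builtins report
// the caller's position rather than the position of this function.
// Assumption: a compiler that provides these builtins (GCC, recent Clang).
inline CallerLocation captureCaller(const char *file = __builtin_FILE(),
                                    const char *function = __builtin_FUNCTION(),
                                    int line = __builtin_LINE())
{
    return CallerLocation{ file, function, line };
}
```

Calling `captureCaller()` with no arguments returns the location of that call. Unlike a real backtrace, this only sees one frame, which is why it cannot replace the stack unwinding described above.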

 ### Variables

 There are 3 aspects to generating variable debug information:

 #### 1. Variable names

 Constructing a Reactor `LValue`:

 ```C++
 rr::Int a = 1;
 ```

 This emits an LLVM `alloca` instruction to allocate the variable's storage,
 and a second instruction to initialize it to the constant `1`. While fluent, none of the
 Reactor calls see the name of the C++ local variable "`a`", so the LLVM `alloca`
 value gets a meaningless numerical name.

 There are two potential ways that Reactor can obtain the variable name:

 1. Use the running executable's own debug information to examine the local
    declaration and extract the local variable's name.
 2. Use the backtrace information to parse the name from the source file.

 While (1) is arguably a cleaner and more robust solution, (2) is
 easier to implement and works for the majority of use cases.
 (2) is the solution currently implemented.

 `rr::DebugInfo::getOrParseFileTokens()` scans a source file line by line, and
 uses a regular expression to look for patterns of `<type> <name>`. Matching is not
 precise, but is adequate to find locals constructed with and without assignment.
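The line-by-line scan can be sketched roughly as follows. This is a minimal illustration, not Reactor's code: the function name `parseFileTokens()`, the `LineTokens` type and the regular expression itself are assumptions chosen for the example, and the pattern only recognizes declarations that end in `=` or `;`.

```cpp
#include <map>
#include <regex>
#include <sstream>
#include <string>

// Maps 1-based source line numbers to the variable name declared on that line.
using LineTokens = std::map<int, std::string>;

// Scans source text line by line for "<type> <name>" followed by '=' or ';'.
// The pattern is illustrative and deliberately loose; it is not Reactor's regex.
inline LineTokens parseFileTokens(const std::string &source)
{
    static const std::regex declaration(
        R"re(^\s*(?:[A-Za-z_]\w*::)*[A-Za-z_]\w*(?:<[^>]*>)?\s+([A-Za-z_]\w*)\s*(?:=|;))re");

    LineTokens tokens;
    std::istringstream stream(source);
    std::string text;
    int line = 0;
    while(std::getline(stream, text))
    {
        ++line;
        std::smatch match;
        if(std::regex_search(text, match, declaration))
        {
            tokens[line] = match[1].str();  // capture group 1 is the <name>
        }
    }
    return tokens;
}
```

A pattern like this also matches lines such as `return x;`, which shows the kind of imprecision mentioned above. Since the result is only consulted for lines where a Reactor variable is actually constructed, such false positives are mostly harmless in this sketch.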

 #### 2. Variable binding

 Given that we can find a variable name for a given source line, we need a way of
 binding the LLVM values to the name.

 Given our trivial example:

 ```C++
 rr::Int a = 1;
 ```

 The `rr::Int` constructor calls `RR_DEBUG_INFO_EMIT_VAR()`, passing the storage
 value as its single argument. `RR_DEBUG_INFO_EMIT_VAR()` performs the backtrace
 to find the source file and line and uses the token information produced by
 `rr::DebugInfo::getOrParseFileTokens()` to identify the variable name.

 However, things get a bit more complicated when there are multiple variables
 being constructed on the same line.

 Take for example:

 ```C++
 rr::Int a = rr::Int(1) + rr::Int(2);
 ```

 Here we have 3 calls to the `rr::Int` constructor, each calling down
 to `RR_DEBUG_INFO_EMIT_VAR()`.

 To disambiguate which of these should be bound to the variable name "`a`",
 `rr::DebugInfo::EmitVariable()` buffers the binding into
 `scope.pending` and the last binding for a given line is used by
 `DebugInfo::emitPending()`. For variable construction and assignment, C++
 guarantees that the LHS is the last value to be constructed.
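The "last binding on a line wins" rule can be sketched as below. This is a simplified model under stated assumptions, not the real `rr::DebugInfo`: `PendingBindings`, `ValueId`, `Binding` and `flush()` are hypothetical names, values are plain integers instead of LLVM values, and only one pending line is tracked.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an LLVM storage value; only an id is needed here.
using ValueId = int;

struct Binding
{
    std::string name;
    ValueId value;
};

class PendingBindings
{
public:
    // lineTokens maps a source line to the variable name parsed for it.
    explicit PendingBindings(std::map<int, std::string> lineTokens)
        : tokens(std::move(lineTokens)) {}

    // Called for every constructed value. Values on the same line overwrite
    // each other, so only the last one constructed stays pending.
    void emitVariable(int line, ValueId value)
    {
        if(hasPending && line != pendingLine) { flush(); }
        pendingLine = line;
        pendingValue = value;
        hasPending = true;
    }

    // Binds the pending value to the name parsed for its line, if there is one.
    void flush()
    {
        if(!hasPending) { return; }
        auto it = tokens.find(pendingLine);
        if(it != tokens.end()) { emitted.push_back({ it->second, pendingValue }); }
        hasPending = false;
    }

    const std::vector<Binding> &bindings() const { return emitted; }

private:
    std::map<int, std::string> tokens;
    std::vector<Binding> emitted;
    int pendingLine = 0;
    ValueId pendingValue = 0;
    bool hasPending = false;
};
```

For the three-constructor example, three `emitVariable()` calls on the same line followed by `flush()` produce a single binding of "`a`" to the third value, which corresponds to the left-hand side.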

 This solution is not perfect: multi-line expressions, multiple assignments on a
 single line, and macro obfuscation can all break variable bindings. However, the
 majority of typical cases work.

 #### 3. Variable scope

 `rr::DebugInfo` maintains a stack of `llvm::DIScope`s and `llvm::DILocation`s
 that mirrors the current backtrace of the function being called.

 A synthetic call stack is produced by chaining `llvm::DILocation`s with
 `InlinedAt`s.

 For example, at the declaration of `i`:

 ```C++
 void B()
 {
     rr::Int i; // <- here
 }

 void A()
 {
     B();
 }

 int main(int argc, const char* argv[])
 {
     A();
 }
 ```

 The `DIScope` hierarchy would be:

 ```C++
                               DIFile: "foo.cpp"
 rr::DebugInfo::diScope[0].di: ↳ DISubprogram: "main"
 rr::DebugInfo::diScope[1].di: ↳ DISubprogram: "A"
 rr::DebugInfo::diScope[2].di: ↳ DISubprogram: "B"
 ```

 The `DILocation` hierarchy would be:

 ```C++
 rr::DebugInfo::diRootLocation:      DILocation(DISubprogram: "ReactorFunction")
 rr::DebugInfo::diScope[0].location: ↳ DILocation(DISubprogram: "main")
 rr::DebugInfo::diScope[1].location:   ↳ DILocation(DISubprogram: "A")
 rr::DebugInfo::diScope[2].location:     ↳ DILocation(DISubprogram: "B")
 ```

 Where '↳' represents an `InlinedAt`.
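To make the chaining concrete, here is a minimal model of how following `InlinedAt` links from the innermost location recovers the synthetic call stack. `Location`, `Frame`, `buildChain()` and `callStack()` are hypothetical names for this sketch; the real implementation uses `llvm::DILocation` objects rather than this struct.

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for llvm::DILocation.
struct Location
{
    std::string scope;                          // name of the enclosing subprogram
    int line;                                   // source line within that scope
    std::shared_ptr<const Location> inlinedAt;  // parent location, or null for the root
};

struct Frame
{
    std::string scope;
    int line;
};

// Builds a chain from frames listed outermost first. Each new location is
// "inlined at" the previous one. Returns the innermost location.
inline std::shared_ptr<const Location> buildChain(const std::vector<Frame> &outermostFirst)
{
    std::shared_ptr<const Location> current;
    for(const Frame &frame : outermostFirst)
    {
        current = std::make_shared<Location>(Location{ frame.scope, frame.line, current });
    }
    return current;
}

// Walks the inlinedAt links, yielding scope names innermost first, which is
// the order a debugger would show for a call stack.
inline std::vector<std::string> callStack(std::shared_ptr<const Location> location)
{
    std::vector<std::string> names;
    for(; location; location = location->inlinedAt)
    {
        names.push_back(location->scope);
    }
    return names;
}
```

Building the chain from `ReactorFunction`, `main`, `A`, `B` and walking it from the innermost location yields `B`, `A`, `main`, `ReactorFunction`.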


 `rr::DebugInfo::diScope` is updated by `rr::DebugInfo::syncScope()`.

 `llvm::DIScope`s typically do not nest - there is usually a separate
 `llvm::DISubprogram` for each function in the callstack. All local variables
 within a function will typically share the same scope, regardless of whether
 they are declared within a sub-block.

 Loops and jumps within a function add complexity. Consider:

 ```C++
 void B()
 {
     rr::Int i = 0;
 }

 void A()
 {
     for (int i = 0; i < 3; i++)
     {
         rr::Int x = 0;
     }
     B();
 }

 int main(int argc, const char* argv[])
 {
     A();
 }
 ```

 In this particular example Reactor will not be aware of the `for` loop, and will
 attempt to create three variables called "`x`" in the same function scope for `A()`.
 Duplicate symbols in the same `llvm::DIScope` result in undefined behavior.

 To solve this, `rr::DebugInfo::syncScope()` observes when a function jumps
 backwards, and forks the current `llvm::DILexicalBlock` for the function. This
 results in a number of `llvm::DILexicalBlock` chains, each declaring variables
 that shadow the previous block.
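The forking behaviour might be modelled like this. It is a sketch under assumptions: `FunctionScope` and `sync()` are hypothetical names, the block is tracked as a counter rather than as real `llvm::DILexicalBlock` objects, and "jumps backwards" is taken to mean that the observed source line is lower than the previous one, which may differ from the exact test Reactor uses.

```cpp
#include <string>

// Hypothetical, simplified stand-in for a function's chain of lexical blocks.
struct FunctionScope
{
    std::string function;
    int lastLine = 0;  // last source line observed in this function
    int block = 0;     // 0 = the subprogram itself, N = N-th forked lexical block

    // Records a newly observed source line. A line lower than the previous
    // one is treated as a backwards jump (for example the next iteration of
    // a loop) and forks a fresh lexical block to hold new variables.
    void sync(int line)
    {
        if(line < lastLine) { block++; }
        lastLine = line;
    }

    // Name of the scope that new variables would currently be declared in.
    std::string name() const
    {
        return block == 0 ? function : function + "." + std::to_string(block);
    }
};
```

With a loop body that spans two source lines, say 10 and 11, three iterations are observed as lines 10, 11, 10, 11, 10, 11. The two backwards jumps fork two blocks, so variables from the third iteration land in the block named `A.2`.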

 At the declaration of `i` in `B()`, the `DIScope` hierarchy would be:

 ```C++
                               DIFile: "foo.cpp"
 rr::DebugInfo::diScope[0].di: ↳ DISubprogram: "main"
                               ↳ DISubprogram: "A"
                               | ↳ DILexicalBlock: "A".1
 rr::DebugInfo::diScope[1].di: |   ↳ DILexicalBlock: "A".2
 rr::DebugInfo::diScope[2].di: ↳ DISubprogram: "B"
 ```

 The `DILocation` hierarchy would be:

 ```C++
 rr::DebugInfo::diRootLocation:      DILocation(DISubprogram: "ReactorFunction")
 rr::DebugInfo::diScope[0].location: ↳ DILocation(DISubprogram: "main")
 rr::DebugInfo::diScope[1].location:   ↳ DILocation(DILexicalBlock: "A".2)
 rr::DebugInfo::diScope[2].location:     ↳ DILocation(DISubprogram: "B")
 ```

 ### Debugger integration

 Once the debug information has been generated, it needs to be handed to the
 debugger.

 Reactor uses [`llvm::JITEventListener::createGDBRegistrationListener()`](http://llvm.org/doxygen/classllvm_1_1JITEventListener.html#a004abbb5a0d48ac376dfbe3e3c97c306)
 to inform GDB of the JIT'd program and its debugging information.
 More information [can be found here](https://llvm.org/docs/DebuggingJITedCode.html).

 LLDB should be able to support this same mechanism, but at the time of writing
 this does not appear to work.
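For background, the mechanism behind the registration listener is the JIT compilation interface described in the GDB manual: the JIT links a `jit_code_entry` describing an in-memory object file into a global descriptor, then calls a function on which GDB has set a breakpoint. The sketch below shows that interface by hand. The struct and symbol names come from the GDB manual, `registerObject()` is a hypothetical helper, and none of this is Reactor code. LLVM's listener handles these steps internally, so a program that links LLVM should not define these symbols itself.

```cpp
#include <cstdint>

extern "C" {

// Layout and names follow the GDB manual's "JIT Compilation Interface".
typedef enum
{
    JIT_NOACTION = 0,
    JIT_REGISTER_FN,
    JIT_UNREGISTER_FN
} jit_actions_t;

struct jit_code_entry
{
    jit_code_entry *next_entry;
    jit_code_entry *prev_entry;
    const char *symfile_addr;    // in-memory object file carrying the debug info
    std::uint64_t symfile_size;
};

struct jit_descriptor
{
    std::uint32_t version;       // must be 1
    std::uint32_t action_flag;   // one of jit_actions_t
    jit_code_entry *relevant_entry;
    jit_code_entry *first_entry;
};

// GDB places a breakpoint in this function, so it must not be inlined away.
void __attribute__((noinline)) __jit_debug_register_code() { __asm__ volatile(""); }

jit_descriptor __jit_debug_descriptor = { 1, JIT_NOACTION, nullptr, nullptr };

}  // extern "C"

// Hypothetical helper: links an entry at the head of the list and notifies
// an attached debugger that a new object file is available.
inline void registerObject(jit_code_entry *entry)
{
    entry->prev_entry = nullptr;
    entry->next_entry = __jit_debug_descriptor.first_entry;
    if(entry->next_entry) { entry->next_entry->prev_entry = entry; }
    __jit_debug_descriptor.first_entry = entry;
    __jit_debug_descriptor.relevant_entry = entry;
    __jit_debug_descriptor.action_flag = JIT_REGISTER_FN;
    __jit_debug_register_code();
}
```

When no debugger is attached, the call to `__jit_debug_register_code()` does nothing, so registration is cheap to leave enabled.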