third_party/subzero/docs/ALLOCATION.rst - SwiftShader - Git at Google

 Object allocation and lifetime in ICE
 =====================================

 This document discusses object lifetime and scoping issues, starting with
 bitcode parsing and ending with ELF file emission.

 Multithreaded translation model
 -------------------------------

 A single thread is responsible for parsing PNaCl bitcode (possibly concurrently
 with downloading the bitcode file) and constructing the initial high-level ICE.
 The result is a queue of Cfg pointers.  The parser thread incrementally adds a
 Cfg pointer to the queue after the Cfg is created, and then moves on to parse
 the next function.

 Multiple translation worker threads draw from the queue of Cfg pointers as they
 are added to the queue, such that several functions can be translated in parallel.
 The result is a queue of assembler buffers, each of which consists of machine code
 plus fixups.

 A single thread is responsible for writing the assembler buffers to an ELF file.
 It consumes the assembler buffers from the queue that the translation threads
 write to.

 This means that Cfgs are created by the parser thread and destroyed by the
 translation thread (including Cfg nodes, instructions, and most kinds of
 operands), and assembler buffers are created by the translation thread and
 destroyed by the writer thread.

 Deterministic execution
 ^^^^^^^^^^^^^^^^^^^^^^^

 Although code randomization is a key aspect of security, deterministic and
 repeatable translation is sometimes needed, e.g. for regression testing.
 Multithreaded translation introduces potential for randomness that may need to
 be made deterministic.

 * Bitcode parsing is sequential, so it's easy to use a FIFO queue to keep the
   translation queue in deterministic order.  But since translation is
   multithreaded, FIFO order for the assembler buffer queue may not be
   deterministic.  The writer thread would be responsible for reordering the
   buffers, potentially waiting for slower translations to complete even if other
   assembler buffers are available.

 * Different translation threads may add new constant pool entries at different
   times.  Some constant pool entries are emitted as read-only data.  This
   includes floating-point constants for x86, as well as integer immediate
   randomization through constant pooling.  These constant pool entries are
   emitted after all assembler buffers have been written.  The writer needs to be
   able to sort them deterministically before emitting them.

 Object lifetimes
 ----------------

 Objects of type Constant, or a subclass of Constant, are pooled globally.  The
 pooling is managed by the GlobalContext class.  Since Constants are added or
 looked up by translation threads and the parser thread, access to the constant
 pools, as well as GlobalContext in general, need to be arbitrated by locks.
 (It's possible that if there's too much contention, we can maintain a
 thread-local cache for Constant pool lookups.)  Constants live across all
 function translations, and are destroyed only at the end.

 Several object types are scoped within the lifetime of the Cfg.  These include
 CfgNode, Inst, Variable, and any target-specific subclasses of Inst and Operand.
 When the Cfg is destroyed, these scoped objects are destroyed as well.  To keep
 this cheap, the Cfg includes a slab allocator from which these objects are
 allocated, and the objects should not contain fields with non-trivial
 destructors.  Most of these fields are POD, but in a couple of cases these
 fields are STL containers.  We deal with this, and avoid leaking memory, by
 providing the container with an allocator that uses the Cfg-local slab
 allocator.  Since the container allocator generally needs to be stateless, we
 store a pointer to the slab allocator in thread-local storage (TLS).  This is
 straightforward since on any of the threads, only one Cfg is active at a time,
 and a given Cfg is only active in one thread at a time (either the parser
 thread, or at most one translation thread, or the writer thread).

 Even though there is a one-to-one correspondence between Cfgs and assembler
 buffers, they need to use different allocators.  This is because the translation
 thread wants to destroy the Cfg and reclaim all its memory after translation
 completes, but possibly before the assembly buffer is written to the ELF file.
 Ownership of the assembler buffer and its allocator are transferred to the
 writer thread after translation completes, similar to the way ownership of the
 Cfg and its allocator are transferred to the translation thread after parsing
 completes.

 Allocators and TLS
 ------------------

 Part of the Cfg building, and transformations on the Cfg, include STL container
 operations which may need to allocate additional memory in a stateless fashion.
 This requires maintaining the proper slab allocator pointer in TLS.

 When the parser thread creates a new Cfg object, it puts a pointer to the Cfg's
 slab allocator into its own TLS.  This is used as the Cfg is built within the
 parser thread.  After the Cfg is built, the parser thread clears its allocator
 pointer, adds the new Cfg pointer to the translation queue, continues with the
 next function.

 When the translation thread grabs a new Cfg pointer, it installs the Cfg's slab
 allocator into its TLS and translates the function.  When generating the
 assembly buffer, it must take care not to use the Cfg's slab allocator.  If
 there is a slab allocator for the assembler buffer, a pointer to it can also be
 installed in TLS if needed.

 The translation thread destroys the Cfg when it is done translating, including
 the Cfg's slab allocator, and clears the allocator pointer from its TLS.
 Likewise, the writer thread destroys the assembler buffer when it is finished
 with it.

 Thread safety
 -------------

 The parse/translate/write stages of the translation pipeline are fairly
 independent, with little opportunity for threads to interfere.  The Subzero
 design calls for all shared accesses to go through the GlobalContext, which adds
 locking as appropriate.  This includes the coarse-grain work queues for Cfgs and
 assembler buffers.  It also includes finer-grain access to constant pool
 entries, as well as output streams for verbose debugging output.

 If locked access to constant pools becomes a bottleneck, we can investigate
 thread-local caches of constants (as mentioned earlier).  Also, it should be
 safe though slightly less efficient to allow duplicate copies of constants
 across threads (which could be de-dupped by the writer at the end).

 We will use ThreadSanitizer as a way to detect potential data races in the
 implementation.
	Object allocation and lifetime in ICE
	=====================================

	This document discusses object lifetime and scoping issues, starting with
	bitcode parsing and ending with ELF file emission.

	Multithreaded translation model
	-------------------------------

	A single thread is responsible for parsing PNaCl bitcode (possibly concurrently
	with downloading the bitcode file) and constructing the initial high-level ICE.
	The result is a queue of Cfg pointers. The parser thread incrementally adds a
	Cfg pointer to the queue after the Cfg is created, and then moves on to parse
	the next function.

	Multiple translation worker threads draw from the queue of Cfg pointers as they
	are added to the queue, such that several functions can be translated in parallel.
	The result is a queue of assembler buffers, each of which consists of machine code
	plus fixups.

	A single thread is responsible for writing the assembler buffers to an ELF file.
	It consumes the assembler buffers from the queue that the translation threads
	write to.

	This means that Cfgs are created by the parser thread and destroyed by the
	translation thread (including Cfg nodes, instructions, and most kinds of
	operands), and assembler buffers are created by the translation thread and
	destroyed by the writer thread.

	Deterministic execution
	^^^^^^^^^^^^^^^^^^^^^^^

	Although code randomization is a key aspect of security, deterministic and
	repeatable translation is sometimes needed, e.g. for regression testing.
	Multithreaded translation introduces potential for randomness that may need to
	be made deterministic.

	* Bitcode parsing is sequential, so it's easy to use a FIFO queue to keep the
	translation queue in deterministic order. But since translation is
	multithreaded, FIFO order for the assembler buffer queue may not be
	deterministic. The writer thread would be responsible for reordering the
	buffers, potentially waiting for slower translations to complete even if other
	assembler buffers are available.

	* Different translation threads may add new constant pool entries at different
	times. Some constant pool entries are emitted as read-only data. This
	includes floating-point constants for x86, as well as integer immediate
	randomization through constant pooling. These constant pool entries are
	emitted after all assembler buffers have been written. The writer needs to be
	able to sort them deterministically before emitting them.

	Object lifetimes
	----------------

	Objects of type Constant, or a subclass of Constant, are pooled globally. The
	pooling is managed by the GlobalContext class. Since Constants are added or
	looked up by translation threads and the parser thread, access to the constant
	pools, as well as GlobalContext in general, need to be arbitrated by locks.
	(It's possible that if there's too much contention, we can maintain a
	thread-local cache for Constant pool lookups.) Constants live across all
	function translations, and are destroyed only at the end.

	Several object types are scoped within the lifetime of the Cfg. These include
	CfgNode, Inst, Variable, and any target-specific subclasses of Inst and Operand.
	When the Cfg is destroyed, these scoped objects are destroyed as well. To keep
	this cheap, the Cfg includes a slab allocator from which these objects are
	allocated, and the objects should not contain fields with non-trivial
	destructors. Most of these fields are POD, but in a couple of cases these
	fields are STL containers. We deal with this, and avoid leaking memory, by
	providing the container with an allocator that uses the Cfg-local slab
	allocator. Since the container allocator generally needs to be stateless, we
	store a pointer to the slab allocator in thread-local storage (TLS). This is
	straightforward since on any of the threads, only one Cfg is active at a time,
	and a given Cfg is only active in one thread at a time (either the parser
	thread, or at most one translation thread, or the writer thread).

	Even though there is a one-to-one correspondence between Cfgs and assembler
	buffers, they need to use different allocators. This is because the translation
	thread wants to destroy the Cfg and reclaim all its memory after translation
	completes, but possibly before the assembly buffer is written to the ELF file.
	Ownership of the assembler buffer and its allocator are transferred to the
	writer thread after translation completes, similar to the way ownership of the
	Cfg and its allocator are transferred to the translation thread after parsing
	completes.

	Allocators and TLS
	------------------

	Part of the Cfg building, and transformations on the Cfg, include STL container
	operations which may need to allocate additional memory in a stateless fashion.
	This requires maintaining the proper slab allocator pointer in TLS.

	When the parser thread creates a new Cfg object, it puts a pointer to the Cfg's
	slab allocator into its own TLS. This is used as the Cfg is built within the
	parser thread. After the Cfg is built, the parser thread clears its allocator
	pointer, adds the new Cfg pointer to the translation queue, continues with the
	next function.

	When the translation thread grabs a new Cfg pointer, it installs the Cfg's slab
	allocator into its TLS and translates the function. When generating the
	assembly buffer, it must take care not to use the Cfg's slab allocator. If
	there is a slab allocator for the assembler buffer, a pointer to it can also be
	installed in TLS if needed.

	The translation thread destroys the Cfg when it is done translating, including
	the Cfg's slab allocator, and clears the allocator pointer from its TLS.
	Likewise, the writer thread destroys the assembler buffer when it is finished
	with it.

	Thread safety
	-------------

	The parse/translate/write stages of the translation pipeline are fairly
	independent, with little opportunity for threads to interfere. The Subzero
	design calls for all shared accesses to go through the GlobalContext, which adds
	locking as appropriate. This includes the coarse-grain work queues for Cfgs and
	assembler buffers. It also includes finer-grain access to constant pool
	entries, as well as output streams for verbose debugging output.

	If locked access to constant pools becomes a bottleneck, we can investigate
	thread-local caches of constants (as mentioned earlier). Also, it should be
	safe though slightly less efficient to allow duplicate copies of constants
	across threads (which could be de-dupped by the writer at the end).

	We will use ThreadSanitizer as a way to detect potential data races in the
	implementation.