| ====================================== |
| XRay Flight Data Recorder Trace Format |
| ====================================== |
| |
| :Version: 1 as of 2017-07-20 |
| |
| .. contents:: |
| :local: |
| |
| |
| Introduction |
| ============ |
| |
| When gathering XRay traces in Flight Data Recorder mode, each thread of an |
| application will claim buffers to fill with trace data, which at some point |
| is finalized and flushed. |
| |
| A goal of the profiler is to minimize overhead, the flushed data directly |
| corresponds to the buffer. |
| |
| This document describes the format of a trace file. |
| |
| |
| General |
| ======= |
| |
| Each trace file corresponds to a sequence of events in a particular thread. |
| |
| The file has a header followed by a sequence of discriminated record types. |
| |
| The endianness of byte fields matches the endianess of the platform which |
| produced the trace file. |
| |
| |
| Header Section |
| ============== |
| |
| A trace file begins with a 32 byte header. |
| |
| +-------------------+-----------------+----------------------------------------+ |
| | Field | Size (bytes) | Description | |
| +===================+=================+========================================+ |
| | version | ``2`` | Anticipates versioned readers. This | |
| | | | document describes the format when | |
| | | | version == 1 | |
| +-------------------+-----------------+----------------------------------------+ |
| | type | ``2`` | An enumeration encoding the type of | |
| | | | trace. Flight Data Recorder mode | |
| | | | traces have type == 1 | |
| +-------------------+-----------------+----------------------------------------+ |
| | bitfield | ``4`` | Holds parameters that are not aligned | |
| | | | to bytes. Further described below. | |
| +-------------------+-----------------+----------------------------------------+ |
| | cycle_frequency | ``8`` | The frequency in hertz of the CPU | |
| | | | oscillator used to measure duration of | |
| | | | events in ticks. | |
| +-------------------+-----------------+----------------------------------------+ |
| | buffer_size | ``8`` | The size in bytes of the data portion | |
| | | | of the trace following the header. | |
| +-------------------+-----------------+----------------------------------------+ |
| | reserved | ``8`` | Reserved for future use. | |
| +-------------------+-----------------+----------------------------------------+ |
| |
| The bitfield parameter of the file header is composed of the following fields. |
| |
| +-------------------+----------------+-----------------------------------------+ |
| | Field | Size (bits) | Description | |
| +===================+================+=========================================+ |
| | constant_tsc | ``1`` | Whether the platform's timestamp | |
| | | | counter used to record ticks between | |
| | | | events ticks at a constant frequency | |
| | | | despite CPU frequency changes. | |
| | | | 0 == non-constant. 1 == constant. | |
| +-------------------+----------------+-----------------------------------------+ |
| | nonstop_tsc | ``1`` | Whether the tsc continues to count | |
| | | | despite whether the CPU is in a low | |
| | | | power state. 0 == stop. 1 == non-stop. | |
| +-------------------+----------------+-----------------------------------------+ |
| | reserved | ``30`` | Not meaningful. | |
| +-------------------+----------------+-----------------------------------------+ |
| |
| |
| Data Section |
| ============ |
| |
| Following the header in a trace is a data section with size matching the |
| buffer_size field in the header. |
| |
| The data section is a stream of elements of different types. |
| |
| There are a few categories of data in the sequence. |
| |
| - ``Function Records``: Function Records contain the timing of entry into and |
| exit from function execution. Function Records have 8 bytes each. |
| |
| - ``Metadata Records``: Metadata records serve many purposes. Mostly, they |
| capture information that may be too costly to record for each function, but |
| that is required to contextualize the fine-grained timings. They also are used |
| as markers for user-defined Event Data payloads. Metadata records have 16 |
| bytes each. |
| |
| - ``Event Data``: Free form data may be associated with events that are traced |
| by the binary and encode data defined by a handler function. Event data is |
| always preceded with a marker record which indicates how large it is. |
| |
| - ``Function Arguments``: The arguments to some functions are included in the |
| trace. These are either pointer addresses or primitives that are read and |
| logged independently of their types in a high level language. To the tracer, |
| they are all numbers. Function Records that have attached arguments will |
| indicate their presence on the function entry record. We only support logging |
| contiguous function argument sequences starting with argument zero, which will |
| be the "this" pointer for member function invocations. For example, we don't |
| support logging the first and third argument. |
| |
| A reader of the memory format must maintain a state machine. The format makes no |
| attempt to pad for alignment, and it is not seekable. |
| |
| |
| Function Records |
| ---------------- |
| |
| Function Records have an 8 byte layout. This layout encodes information to |
| reconstruct a call stack of instrumented function and their durations. |
| |
| +---------------+--------------+-----------------------------------------------+ |
| | Field | Size (bits) | Description | |
| +===============+==============+===============================================+ |
| | discriminant | ``1`` | Indicates whether a reader should read a | |
| | | | Function or Metadata record. Set to ``0`` for | |
| | | | Function records. | |
| +---------------+--------------+-----------------------------------------------+ |
| | action | ``3`` | Specifies whether the function is being | |
| | | | entered, exited, or is a non-standard entry | |
| | | | or exit produced by optimizations. | |
| +---------------+--------------+-----------------------------------------------+ |
| | function_id | ``28`` | A numeric ID for the function. Resolved to a | |
| | | | name via the xray instrumentation map. The | |
| | | | instrumentation map is built by xray at | |
| | | | compile time into an object file and pairs | |
| | | | the function ids to addresses. It is used for | |
| | | | patching and as a lookup into the binary's | |
| | | | symbols to obtain names. | |
| +---------------+--------------+-----------------------------------------------+ |
| | tsc_delta | ``32`` | The number of ticks of the timestamp counter | |
| | | | since a previous record recorded a delta or | |
| | | | other TSC resetting event. | |
| +---------------+--------------+-----------------------------------------------+ |
| |
| On little-endian machines, the bitfields are ordered from least significant bit |
| bit to most significant bit. A reader can read an 8 bit value and apply the mask |
| ``0x01`` for the discriminant. Similarly, they can read 32 bits and unsigned |
| shift right by ``0x04`` to obtain the function_id field. |
| |
| On big-endian machine, the bitfields are written in order from most significant |
| bit to least significant bit. A reader would read an 8 bit value and unsigned |
| shift right by 7 bits for the discriminant. The function_id field could be |
| obtained by reading a 32 bit value and applying the mask ``0x0FFFFFFF``. |
| |
| Function action types are as follows. |
| |
| +---------------+--------------+-----------------------------------------------+ |
| | Type | Number | Description | |
| +===============+==============+===============================================+ |
| | Entry | ``0`` | Typical function entry. | |
| +---------------+--------------+-----------------------------------------------+ |
| | Exit | ``1`` | Typical function exit. | |
| +---------------+--------------+-----------------------------------------------+ |
| | Tail_Exit | ``2`` | An exit from a function due to tail call | |
| | | | optimization. | |
| +---------------+--------------+-----------------------------------------------+ |
| | Entry_Args | ``3`` | A function entry that records arguments. | |
| +---------------+--------------+-----------------------------------------------+ |
| |
| Entry_Args records do not contain the arguments themselves. Instead, metadata |
| records for each of the logged args follow the function record in the stream. |
| |
| |
| Metadata Records |
| ---------------- |
| |
| Interspersed throughout the buffer are 16 byte Metadata records. For typically |
| instrumented binaries, they will be sparser than Function records, and they |
| provide a fuller picture of the binary execution state. |
| |
| Metadata record layout is partially record dependent, but they share a common |
| structure. |
| |
| The same bit field rules described for function records apply to the first byte |
| of MetadataRecords. Within this byte, little endian machines use lsb to msb |
| ordering and big endian machines use msb to lsb ordering. |
| |
| +---------------+--------------+-----------------------------------------------+ |
| | Field | Size | Description | |
| +===============+==============+===============================================+ |
| | discriminant | ``1 bit`` | Indicates whether a reader should read a | |
| | | | Function or Metadata record. Set to ``1`` for | |
| | | | Metadata records. | |
| +---------------+--------------+-----------------------------------------------+ |
| | record_kind | ``7 bits`` | The type of Metadata record. | |
| +---------------+--------------+-----------------------------------------------+ |
| | data | ``15 bytes`` | A data field used differently for each record | |
| | | | type. | |
| +---------------+--------------+-----------------------------------------------+ |
| |
| Here is a table of the enumerated record kinds. |
| |
| +--------+---------------------------+ |
| | Number | Type | |
| +========+===========================+ |
| | 0 | NewBuffer | |
| +--------+---------------------------+ |
| | 1 | EndOfBuffer | |
| +--------+---------------------------+ |
| | 2 | NewCPUId | |
| +--------+---------------------------+ |
| | 3 | TSCWrap | |
| +--------+---------------------------+ |
| | 4 | WallTimeMarker | |
| +--------+---------------------------+ |
| | 5 | CustomEventMarker | |
| +--------+---------------------------+ |
| | 6 | CallArgument | |
| +--------+---------------------------+ |
| |
| |
| NewBuffer Records |
| ----------------- |
| |
| Each buffer begins with a NewBuffer record immediately after the header. |
| It records the thread ID of the thread that the trace belongs to. |
| |
| Its data segment is as follows. |
| |
| +---------------+--------------+-----------------------------------------------+ |
| | Field | Size (bytes) | Description | |
| +===============+==============+===============================================+ |
| | thread_Id | ``2`` | Thread ID for buffer. | |
| +---------------+--------------+-----------------------------------------------+ |
| | reserved | ``13`` | Unused. | |
| +---------------+--------------+-----------------------------------------------+ |
| |
| |
| WallClockTime Records |
| --------------------- |
| |
| Following the NewBuffer record, each buffer records an absolute time as a frame |
| of reference for the durations recorded by timestamp counter deltas. |
| |
| Its data segment is as follows. |
| |
| +---------------+--------------+-----------------------------------------------+ |
| | Field | Size (bytes) | Description | |
| +===============+==============+===============================================+ |
| | seconds | ``8`` | Seconds on absolute timescale. The starting | |
| | | | point is unspecified and depends on the | |
| | | | implementation and platform configured by the | |
| | | | tracer. | |
| +---------------+--------------+-----------------------------------------------+ |
| | microseconds | ``4`` | The microsecond component of the time. | |
| +---------------+--------------+-----------------------------------------------+ |
| | reserved | ``3`` | Unused. | |
| +---------------+--------------+-----------------------------------------------+ |
| |
| |
| NewCpuId Records |
| ---------------- |
| |
| Each function entry invokes a routine to determine what CPU is executing. |
| Typically, this is done with readtscp, which reads the timestamp counter at the |
| same time. |
| |
| If the tracing detects that the execution has switched CPUs or if this is the |
| first instrumented entry point, the tracer will output a NewCpuId record. |
| |
| Its data segment is as follows. |
| |
| +---------------+--------------+-----------------------------------------------+ |
| | Field | Size (bytes) | Description | |
| +===============+==============+===============================================+ |
| | cpu_id | ``2`` | CPU Id. | |
| +---------------+--------------+-----------------------------------------------+ |
| | absolute_tsc | ``8`` | The absolute value of the timestamp counter. | |
| +---------------+--------------+-----------------------------------------------+ |
| | reserved | ``5`` | Unused. | |
| +---------------+--------------+-----------------------------------------------+ |
| |
| |
| TSCWrap Records |
| --------------- |
| |
| Since each function record uses a 32 bit value to represent the number of ticks |
| of the timestamp counter since the last reference, it is possible for this value |
| to overflow, particularly for sparsely instrumented binaries. |
| |
| When this delta would not fit into a 32 bit representation, a reference absolute |
| timestamp counter record is written in the form of a TSCWrap record. |
| |
| Its data segment is as follows. |
| |
| +---------------+--------------+-----------------------------------------------+ |
| | Field | Size (bytes) | Description | |
| +===============+==============+===============================================+ |
| | absolute_tsc | ``8`` | Timestamp counter value. | |
| +---------------+--------------+-----------------------------------------------+ |
| | reserved | ``7`` | Unused. | |
| +---------------+--------------+-----------------------------------------------+ |
| |
| |
| CallArgument Records |
| -------------------- |
| |
| Immediately following an Entry_Args type function record, there may be one or |
| more CallArgument records that contain the traced function's parameter values. |
| |
| The order of the CallArgument Record sequency corresponds one to one with the |
| order of the function parameters. |
| |
| CallArgument data segment: |
| |
| +---------------+--------------+-----------------------------------------------+ |
| | Field | Size (bytes) | Description | |
| +===============+==============+===============================================+ |
| | argument | ``8`` | Numeric argument (may be pointer address). | |
| +---------------+--------------+-----------------------------------------------+ |
| | reserved | ``7`` | Unused. | |
| +---------------+--------------+-----------------------------------------------+ |
| |
| |
| CustomEventMarker Records |
| ------------------------- |
| |
| XRay provides the feature of logging custom events. This may be leveraged to |
| record tracing info for RPCs or similarly trace data that is application |
| specific. |
| |
| Custom Events themselves are an unstructured (application defined) segment of |
| memory with arbitrary size within the buffer. They are preceded by |
| CustomEventMarkers to indicate their presence and size. |
| |
| CustomEventMarker data segment: |
| |
| +---------------+--------------+-----------------------------------------------+ |
| | Field | Size (bytes) | Description | |
| +===============+==============+===============================================+ |
| | event_size | ``4`` | Size of preceded event. | |
| +---------------+--------------+-----------------------------------------------+ |
| | absolute_tsc | ``8`` | A timestamp counter of the event. | |
| +---------------+--------------+-----------------------------------------------+ |
| | reserved | ``3`` | Unused. | |
| +---------------+--------------+-----------------------------------------------+ |
| |
| |
| EndOfBuffer Records |
| ------------------- |
| |
| An EndOfBuffer record type indicates that there is no more trace data in this |
| buffer. The reader is expected to seek past the remaining buffer_size expressed |
| before the start of buffer and look for either another header or EOF. |
| |
| |
| Format Grammar and Invariants |
| ============================= |
| |
| Not all sequences of Metadata records and Function records are valid data. A |
| sequence should be parsed as a state machine. The expectations for a valid |
| format can be expressed as a context free grammar. |
| |
| This is an attempt to explain the format with statements in EBNF format. |
| |
| - Format := Header ThreadBuffer* EOF |
| |
| - ThreadBuffer := NewBuffer WallClockTime NewCPUId BodySequence* End |
| |
| - BodySequence := NewCPUId | TSCWrap | Function | CustomEvent |
| |
| - Function := (Function_Entry_Args CallArgument*) | Function_Other_Type |
| |
| - CustomEvent := CustomEventMarker CustomEventUnstructuredMemory |
| |
| - End := EndOfBuffer RemainingBufferSizeToSkip |
| |
| |
| Function Record Order |
| --------------------- |
| |
| There are a few clarifications that may help understand what is expected of |
| Function records. |
| |
| - Functions with an Exit are expected to have a corresponding Entry or |
| Entry_Args function record precede them in the trace. |
| |
| - Tail_Exit Function records record the Function ID of the function whose return |
| address the program counter will take. In other words, the final function that |
| would be popped off of the call stack if tail call optimization was not used. |
| |
| - Not all functions marked for instrumentation are necessarily in the trace. The |
| tracer uses heuristics to preserve the trace for non-trivial functions. |
| |
| - Not every entry must have a traced Exit or Tail Exit. The buffer may run out |
| of space or the program may request for the tracer to finalize toreturn the |
| buffer before an instrumented function exits. |