|  | # reftable | 
|  |  | 
|  | [TOC] | 
|  |  | 
|  | ## Overview | 
|  |  | 
|  | ### Problem statement | 
|  |  | 
|  | Some repositories contain a lot of references (e.g.  android at 866k, | 
|  | rails at 31k).  The existing packed-refs format takes up a lot of | 
|  | space (e.g.  62M), and does not scale with additional references. | 
|  | Lookup of a single reference requires linearly scanning the file. | 
|  |  | 
|  | Atomic pushes modifying multiple references require copying the | 
|  | entire packed-refs file, which can be a considerable amount of data | 
|  | moved (e.g. 62M in, 62M out) for even small transactions (2 refs | 
|  | modified). | 
|  |  | 
|  | Repositories with many loose references occupy a large number of disk | 
|  | blocks from the local file system, as each reference is its own file | 
|  | storing 41 bytes (and another file for the corresponding reflog). | 
|  | This negatively affects the number of inodes available when a large | 
|  | number of repositories are stored on the same filesystem.  Readers can | 
|  | be penalized due to the larger number of syscalls required to traverse | 
|  | and read the `$GIT_DIR/refs` directory. | 
|  |  | 
|  | ### Objectives | 
|  |  | 
|  | - Near constant time lookup for any single reference, even when the | 
|  | repository is cold and not in process or kernel cache. | 
|  | - Near constant time verification if a SHA-1 is referred to by at | 
|  | least one reference (for allow-tip-sha1-in-want). | 
|  | - Efficient lookup of an entire namespace, such as `refs/tags/`. | 
|  | - Support atomic push with `O(size_of_update)` operations. | 
|  | - Combine reflog storage with ref storage for small transactions. | 
|  | - Separate reflog storage for base refs and historical logs. | 
|  |  | 
|  | ### Description | 
|  |  | 
|  | A reftable file is a portable binary file format customized for | 
|  | reference storage. References are sorted, enabling linear scans, | 
|  | binary search lookup, and range scans. | 
|  |  | 
|  | Storage in the file is organized into variable sized blocks.  Prefix | 
|  | compression is used within a single block to reduce disk space.  Block | 
|  | size and alignment is tunable by the writer. | 
|  |  | 
|  | ### Performance | 
|  |  | 
|  | Space used, packed-refs vs. reftable: | 
|  |  | 
|  | repository | packed-refs | reftable | % original | avg ref  | avg obj | 
|  | -----------|------------:|---------:|-----------:|---------:|--------: | 
|  | android    |      62.2 M |   36.1 M |     58.0%  | 33 bytes | 5 bytes | 
|  | rails      |       1.8 M |    1.1 M |     57.7%  | 29 bytes | 4 bytes | 
|  | git        |      78.7 K |   48.1 K |     61.0%  | 50 bytes | 4 bytes | 
|  | git (heads)|       332 b |    269 b |     81.0%  | 33 bytes | 0 bytes | 
|  |  | 
|  | Scan (read 866k refs), by reference name lookup (single ref from 866k | 
|  | refs), and by SHA-1 lookup (refs with that SHA-1, from 866k refs): | 
|  |  | 
|  | format      | cache | scan    | by name        | by SHA-1 | 
|  | ------------|------:|--------:|---------------:|---------------: | 
|  | packed-refs | cold  |  402 ms | 409,660.1 usec | 412,535.8 usec | 
|  | packed-refs | hot   |         |   6,844.6 usec |  20,110.1 usec | 
|  | reftable    | cold  |  112 ms |      33.9 usec |     323.2 usec | 
|  | reftable    | hot   |         |      20.2 usec |     320.8 usec | 
|  |  | 
|  | Space used for 149,932 log entries for 43,061 refs, | 
|  | reflog vs. reftable: | 
|  |  | 
|  | format        | size  | avg entry | 
|  | --------------|------:|-----------: | 
|  | $GIT_DIR/logs | 173 M | 1209 bytes | 
|  | reftable      |   5 M |   37 bytes | 
|  |  | 
|  | ## Details | 
|  |  | 
|  | ### Peeling | 
|  |  | 
|  | References stored in a reftable are peeled, a record for an annotated | 
|  | (or signed) tag records both the tag object, and the object it refers | 
|  | to. | 
|  |  | 
|  | ### Reference name encoding | 
|  |  | 
|  | Reference names are an uninterpreted sequence of bytes that must pass | 
|  | [git-check-ref-format][ref-fmt] as a valid reference name. | 
|  |  | 
|  | [ref-fmt]: https://git-scm.com/docs/git-check-ref-format | 
|  |  | 
|  | ### Key unicity | 
|  |  | 
|  | Each entry must have a unique key; repeated keys are disallowed. | 
|  |  | 
|  | ### Network byte order | 
|  |  | 
|  | All multi-byte, fixed width fields are in network byte order. | 
|  |  | 
|  | ### Ordering | 
|  |  | 
|  | Blocks are lexicographically ordered by their first reference. | 
|  |  | 
|  | ### Directory/file conflicts | 
|  |  | 
|  | The reftable format accepts both `refs/heads/foo` and | 
|  | `refs/heads/foo/bar` as distinct references. | 
|  |  | 
|  | This property is useful for retaining log records in reftable, but may | 
|  | confuse versions of Git using `$GIT_DIR/refs` directory tree to | 
|  | maintain references.  Users of reftable may choose to continue to | 
|  | reject `foo` and `foo/bar` type conflicts to prevent problems for | 
|  | peers. | 
|  |  | 
|  | ## File format | 
|  |  | 
|  | ### Structure | 
|  |  | 
|  | A reftable file has the following high-level structure: | 
|  |  | 
|  | first_block { | 
|  | header | 
|  | first_ref_block | 
|  | } | 
|  | ref_block* | 
|  | ref_index* | 
|  | obj_block* | 
|  | obj_index* | 
|  | log_block* | 
|  | log_index* | 
|  | footer | 
|  |  | 
|  | A log-only file omits the `ref_block`, `ref_index`, `obj_block` and | 
|  | `obj_index` sections, containing only the file header and log block: | 
|  |  | 
|  | first_block { | 
|  | header | 
|  | } | 
|  | log_block* | 
|  | log_index* | 
|  | footer | 
|  |  | 
|  | in a log-only file the first log block immediately follows the file | 
|  | header, without padding to block alignment. | 
|  |  | 
|  | ### Block size | 
|  |  | 
|  | The file's block size is arbitrarily determined by the writer, and | 
|  | does not have to be a power of 2.  The block size must be larger than | 
|  | the longest reference name or log entry used in the repository, as | 
|  | references cannot span blocks. | 
|  |  | 
|  | Powers of two that are friendly to the virtual memory system or | 
|  | filesystem (such as 4k or 8k) are recommended.  Larger sizes (64k) can | 
|  | yield better compression, with a possible increased cost incurred by | 
|  | readers during access. | 
|  |  | 
|  | The largest block size is `16777215` bytes (15.99 MiB). | 
|  |  | 
|  | ### Block alignment | 
|  |  | 
|  | Writers may choose to align blocks at multiples of the block size by | 
|  | including `padding` filled with NUL bytes at the end of a block to | 
|  | round out to the chosen alignment.  When alignment is used, writers | 
|  | must specify the alignment with the file header's `block_size` field. | 
|  |  | 
|  | Block alignment is not required by the file format.  Unaligned files | 
|  | must set `block_size = 0` in the file header, and omit `padding`. | 
|  | Unaligned files with more than one ref block must include the | 
|  | [ref index](#Ref-index) to support fast lookup.  Readers must be | 
|  | able to read both aligned and non-aligned files. | 
|  |  | 
|  | Very small files (e.g. 1 only ref block) may omit `padding` and the | 
|  | ref index to reduce total file size. | 
|  |  | 
|  | ### Header | 
|  |  | 
|  | A 24-byte header appears at the beginning of the file: | 
|  |  | 
|  | 'REFT' | 
|  | uint8( version_number = 1 ) | 
|  | uint24( block_size ) | 
|  | uint64( min_update_index ) | 
|  | uint64( max_update_index ) | 
|  |  | 
|  | Aligned files must specify `block_size` to configure readers with the | 
|  | expected block alignment.  Unaligned files must set `block_size = 0`. | 
|  |  | 
|  | The `min_update_index` and `max_update_index` describe bounds for the | 
|  | `update_index` field of all log records in this file.  When reftables | 
|  | are used in a stack for [transactions](#Update-transactions), these | 
|  | fields can order the files such that the prior file's | 
|  | `max_update_index + 1` is the next file's `min_update_index`. | 
|  |  | 
|  | ### First ref block | 
|  |  | 
|  | The first ref block shares the same block as the file header, and is | 
|  | 24 bytes smaller than all other blocks in the file.  The first block | 
|  | immediately begins after the file header, at position 24. | 
|  |  | 
|  | If the first block is a log block (a log-only file), its block header | 
|  | begins immediately at position 24. | 
|  |  | 
|  | ### Ref block format | 
|  |  | 
|  | A ref block is written as: | 
|  |  | 
|  | 'r' | 
|  | uint24( block_len ) | 
|  | ref_record+ | 
|  | uint24( restart_offset )+ | 
|  | uint16( restart_count ) | 
|  |  | 
|  | padding? | 
|  |  | 
|  | Blocks begin with `block_type = 'r'` and a 3-byte `block_len` which | 
|  | encodes the number of bytes in the block up to, but not including the | 
|  | optional `padding`.  This is always less than or equal to the file's | 
|  | block size.  In the first ref block, `block_len` includes 24 bytes | 
|  | for the file header. | 
|  |  | 
|  | The 2-byte `restart_count` stores the number of entries in the | 
|  | `restart_offset` list, which must not be empty.  Readers can use | 
|  | `restart_count` to binary search between restarts before starting a | 
|  | linear scan. | 
|  |  | 
|  | Exactly `restart_count` 3-byte `restart_offset` values precedes the | 
|  | `restart_count`.  Offsets are relative to the start of the block and | 
|  | refer to the first byte of any `ref_record` whose name has not been | 
|  | prefix compressed.  Entries in the `restart_offset` list must be | 
|  | sorted, ascending.  Readers can start linear scans from any of these | 
|  | records. | 
|  |  | 
|  | A variable number of `ref_record` fill the middle of the block, | 
|  | describing reference names and values.  The format is described below. | 
|  |  | 
|  | As the first ref block shares the first file block with the file | 
|  | header, all `restart_offset` in the first block are relative to the | 
|  | start of the file (position 0), and include the file header.  This | 
|  | forces the first `restart_offset` to be `28`. | 
|  |  | 
|  | #### ref record | 
|  |  | 
|  | A `ref_record` describes a single reference, storing both the name and | 
|  | its value(s). Records are formatted as: | 
|  |  | 
|  | varint( prefix_length ) | 
|  | varint( (suffix_length << 3) | value_type ) | 
|  | suffix | 
|  | varint( update_index_delta ) | 
|  | value? | 
|  |  | 
|  | The `prefix_length` field specifies how many leading bytes of the | 
|  | prior reference record's name should be copied to obtain this | 
|  | reference's name.  This must be 0 for the first reference in any | 
|  | block, and also must be 0 for any `ref_record` whose offset is listed | 
|  | in the `restart_offset` table at the end of the block. | 
|  |  | 
|  | Recovering a reference name from any `ref_record` is a simple concat: | 
|  |  | 
|  | this_name = prior_name[0..prefix_length] + suffix | 
|  |  | 
|  | The `suffix_length` value provides the number of bytes available in | 
|  | `suffix` to copy from `suffix` to complete the reference name. | 
|  |  | 
|  | The `update_index` that last modified the reference can be obtained by | 
|  | adding `update_index_delta` to the `min_update_index` from the file | 
|  | header: `min_update_index + update_index_delta`. | 
|  |  | 
|  | The `value` follows.  Its format is determined by `value_type`, one of | 
|  | the following: | 
|  |  | 
|  | - `0x0`: deletion; no value data (see transactions, below) | 
|  | - `0x1`: one 20-byte object id; value of the ref | 
|  | - `0x2`: two 20-byte object ids; value of the ref, peeled target | 
|  | - `0x3`: symbolic reference: `varint( target_len ) target` | 
|  |  | 
|  | Symbolic references use `0x3`, followed by the complete name of the | 
|  | reference target.  No compression is applied to the target name. | 
|  |  | 
|  | Types `0x4..0x7` are reserved for future use. | 
|  |  | 
|  | ### Ref index | 
|  |  | 
|  | The ref index stores the name of the last reference from every ref | 
|  | block in the file, enabling reduced disk seeks for lookups.  Any | 
|  | reference can be found by searching the index, identifying the | 
|  | containing block, and searching within that block. | 
|  |  | 
|  | The index may be organized into a multi-level index, where the 1st | 
|  | level index block points to additional ref index blocks (2nd level), | 
|  | which may in turn point to either additional index blocks (e.g. 3rd | 
|  | level) or ref blocks (leaf level).  Disk reads required to access a | 
|  | ref go up with higher index levels.  Multi-level indexes may be | 
|  | required to ensure no single index block exceeds the file format's max | 
|  | block size of `16777215` bytes (15.99 MiB).  To acheive constant O(1) | 
|  | disk seeks for lookups the index must be a single level, which is | 
|  | permitted to exceed the file's configured block size, but not the | 
|  | format's max block size of 15.99 MiB. | 
|  |  | 
|  | If present, the ref index block(s) appears after the last ref block. | 
|  |  | 
|  | If there are at least 4 ref blocks, a ref index block should be | 
|  | written to improve lookup times.  Cold reads using the index require | 
|  | 2 disk reads (read index, read block), and binary searching < 4 blocks | 
|  | also requires <= 2 reads.  Omitting the index block from smaller files | 
|  | saves space. | 
|  |  | 
|  | If the file is unaligned and contains more than one ref block, the ref | 
|  | index must be written. | 
|  |  | 
|  | Index block format: | 
|  |  | 
|  | 'i' | 
|  | uint24( block_len ) | 
|  | index_record+ | 
|  | uint24( restart_offset )+ | 
|  | uint16( restart_count ) | 
|  |  | 
|  | padding? | 
|  |  | 
|  | The index blocks begin with `block_type = 'i'` and a 3-byte | 
|  | `block_len` which encodes the number of bytes in the block, | 
|  | up to but not including the optional `padding`. | 
|  |  | 
|  | The `restart_offset` and `restart_count` fields are identical in | 
|  | format, meaning and usage as in ref blocks. | 
|  |  | 
|  | To reduce the number of reads required for random access in very large | 
|  | files the index block may be larger than other blocks.  However, | 
|  | readers must hold the entire index in memory to benefit from this, so | 
|  | it's a time-space tradeoff in both file size and reader memory. | 
|  |  | 
|  | Increasing the file's block size decreases the index size. | 
|  | Alternatively a multi-level index may be used, keeping index blocks | 
|  | within the file's block size, but increasing the number of blocks | 
|  | that need to be accessed. | 
|  |  | 
|  | #### index record | 
|  |  | 
|  | An index record describes the last entry in another block. | 
|  | Index records are written as: | 
|  |  | 
|  | varint( prefix_length ) | 
|  | varint( (suffix_length << 3) | 0 ) | 
|  | suffix | 
|  | varint( block_position ) | 
|  |  | 
|  | Index records use prefix compression exactly like `ref_record`. | 
|  |  | 
|  | Index records store `block_position` after the suffix, specifying the | 
|  | absolute position in bytes (from the start of the file) of the block | 
|  | that ends with this reference. Readers can seek to `block_position` to | 
|  | begin reading the block header. | 
|  |  | 
|  | Readers must examine the block header at `block_position` to determine | 
|  | if the next block is another level index block, or the leaf-level ref | 
|  | block. | 
|  |  | 
|  | #### Reading the index | 
|  |  | 
|  | Readers loading the ref index must first read the footer (below) to | 
|  | obtain `ref_index_position`. If not present, the position will be 0. | 
|  | The `ref_index_position` is for the 1st level root of the ref index. | 
|  |  | 
|  | ### Obj block format | 
|  |  | 
|  | Object blocks are optional.  Writers may choose to omit object blocks, | 
|  | especially if readers will not use the SHA-1 to ref mapping. | 
|  |  | 
|  | Object blocks use unique, abbreviated 2-20 byte SHA-1 keys, mapping | 
|  | to ref blocks containing references pointing to that object directly, | 
|  | or as the peeled value of an annotated tag.  Like ref blocks, object | 
|  | blocks use the file's standard block size. The abbrevation length is | 
|  | available in the footer as `obj_id_len`. | 
|  |  | 
|  | To save space in small files, object blocks may be omitted if the ref | 
|  | index is not present, as brute force search will only need to read a | 
|  | few ref blocks.  When missing, readers should brute force a linear | 
|  | search of all references to lookup by SHA-1. | 
|  |  | 
|  | An object block is written as: | 
|  |  | 
|  | 'o' | 
|  | uint24( block_len ) | 
|  | obj_record+ | 
|  | uint24( restart_offset )+ | 
|  | uint16( restart_count ) | 
|  |  | 
|  | padding? | 
|  |  | 
|  | Fields are identical to ref block.  Binary search using the restart | 
|  | table works the same as in reference blocks. | 
|  |  | 
|  | Because object identifiers are abbreviated by writers to the shortest | 
|  | unique abbreviation within the reftable, obj key lengths are variable | 
|  | between 2 and 20 bytes.  Readers must compare only for common prefix | 
|  | match within an obj block or obj index. | 
|  |  | 
|  | #### obj record | 
|  |  | 
|  | An `obj_record` describes a single object abbreviation, and the blocks | 
|  | containing references using that unique abbreviation: | 
|  |  | 
|  | varint( prefix_length ) | 
|  | varint( (suffix_length << 3) | cnt_3 ) | 
|  | suffix | 
|  | varint( cnt_large )? | 
|  | varint( position_delta )* | 
|  |  | 
|  | Like in reference blocks, abbreviations are prefix compressed within | 
|  | an obj block.  On large reftables with many unique objects, higher | 
|  | block sizes (64k), and higher restart interval (128), a | 
|  | `prefix_length` of 2 or 3 and `suffix_length` of 3 may be common in | 
|  | obj records (unique abbreviation of 5-6 raw bytes, 10-12 hex digits). | 
|  |  | 
|  | Each record contains `position_count` number of positions for matching | 
|  | ref blocks.  For 1-7 positions the count is stored in `cnt_3`.  When | 
|  | `cnt_3 = 0` the actual count follows in a varint, `cnt_large`. | 
|  |  | 
|  | The use of `cnt_3` bets most objects are pointed to by only a single | 
|  | reference, some may be pointed to by a couple of references, and very | 
|  | few (if any) are pointed to by more than 7 references. | 
|  |  | 
|  | A special case exists when `cnt_3 = 0` and `cnt_large = 0`: there | 
|  | are no `position_delta`, but at least one reference starts with this | 
|  | abbreviation.  A reader that needs exact reference names must scan all | 
|  | references to find which specific references have the desired object. | 
|  | Writers should use this format when the `position_delta` list would have | 
|  | overflowed the file's block size due to a high number of references | 
|  | pointing to the same object. | 
|  |  | 
|  | The first `position_delta` is the position from the start of the file. | 
|  | Additional `position_delta` entries are sorted ascending and relative | 
|  | to the prior entry, e.g.  a reader would perform: | 
|  |  | 
|  | pos = position_delta[0] | 
|  | prior = pos | 
|  | for (j = 1; j < position_count; j++) { | 
|  | pos = prior + position_delta[j] | 
|  | prior = pos | 
|  | } | 
|  |  | 
|  | With a position in hand, a reader must linearly scan the ref block, | 
|  | starting from the first `ref_record`, testing each reference's SHA-1s | 
|  | (for `value_type = 0x1` or `0x2`) for full equality.  Faster searching | 
|  | by SHA-1 within a single ref block is not supported by the reftable | 
|  | format.  Smaller block sizes reduce the number of candidates this step | 
|  | must consider. | 
|  |  | 
|  | ### Obj index | 
|  |  | 
|  | The obj index stores the abbreviation from the last entry for every | 
|  | obj block in the file, enabling reduced disk seeks for all lookups. | 
|  | It is formatted exactly the same as the ref index, but refers to obj | 
|  | blocks. | 
|  |  | 
|  | The obj index should be present if obj blocks are present, as | 
|  | obj blocks should only be written in larger files. | 
|  |  | 
|  | Readers loading the obj index must first read the footer (below) to | 
|  | obtain `obj_index_position`.  If not present, the position will be 0. | 
|  |  | 
|  | ### Log block format | 
|  |  | 
|  | Unlike ref and obj blocks, log blocks are always unaligned. | 
|  |  | 
|  | Log blocks are variable in size, and do not match the `block_size` | 
|  | specified in the file header or footer.  Writers should choose an | 
|  | appropriate buffer size to prepare a log block for deflation, such as | 
|  | `2 * block_size`. | 
|  |  | 
|  | A log block is written as: | 
|  |  | 
|  | 'g' | 
|  | uint24( block_len ) | 
|  | zlib_deflate { | 
|  | log_record+ | 
|  | uint24( restart_offset )+ | 
|  | uint16( restart_count ) | 
|  | } | 
|  |  | 
|  | Log blocks look similar to ref blocks, except `block_type = 'g'`. | 
|  |  | 
|  | The 4-byte block header is followed by the deflated block contents | 
|  | using zlib deflate.  The `block_len` in the header is the inflated | 
|  | size (including 4-byte block header), and should be used by readers to | 
|  | preallocate the inflation output buffer.  A log block's `block_len` | 
|  | may exceed the file's block size. | 
|  |  | 
|  | Offsets within the log block (e.g.  `restart_offset`) still include | 
|  | the 4-byte header.  Readers may prefer prefixing the inflation output | 
|  | buffer with the 4-byte header. | 
|  |  | 
|  | Within the deflate container, a variable number of `log_record` | 
|  | describe reference changes.  The log record format is described | 
|  | below.  See ref block format (above) for a description of | 
|  | `restart_offset` and `restart_count`. | 
|  |  | 
|  | Because log blocks have no alignment or padding between blocks, | 
|  | readers must keep track of the bytes consumed by the inflater to | 
|  | know where the next log block begins. | 
|  |  | 
|  | #### log record | 
|  |  | 
|  | Log record keys are structured as: | 
|  |  | 
|  | ref_name '\0' reverse_int64( update_index ) | 
|  |  | 
|  | where `update_index` is the unique transaction identifier.  The | 
|  | `update_index` field must be unique within the scope of a `ref_name`. | 
|  | See the update transactions section below for further details. | 
|  |  | 
|  | The `reverse_int64` function inverses the value so lexographical | 
|  | ordering the network byte order encoding sorts the more recent records | 
|  | with higher `update_index` values first: | 
|  |  | 
|  | reverse_int64(int64 t) { | 
|  | return 0xffffffffffffffff - t; | 
|  | } | 
|  |  | 
|  | Log records have a similar starting structure to ref and index | 
|  | records, utilizing the same prefix compression scheme applied to the | 
|  | log record key described above. | 
|  |  | 
|  | ``` | 
|  | varint( prefix_length ) | 
|  | varint( (suffix_length << 3) | log_type ) | 
|  | suffix | 
|  | log_data { | 
|  | old_id | 
|  | new_id | 
|  | varint( name_length    )  name | 
|  | varint( email_length   )  email | 
|  | varint( time_seconds ) | 
|  | sint16( tz_offset ) | 
|  | varint( message_length )  message | 
|  | }? | 
|  | ``` | 
|  |  | 
|  | Log record entries use `log_type` to indicate what follows: | 
|  |  | 
|  | - `0x0`: deletion; no log data. | 
|  | - `0x1`: standard git reflog data using `log_data` above. | 
|  |  | 
|  | The `log_type = 0x0` is mostly useful for `git stash drop`, removing | 
|  | an entry from the reflog of `refs/stash` in a transaction file | 
|  | (below), without needing to rewrite larger files.  Readers reading a | 
|  | stack of reflogs must treat this as a deletion. | 
|  |  | 
|  | For `log_type = 0x1`, the `log_data` section follows | 
|  | [git update-ref][update-ref] logging, and includes: | 
|  |  | 
|  | - two 20-byte SHA-1s (old id, new id) | 
|  | - varint string of committer's name | 
|  | - varint string of committer's email | 
|  | - varint time in seconds since epoch (Jan 1, 1970) | 
|  | - 2-byte timezone offset in minutes (signed) | 
|  | - varint string of message | 
|  |  | 
|  | `tz_offset` is the absolute number of minutes from GMT the committer | 
|  | was at the time of the update.  For example `GMT-0800` is encoded in | 
|  | reftable as `sint16(-480)` and `GMT+0230` is `sint16(150)`. | 
|  |  | 
|  | The committer email does not contain `<` or `>`, it's the value | 
|  | normally found between the `<>` in a git commit object header. | 
|  |  | 
|  | The `message_length` may be 0, in which case there was no message | 
|  | supplied for the update. | 
|  |  | 
|  | [update-ref]: https://git-scm.com/docs/git-update-ref#_logging_updates | 
|  |  | 
|  | Contrary to traditional reflog (which is a file), renames are encoded as a | 
|  | combination of ref deletion and ref creation. | 
|  |  | 
|  |  | 
|  | #### Reading the log | 
|  |  | 
|  | Readers accessing the log must first read the footer (below) to | 
|  | determine the `log_position`.  The first block of the log begins at | 
|  | `log_position` bytes since the start of the file.  The `log_position` | 
|  | is not block aligned. | 
|  |  | 
|  | #### Importing logs | 
|  |  | 
|  | When importing from `$GIT_DIR/logs` writers should globally order all | 
|  | log records roughly by timestamp while preserving file order, and | 
|  | assign unique, increasing `update_index` values for each log line. | 
|  | Newer log records get higher `update_index` values. | 
|  |  | 
|  | Although an import may write only a single reftable file, the reftable | 
|  | file must span many unique `update_index`, as each log line requires | 
|  | its own `update_index` to preserve semantics. | 
|  |  | 
|  | ### Log index | 
|  |  | 
|  | The log index stores the log key (`refname \0 reverse_int64(update_index)`) | 
|  | for the last log record of every log block in the file, supporting | 
|  | bounded-time lookup. | 
|  |  | 
|  | A log index block must be written if 2 or more log blocks are written | 
|  | to the file.  If present, the log index appears after the last log | 
|  | block.  There is no padding used to align the log index to block | 
|  | alignment. | 
|  |  | 
|  | Log index format is identical to ref index, except the keys are 9 | 
|  | bytes longer to include `'\0'` and the 8-byte | 
|  | `reverse_int64(update_index)`.  Records use `block_position` to | 
|  | refer to the start of a log block. | 
|  |  | 
|  | #### Reading the index | 
|  |  | 
|  | Readers loading the log index must first read the footer (below) to | 
|  | obtain `log_index_position`. If not present, the position will be 0. | 
|  |  | 
|  | ### Footer | 
|  |  | 
|  | After the last block of the file, a file footer is written.  It begins | 
|  | like the file header, but is extended with additional data. | 
|  |  | 
|  | A 68-byte footer appears at the end: | 
|  |  | 
|  | ``` | 
|  | 'REFT' | 
|  | uint8( version_number = 1 ) | 
|  | uint24( block_size ) | 
|  | uint64( min_update_index ) | 
|  | uint64( max_update_index ) | 
|  |  | 
|  | uint64( ref_index_position ) | 
|  | uint64( (obj_position << 5) | obj_id_len ) | 
|  | uint64( obj_index_position ) | 
|  |  | 
|  | uint64( log_position ) | 
|  | uint64( log_index_position ) | 
|  |  | 
|  | uint32( CRC-32 of above ) | 
|  | ``` | 
|  |  | 
|  | If a section is missing (e.g. ref index) the corresponding position | 
|  | field (e.g. `ref_index_position`) will be 0. | 
|  |  | 
|  | - `obj_position`: byte position for the first obj block. | 
|  | - `obj_id_len`: number of bytes used to abbreviate object identifiers | 
|  | in obj blocks. | 
|  | - `log_position`: byte position for the first log block. | 
|  | - `ref_index_position`: byte position for the start of the ref index. | 
|  | - `obj_index_position`: byte position for the start of the obj index. | 
|  | - `log_index_position`: byte position for the start of the log index. | 
|  |  | 
|  | #### Reading the footer | 
|  |  | 
|  | Readers must seek to `file_length - 68` to access the footer.  A | 
|  | trusted external source (such as `stat(2)`) is necessary to obtain | 
|  | `file_length`.  When reading the footer, readers must verify: | 
|  |  | 
|  | - 4-byte magic is correct | 
|  | - 1-byte version number is recognized | 
|  | - 4-byte CRC-32 matches the other 64 bytes (including magic, and version) | 
|  |  | 
|  | Once verified, the other fields of the footer can be accessed. | 
|  |  | 
|  | ### Varint encoding | 
|  |  | 
|  | Varint encoding is identical to the ofs-delta encoding method used | 
|  | within pack files. | 
|  |  | 
|  | Decoder works such as: | 
|  |  | 
|  | val = buf[ptr] & 0x7f | 
|  | while (buf[ptr] & 0x80) { | 
|  | ptr++ | 
|  | val = ((val + 1) << 7) | (buf[ptr] & 0x7f) | 
|  | } | 
|  |  | 
|  | ### Binary search | 
|  |  | 
|  | Binary search within a block is supported by the `restart_offset` | 
|  | fields at the end of the block.  Readers can binary search through the | 
|  | restart table to locate between which two restart points the sought | 
|  | reference or key should appear. | 
|  |  | 
|  | Each record identified by a `restart_offset` stores the complete key | 
|  | in the `suffix` field of the record, making the compare operation | 
|  | during binary search straightforward. | 
|  |  | 
|  | Once a restart point lexicographically before the sought reference has | 
|  | been identified, readers can linearly scan through the following | 
|  | record entries to locate the sought record, terminating if the current | 
|  | record sorts after (and therefore the sought key is not present). | 
|  |  | 
|  | #### Restart point selection | 
|  |  | 
|  | Writers determine the restart points at file creation.  The process is | 
|  | arbitrary, but every 16 or 64 records is recommended.  Every 16 may | 
|  | be more suitable for smaller block sizes (4k or 8k), every 64 for | 
|  | larger block sizes (64k). | 
|  |  | 
|  | More frequent restart points reduces prefix compression and increases | 
|  | space consumed by the restart table, both of which increase file size. | 
|  |  | 
|  | Less frequent restart points makes prefix compression more effective, | 
|  | decreasing overall file size, with increased penalities for readers | 
|  | walking through more records after the binary search step. | 
|  |  | 
|  | A maximum of `65535` restart points per block is supported. | 
|  |  | 
|  | ## Considerations | 
|  |  | 
|  | ### Lightweight refs dominate | 
|  |  | 
|  | The reftable format assumes the vast majority of references are single | 
|  | SHA-1 valued with common prefixes, such as Gerrit Code Review's | 
|  | `refs/changes/` namespace, GitHub's `refs/pulls/` namespace, or many | 
|  | lightweight tags in the `refs/tags/` namespace. | 
|  |  | 
|  | Annotated tags storing the peeled object cost an additional 20 bytes | 
|  | per reference. | 
|  |  | 
|  | ### Low overhead | 
|  |  | 
|  | A reftable with very few references (e.g. git.git with 5 heads) | 
|  | is 269 bytes for reftable, vs. 332 bytes for packed-refs.  This | 
|  | supports reftable scaling down for transaction logs (below). | 
|  |  | 
|  | ### Block size | 
|  |  | 
|  | For a Gerrit Code Review type repository with many change refs, larger | 
|  | block sizes (64 KiB) and less frequent restart points (every 64) yield | 
|  | better compression due to more references within the block compressing | 
|  | against the prior reference. | 
|  |  | 
|  | Larger block sizes reduce the index size, as the reftable will | 
|  | require fewer blocks to store the same number of references. | 
|  |  | 
|  | ### Minimal disk seeks | 
|  |  | 
|  | Assuming the index block has been loaded into memory, binary searching | 
|  | for any single reference requires exactly 1 disk seek to load the | 
|  | containing block. | 
|  |  | 
|  | ### Scans and lookups dominate | 
|  |  | 
|  | Scanning all references and lookup by name (or namespace such as | 
|  | `refs/heads/`) are the most common activities performed on repositories. | 
|  | SHA-1s are stored directly with references to optimize this use case. | 
|  |  | 
|  | ### Logs are infrequently read | 
|  |  | 
|  | Logs are infrequently accessed, but can be large.  Deflating log | 
|  | blocks saves disk space, with some increased penalty at read time. | 
|  |  | 
|  | Logs are stored in an isolated section from refs, reducing the burden | 
|  | on reference readers that want to ignore logs.  Further, historical | 
|  | logs can be isolated into log-only files. | 
|  |  | 
|  | ### Logs are read backwards | 
|  |  | 
|  | Logs are frequently accessed backwards (most recent N records for | 
|  | master to answer `master@{4}`), so log records are grouped by | 
|  | reference, and sorted descending by update index. | 
|  |  | 
|  | ## Repository format | 
|  |  | 
|  | ### Version 1 | 
|  |  | 
|  | A repository must set its `$GIT_DIR/config` to configure reftable: | 
|  |  | 
|  | [core] | 
|  | repositoryformatversion = 1 | 
|  | [extensions] | 
|  | refStorage = reftable | 
|  |  | 
|  | ### Layout | 
|  |  | 
|  | A collection of reftable files are stored in the `$GIT_DIR/reftable/` | 
|  | directory: | 
|  |  | 
|  | 00000001-00000001.log | 
|  | 00000002-00000002.ref | 
|  | 00000003-00000003.ref | 
|  |  | 
|  | where reftable files are named by a unique name such as produced by | 
|  | the function `${min_update_index}-${max_update_index}.ref`. | 
|  |  | 
|  | Log-only files use the `.log` extension, while ref-only and mixed ref | 
|  | and log files use `.ref`.  extension. | 
|  |  | 
|  |  | 
|  | The stack ordering file is `$GIT_DIR/reftable/tables.list` and lists the current | 
|  | files, one per line, in order, from oldest (base) to newest (most recent): | 
|  |  | 
|  | $ cat .git/reftable/tables.list | 
|  | 00000001-00000001.log | 
|  | 00000002-00000002.ref | 
|  | 00000003-00000003.ref | 
|  |  | 
|  | Readers must read `$GIT_DIR/reftable/tables.list` to determine which files are | 
|  | relevant right now, and search through the stack in reverse order (last reftable | 
|  | is examined first). | 
|  |  | 
|  | Reftable files not listed in `tables.list` may be new (and about to be added | 
|  | to the stack by the active writer), or ancient and ready to be pruned. | 
|  |  | 
|  | ### Backward compatibility | 
|  |  | 
|  | Older clients should continue to recognize the directory as a git repository so | 
|  | they don't look for an enclosing repository in parent directories. To this end, | 
|  | a reftable-enabled repository must contain the following dummy files | 
|  |  | 
|  | *   `.git/HEAD`, a regular file containing `ref: refs/heads/.invalid`. | 
|  | *   `.git/refs/`, a directory | 
|  | *   `.git/refs/heads`, a regular file | 
|  |  | 
|  | ### Readers | 
|  |  | 
|  | Readers can obtain a consistent snapshot of the reference space by | 
|  | following: | 
|  |  | 
|  | 1.  Open and read the `tables.list` file. | 
|  | 2.  Open each of the reftable files that it mentions. | 
|  | 3.  If any of the files is missing, goto 1. | 
|  | 4.  Read from the now-open files as long as necessary. | 
|  |  | 
|  | ### Update transactions | 
|  |  | 
|  | Although reftables are immutable, mutations are supported by writing a | 
|  | new reftable and atomically appending it to the stack: | 
|  |  | 
|  | 1. Acquire `tables.list.lock`. | 
|  | 2. Read `tables.list` to determine current reftables. | 
|  | 3. Select `update_index` to be most recent file's `max_update_index + 1`. | 
|  | 4. Prepare temp reftable `tmp_XXXXXX`, including log entries. | 
|  | 5. Rename `tmp_XXXXXX` to `${update_index}-${update_index}.ref`. | 
|  | 6. Copy `tables.list` to `tables.list.lock`, appending file from (5). | 
|  | 7. Rename `tables.list.lock` to `tables.list`. | 
|  |  | 
|  | During step 4 the new file's `min_update_index` and `max_update_index` | 
|  | are both set to the `update_index` selected by step 3.  All log | 
|  | records for the transaction use the same `update_index` in their keys. | 
|  | This enables later correlation of which references were updated by the | 
|  | same transaction. | 
|  |  | 
|  | Because a single `tables.list.lock` file is used to manage locking, the | 
|  | repository is single-threaded for writers.  Writers may have to | 
|  | busy-spin (with backoff) around creating `tables.list.lock`, for up to an | 
|  | acceptable wait period, aborting if the repository is too busy to | 
|  | mutate.  Application servers wrapped around repositories (e.g.  Gerrit | 
|  | Code Review) can layer their own lock/wait queue to improve fairness | 
|  | to writers. | 
|  |  | 
|  | ### Reference deletions | 
|  |  | 
|  | Deletion of any reference can be explicitly stored by setting the | 
|  | `type` to `0x0` and omitting the `value` field of the `ref_record`. | 
|  | This serves as a tombstone, overriding any assertions about the | 
|  | existence of the reference from earlier files in the stack. | 
|  |  | 
|  | ### Compaction | 
|  |  | 
|  | A partial stack of reftables can be compacted by merging references | 
|  | using a straightforward merge join across reftables, selecting the | 
|  | most recent value for output, and omitting deleted references that do | 
|  | not appear in remaining, lower reftables. | 
|  |  | 
|  | A compacted reftable should set its `min_update_index` to the smallest of | 
|  | the input files' `min_update_index`, and its `max_update_index` | 
|  | likewise to the largest input `max_update_index`. | 
|  |  | 
|  | For sake of illustration, assume the stack currently consists of | 
|  | reftable files (from oldest to newest): A, B, C, and D. The compactor | 
|  | is going to compact B and C, leaving A and D alone. | 
|  |  | 
|  | 1.  Obtain lock `tables.list.lock` and read the `tables.list` file. | 
|  | 2.  Obtain locks `B.lock` and `C.lock`. | 
|  | Ownership of these locks prevents other processes from trying | 
|  | to compact these files. | 
|  | 3.  Release `tables.list.lock`. | 
|  | 4.  Compact `B` and `C` into a temp file `${min_update_index}-${max_update_index}_XXXXXX`. | 
|  | 5.  Reacquire lock `tables.list.lock`. | 
|  | 6.  Verify that `B` and `C` are still in the stack, in that order. This | 
|  | should always be the case, assuming that other processes are adhering | 
|  | to the locking protocol. | 
|  | 7.  Rename `${min_update_index}-${max_update_index}_XXXXXX` to | 
|  | `${min_update_index}-${max_update_index}.ref`. | 
|  | 8.  Write the new stack to `tables.list.lock`, replacing `B` and `C` with the | 
|  | file from (4). | 
|  | 9.  Rename `tables.list.lock` to `tables.list`. | 
|  | 10. Delete `B` and `C`, perhaps after a short sleep to avoid forcing | 
|  | readers to backtrack. | 
|  |  | 
|  | This strategy permits compactions to proceed independently of updates. | 
|  |  | 
|  | Each reftable (compacted or not) is uniquely identified by its name, so open | 
|  | reftables can be cached by their name. | 
|  |  | 
|  | ## Alternatives considered | 
|  |  | 
|  | ### bzip packed-refs | 
|  |  | 
|  | `bzip2` can significantly shrink a large packed-refs file (e.g. 62 | 
|  | MiB compresses to 23 MiB, 37%).  However the bzip format does not support | 
|  | random access to a single reference. Readers must inflate and discard | 
|  | while performing a linear scan. | 
|  |  | 
|  | Breaking packed-refs into chunks (individually compressing each chunk) | 
|  | would reduce the amount of data a reader must inflate, but still | 
|  | leaves the problem of indexing chunks to support readers efficiently | 
|  | locating the correct chunk. | 
|  |  | 
|  | Given the compression achieved by reftable's encoding, it does not | 
|  | seem necessary to add the complexity of bzip/gzip/zlib. | 
|  |  | 
|  | ### Michael Haggerty's alternate format | 
|  |  | 
|  | Michael Haggerty proposed [an alternate][mh-alt] format to reftable on | 
|  | the Git mailing list.  This format uses smaller chunks, without the | 
|  | restart table, and avoids block alignment with padding.  Reflog entries | 
|  | immediately follow each ref, and are thus interleaved between refs. | 
|  |  | 
|  | Performance testing indicates reftable is faster for lookups (51% | 
|  | faster, 11.2 usec vs.  5.4 usec), although reftable produces a | 
|  | slightly larger file (+ ~3.2%, 28.3M vs 29.2M): | 
|  |  | 
|  | format    |  size  | seek cold | seek hot  | | 
|  | ---------:|-------:|----------:|----------:| | 
|  | mh-alt    | 28.3 M | 23.4 usec | 11.2 usec | | 
|  | reftable  | 29.2 M | 19.9 usec |  5.4 usec | | 
|  |  | 
|  | [mh-alt]: https://public-inbox.org/git/CAMy9T_HCnyc1g8XWOOWhe7nN0aEFyyBskV2aOMb_fe+wGvEJ7A@mail.gmail.com/ | 
|  |  | 
|  | ### JGit Ketch RefTree | 
|  |  | 
|  | [JGit Ketch][ketch] proposed [RefTree][reftree], an encoding of | 
|  | references inside Git tree objects stored as part of the repository's | 
|  | object database. | 
|  |  | 
|  | The RefTree format adds additional load on the object database storage | 
|  | layer (more loose objects, more objects in packs), and relies heavily | 
|  | on the packer's delta compression to save space.  Namespaces which are | 
|  | flat (e.g.  thousands of tags in refs/tags) initially create very | 
|  | large loose objects, and so RefTree does not address the problem of | 
|  | copying many references to modify a handful. | 
|  |  | 
|  | Flat namespaces are not efficiently searchable in RefTree, as tree | 
|  | objects in canonical formatting cannot be binary searched. This fails | 
|  | the need to handle a large number of references in a single namespace, | 
|  | such as GitHub's `refs/pulls`, or a project with many tags. | 
|  |  | 
|  | [ketch]: https://dev.eclipse.org/mhonarc/lists/jgit-dev/msg03073.html | 
|  | [reftree]: https://public-inbox.org/git/CAJo=hJvnAPNAdDcAAwAvU9C4RVeQdoS3Ev9WTguHx4fD0V_nOg@mail.gmail.com/ | 
|  |  | 
|  | ### LMDB | 
|  |  | 
|  | David Turner proposed [using LMDB][dt-lmdb], as LMDB is lightweight | 
|  | (64k of runtime code) and GPL-compatible license. | 
|  |  | 
|  | A downside of LMDB is its reliance on a single C implementation.  This | 
|  | makes embedding inside JGit (a popular reimplemenation of Git) | 
|  | difficult, and hoisting onto virtual storage (for JGit DFS) virtually | 
|  | impossible. | 
|  |  | 
|  | A common format that can be supported by all major Git implementations | 
|  | (git-core, JGit, libgit2) is strongly preferred. | 
|  |  | 
|  | [dt-lmdb]: https://public-inbox.org/git/1455772670-21142-26-git-send-email-dturner@twopensource.com/ | 
|  |  | 
|  | ## Future | 
|  |  | 
|  | ### Longer hashes | 
|  |  | 
|  | Version will bump (e.g.  2) to indicate `value` uses a different | 
|  | object id length other than 20.  The length could be stored in an | 
|  | expanded file header, or hardcoded as part of the version. |