diff options
Diffstat (limited to 'third_party/git/Documentation/technical/reftable.txt')
-rw-r--r-- | third_party/git/Documentation/technical/reftable.txt | 1083 |
1 files changed, 0 insertions, 1083 deletions
diff --git a/third_party/git/Documentation/technical/reftable.txt b/third_party/git/Documentation/technical/reftable.txt deleted file mode 100644 index 2951840e9c9b..000000000000 --- a/third_party/git/Documentation/technical/reftable.txt +++ /dev/null @@ -1,1083 +0,0 @@ -reftable --------- - -Overview -~~~~~~~~ - -Problem statement -^^^^^^^^^^^^^^^^^ - -Some repositories contain a lot of references (e.g. android at 866k, -rails at 31k). The existing packed-refs format takes up a lot of space -(e.g. 62M), and does not scale with additional references. Lookup of a -single reference requires linearly scanning the file. - -Atomic pushes modifying multiple references require copying the entire -packed-refs file, which can be a considerable amount of data moved -(e.g. 62M in, 62M out) for even small transactions (2 refs modified). - -Repositories with many loose references occupy a large number of disk -blocks from the local file system, as each reference is its own file -storing 41 bytes (and another file for the corresponding reflog). This -negatively affects the number of inodes available when a large number of -repositories are stored on the same filesystem. Readers can be penalized -due to the larger number of syscalls required to traverse and read the -`$GIT_DIR/refs` directory. - - -Objectives -^^^^^^^^^^ - -* Near constant time lookup for any single reference, even when the -repository is cold and not in process or kernel cache. -* Near constant time verification if an object name is referred to by at least -one reference (for allow-tip-sha1-in-want). -* Efficient enumeration of an entire namespace, such as `refs/tags/`. -* Support atomic push with `O(size_of_update)` operations. -* Combine reflog storage with ref storage for small transactions. -* Separate reflog storage for base refs and historical logs. - -Description -^^^^^^^^^^^ - -A reftable file is a portable binary file format customized for -reference storage. References are sorted, enabling linear scans, binary -search lookup, and range scans. - -Storage in the file is organized into variable sized blocks. Prefix -compression is used within a single block to reduce disk space. Block -size and alignment is tunable by the writer. - -Performance -^^^^^^^^^^^ - -Space used, packed-refs vs. reftable: - -[cols=",>,>,>,>,>",options="header",] -|=============================================================== -|repository |packed-refs |reftable |% original |avg ref |avg obj -|android |62.2 M |36.1 M |58.0% |33 bytes |5 bytes -|rails |1.8 M |1.1 M |57.7% |29 bytes |4 bytes -|git |78.7 K |48.1 K |61.0% |50 bytes |4 bytes -|git (heads) |332 b |269 b |81.0% |33 bytes |0 bytes -|=============================================================== - -Scan (read 866k refs), by reference name lookup (single ref from 866k -refs), and by SHA-1 lookup (refs with that SHA-1, from 866k refs): - -[cols=",>,>,>,>",options="header",] -|========================================================= -|format |cache |scan |by name |by SHA-1 -|packed-refs |cold |402 ms |409,660.1 usec |412,535.8 usec -|packed-refs |hot | |6,844.6 usec |20,110.1 usec -|reftable |cold |112 ms |33.9 usec |323.2 usec -|reftable |hot | |20.2 usec |320.8 usec -|========================================================= - -Space used for 149,932 log entries for 43,061 refs, reflog vs. reftable: - -[cols=",>,>",options="header",] -|================================ -|format |size |avg entry -|$GIT_DIR/logs |173 M |1209 bytes -|reftable |5 M |37 bytes -|================================ - -Details -~~~~~~~ - -Peeling -^^^^^^^ - -References stored in a reftable are peeled, a record for an annotated -(or signed) tag records both the tag object, and the object it refers -to. This is analogous to storage in the packed-refs format. - -Reference name encoding -^^^^^^^^^^^^^^^^^^^^^^^ - -Reference names are an uninterpreted sequence of bytes that must pass -linkgit:git-check-ref-format[1] as a valid reference name. - -Key unicity -^^^^^^^^^^^ - -Each entry must have a unique key; repeated keys are disallowed. - -Network byte order -^^^^^^^^^^^^^^^^^^ - -All multi-byte, fixed width fields are in network byte order. - -Varint encoding -^^^^^^^^^^^^^^^ - -Varint encoding is identical to the ofs-delta encoding method used -within pack files. - -Decoder works such as: - -.... -val = buf[ptr] & 0x7f -while (buf[ptr] & 0x80) { - ptr++ - val = ((val + 1) << 7) | (buf[ptr] & 0x7f) -} -.... - -Ordering -^^^^^^^^ - -Blocks are lexicographically ordered by their first reference. - -Directory/file conflicts -^^^^^^^^^^^^^^^^^^^^^^^^ - -The reftable format accepts both `refs/heads/foo` and -`refs/heads/foo/bar` as distinct references. - -This property is useful for retaining log records in reftable, but may -confuse versions of Git using `$GIT_DIR/refs` directory tree to maintain -references. Users of reftable may choose to continue to reject `foo` and -`foo/bar` type conflicts to prevent problems for peers. - -File format -~~~~~~~~~~~ - -Structure -^^^^^^^^^ - -A reftable file has the following high-level structure: - -.... -first_block { - header - first_ref_block -} -ref_block* -ref_index* -obj_block* -obj_index* -log_block* -log_index* -footer -.... - -A log-only file omits the `ref_block`, `ref_index`, `obj_block` and -`obj_index` sections, containing only the file header and log block: - -.... -first_block { - header -} -log_block* -log_index* -footer -.... - -in a log-only file the first log block immediately follows the file -header, without padding to block alignment. - -Block size -^^^^^^^^^^ - -The file's block size is arbitrarily determined by the writer, and does -not have to be a power of 2. The block size must be larger than the -longest reference name or log entry used in the repository, as -references cannot span blocks. - -Powers of two that are friendly to the virtual memory system or -filesystem (such as 4k or 8k) are recommended. Larger sizes (64k) can -yield better compression, with a possible increased cost incurred by -readers during access. - -The largest block size is `16777215` bytes (15.99 MiB). - -Block alignment -^^^^^^^^^^^^^^^ - -Writers may choose to align blocks at multiples of the block size by -including `padding` filled with NUL bytes at the end of a block to round -out to the chosen alignment. When alignment is used, writers must -specify the alignment with the file header's `block_size` field. - -Block alignment is not required by the file format. Unaligned files must -set `block_size = 0` in the file header, and omit `padding`. Unaligned -files with more than one ref block must include the link:#Ref-index[ref -index] to support fast lookup. Readers must be able to read both aligned -and non-aligned files. - -Very small files (e.g. a single ref block) may omit `padding` and the ref -index to reduce total file size. - -Header (version 1) -^^^^^^^^^^^^^^^^^^ - -A 24-byte header appears at the beginning of the file: - -.... -'REFT' -uint8( version_number = 1 ) -uint24( block_size ) -uint64( min_update_index ) -uint64( max_update_index ) -.... - -Aligned files must specify `block_size` to configure readers with the -expected block alignment. Unaligned files must set `block_size = 0`. - -The `min_update_index` and `max_update_index` describe bounds for the -`update_index` field of all log records in this file. When reftables are -used in a stack for link:#Update-transactions[transactions], these -fields can order the files such that the prior file's -`max_update_index + 1` is the next file's `min_update_index`. - -Header (version 2) -^^^^^^^^^^^^^^^^^^ - -A 28-byte header appears at the beginning of the file: - -.... -'REFT' -uint8( version_number = 2 ) -uint24( block_size ) -uint64( min_update_index ) -uint64( max_update_index ) -uint32( hash_id ) -.... - -The header is identical to `version_number=1`, with the 4-byte hash ID -("sha1" for SHA1 and "s256" for SHA-256) append to the header. - -For maximum backward compatibility, it is recommended to use version 1 when -writing SHA1 reftables. - -First ref block -^^^^^^^^^^^^^^^ - -The first ref block shares the same block as the file header, and is 24 -bytes smaller than all other blocks in the file. The first block -immediately begins after the file header, at position 24. - -If the first block is a log block (a log-only file), its block header -begins immediately at position 24. - -Ref block format -^^^^^^^^^^^^^^^^ - -A ref block is written as: - -.... -'r' -uint24( block_len ) -ref_record+ -uint24( restart_offset )+ -uint16( restart_count ) - -padding? -.... - -Blocks begin with `block_type = 'r'` and a 3-byte `block_len` which -encodes the number of bytes in the block up to, but not including the -optional `padding`. This is always less than or equal to the file's -block size. In the first ref block, `block_len` includes 24 bytes for -the file header. - -The 2-byte `restart_count` stores the number of entries in the -`restart_offset` list, which must not be empty. Readers can use -`restart_count` to binary search between restarts before starting a -linear scan. - -Exactly `restart_count` 3-byte `restart_offset` values precedes the -`restart_count`. Offsets are relative to the start of the block and -refer to the first byte of any `ref_record` whose name has not been -prefix compressed. Entries in the `restart_offset` list must be sorted, -ascending. Readers can start linear scans from any of these records. - -A variable number of `ref_record` fill the middle of the block, -describing reference names and values. The format is described below. - -As the first ref block shares the first file block with the file header, -all `restart_offset` in the first block are relative to the start of the -file (position 0), and include the file header. This forces the first -`restart_offset` to be `28`. - -ref record -++++++++++ - -A `ref_record` describes a single reference, storing both the name and -its value(s). Records are formatted as: - -.... -varint( prefix_length ) -varint( (suffix_length << 3) | value_type ) -suffix -varint( update_index_delta ) -value? -.... - -The `prefix_length` field specifies how many leading bytes of the prior -reference record's name should be copied to obtain this reference's -name. This must be 0 for the first reference in any block, and also must -be 0 for any `ref_record` whose offset is listed in the `restart_offset` -table at the end of the block. - -Recovering a reference name from any `ref_record` is a simple concat: - -.... -this_name = prior_name[0..prefix_length] + suffix -.... - -The `suffix_length` value provides the number of bytes available in -`suffix` to copy from `suffix` to complete the reference name. - -The `update_index` that last modified the reference can be obtained by -adding `update_index_delta` to the `min_update_index` from the file -header: `min_update_index + update_index_delta`. - -The `value` follows. Its format is determined by `value_type`, one of -the following: - -* `0x0`: deletion; no value data (see transactions, below) -* `0x1`: one object name; value of the ref -* `0x2`: two object names; value of the ref, peeled target -* `0x3`: symbolic reference: `varint( target_len ) target` - -Symbolic references use `0x3`, followed by the complete name of the -reference target. No compression is applied to the target name. - -Types `0x4..0x7` are reserved for future use. - -Ref index -^^^^^^^^^ - -The ref index stores the name of the last reference from every ref block -in the file, enabling reduced disk seeks for lookups. Any reference can -be found by searching the index, identifying the containing block, and -searching within that block. - -The index may be organized into a multi-level index, where the 1st level -index block points to additional ref index blocks (2nd level), which may -in turn point to either additional index blocks (e.g. 3rd level) or ref -blocks (leaf level). Disk reads required to access a ref go up with -higher index levels. Multi-level indexes may be required to ensure no -single index block exceeds the file format's max block size of -`16777215` bytes (15.99 MiB). To achieve constant O(1) disk seeks for -lookups the index must be a single level, which is permitted to exceed -the file's configured block size, but not the format's max block size of -15.99 MiB. - -If present, the ref index block(s) appears after the last ref block. - -If there are at least 4 ref blocks, a ref index block should be written -to improve lookup times. Cold reads using the index require 2 disk reads -(read index, read block), and binary searching < 4 blocks also requires -<= 2 reads. Omitting the index block from smaller files saves space. - -If the file is unaligned and contains more than one ref block, the ref -index must be written. - -Index block format: - -.... -'i' -uint24( block_len ) -index_record+ -uint24( restart_offset )+ -uint16( restart_count ) - -padding? -.... - -The index blocks begin with `block_type = 'i'` and a 3-byte `block_len` -which encodes the number of bytes in the block, up to but not including -the optional `padding`. - -The `restart_offset` and `restart_count` fields are identical in format, -meaning and usage as in ref blocks. - -To reduce the number of reads required for random access in very large -files the index block may be larger than other blocks. However, readers -must hold the entire index in memory to benefit from this, so it's a -time-space tradeoff in both file size and reader memory. - -Increasing the file's block size decreases the index size. Alternatively -a multi-level index may be used, keeping index blocks within the file's -block size, but increasing the number of blocks that need to be -accessed. - -index record -++++++++++++ - -An index record describes the last entry in another block. Index records -are written as: - -.... -varint( prefix_length ) -varint( (suffix_length << 3) | 0 ) -suffix -varint( block_position ) -.... - -Index records use prefix compression exactly like `ref_record`. - -Index records store `block_position` after the suffix, specifying the -absolute position in bytes (from the start of the file) of the block -that ends with this reference. Readers can seek to `block_position` to -begin reading the block header. - -Readers must examine the block header at `block_position` to determine -if the next block is another level index block, or the leaf-level ref -block. - -Reading the index -+++++++++++++++++ - -Readers loading the ref index must first read the footer (below) to -obtain `ref_index_position`. If not present, the position will be 0. The -`ref_index_position` is for the 1st level root of the ref index. - -Obj block format -^^^^^^^^^^^^^^^^ - -Object blocks are optional. Writers may choose to omit object blocks, -especially if readers will not use the object name to ref mapping. - -Object blocks use unique, abbreviated 2-32 object name keys, mapping to -ref blocks containing references pointing to that object directly, or as -the peeled value of an annotated tag. Like ref blocks, object blocks use -the file's standard block size. The abbrevation length is available in -the footer as `obj_id_len`. - -To save space in small files, object blocks may be omitted if the ref -index is not present, as brute force search will only need to read a few -ref blocks. When missing, readers should brute force a linear search of -all references to lookup by object name. - -An object block is written as: - -.... -'o' -uint24( block_len ) -obj_record+ -uint24( restart_offset )+ -uint16( restart_count ) - -padding? -.... - -Fields are identical to ref block. Binary search using the restart table -works the same as in reference blocks. - -Because object names are abbreviated by writers to the shortest unique -abbreviation within the reftable, obj key lengths have a variable length. Their -length must be at least 2 bytes. Readers must compare only for common prefix -match within an obj block or obj index. - -obj record -++++++++++ - -An `obj_record` describes a single object abbreviation, and the blocks -containing references using that unique abbreviation: - -.... -varint( prefix_length ) -varint( (suffix_length << 3) | cnt_3 ) -suffix -varint( cnt_large )? -varint( position_delta )* -.... - -Like in reference blocks, abbreviations are prefix compressed within an -obj block. On large reftables with many unique objects, higher block -sizes (64k), and higher restart interval (128), a `prefix_length` of 2 -or 3 and `suffix_length` of 3 may be common in obj records (unique -abbreviation of 5-6 raw bytes, 10-12 hex digits). - -Each record contains `position_count` number of positions for matching -ref blocks. For 1-7 positions the count is stored in `cnt_3`. When -`cnt_3 = 0` the actual count follows in a varint, `cnt_large`. - -The use of `cnt_3` bets most objects are pointed to by only a single -reference, some may be pointed to by a couple of references, and very -few (if any) are pointed to by more than 7 references. - -A special case exists when `cnt_3 = 0` and `cnt_large = 0`: there are no -`position_delta`, but at least one reference starts with this -abbreviation. A reader that needs exact reference names must scan all -references to find which specific references have the desired object. -Writers should use this format when the `position_delta` list would have -overflowed the file's block size due to a high number of references -pointing to the same object. - -The first `position_delta` is the position from the start of the file. -Additional `position_delta` entries are sorted ascending and relative to -the prior entry, e.g. a reader would perform: - -.... -pos = position_delta[0] -prior = pos -for (j = 1; j < position_count; j++) { - pos = prior + position_delta[j] - prior = pos -} -.... - -With a position in hand, a reader must linearly scan the ref block, -starting from the first `ref_record`, testing each reference's object names -(for `value_type = 0x1` or `0x2`) for full equality. Faster searching by -object name within a single ref block is not supported by the reftable format. -Smaller block sizes reduce the number of candidates this step must -consider. - -Obj index -^^^^^^^^^ - -The obj index stores the abbreviation from the last entry for every obj -block in the file, enabling reduced disk seeks for all lookups. It is -formatted exactly the same as the ref index, but refers to obj blocks. - -The obj index should be present if obj blocks are present, as obj blocks -should only be written in larger files. - -Readers loading the obj index must first read the footer (below) to -obtain `obj_index_position`. If not present, the position will be 0. - -Log block format -^^^^^^^^^^^^^^^^ - -Unlike ref and obj blocks, log blocks are always unaligned. - -Log blocks are variable in size, and do not match the `block_size` -specified in the file header or footer. Writers should choose an -appropriate buffer size to prepare a log block for deflation, such as -`2 * block_size`. - -A log block is written as: - -.... -'g' -uint24( block_len ) -zlib_deflate { - log_record+ - uint24( restart_offset )+ - uint16( restart_count ) -} -.... - -Log blocks look similar to ref blocks, except `block_type = 'g'`. - -The 4-byte block header is followed by the deflated block contents using -zlib deflate. The `block_len` in the header is the inflated size -(including 4-byte block header), and should be used by readers to -preallocate the inflation output buffer. A log block's `block_len` may -exceed the file's block size. - -Offsets within the log block (e.g. `restart_offset`) still include the -4-byte header. Readers may prefer prefixing the inflation output buffer -with the 4-byte header. - -Within the deflate container, a variable number of `log_record` describe -reference changes. The log record format is described below. See ref -block format (above) for a description of `restart_offset` and -`restart_count`. - -Because log blocks have no alignment or padding between blocks, readers -must keep track of the bytes consumed by the inflater to know where the -next log block begins. - -log record -++++++++++ - -Log record keys are structured as: - -.... -ref_name '\0' reverse_int64( update_index ) -.... - -where `update_index` is the unique transaction identifier. The -`update_index` field must be unique within the scope of a `ref_name`. -See the update transactions section below for further details. - -The `reverse_int64` function inverses the value so lexicographical -ordering the network byte order encoding sorts the more recent records -with higher `update_index` values first: - -.... -reverse_int64(int64 t) { - return 0xffffffffffffffff - t; -} -.... - -Log records have a similar starting structure to ref and index records, -utilizing the same prefix compression scheme applied to the log record -key described above. - -.... - varint( prefix_length ) - varint( (suffix_length << 3) | log_type ) - suffix - log_data { - old_id - new_id - varint( name_length ) name - varint( email_length ) email - varint( time_seconds ) - sint16( tz_offset ) - varint( message_length ) message - }? -.... - -Log record entries use `log_type` to indicate what follows: - -* `0x0`: deletion; no log data. -* `0x1`: standard git reflog data using `log_data` above. - -The `log_type = 0x0` is mostly useful for `git stash drop`, removing an -entry from the reflog of `refs/stash` in a transaction file (below), -without needing to rewrite larger files. Readers reading a stack of -reflogs must treat this as a deletion. - -For `log_type = 0x1`, the `log_data` section follows -linkgit:git-update-ref[1] logging and includes: - -* two object names (old id, new id) -* varint string of committer's name -* varint string of committer's email -* varint time in seconds since epoch (Jan 1, 1970) -* 2-byte timezone offset in minutes (signed) -* varint string of message - -`tz_offset` is the absolute number of minutes from GMT the committer was -at the time of the update. For example `GMT-0800` is encoded in reftable -as `sint16(-480)` and `GMT+0230` is `sint16(150)`. - -The committer email does not contain `<` or `>`, it's the value normally -found between the `<>` in a git commit object header. - -The `message_length` may be 0, in which case there was no message -supplied for the update. - -Contrary to traditional reflog (which is a file), renames are encoded as -a combination of ref deletion and ref creation. A deletion is a log -record with a zero new_id, and a creation is a log record with a zero old_id. - -Reading the log -+++++++++++++++ - -Readers accessing the log must first read the footer (below) to -determine the `log_position`. The first block of the log begins at -`log_position` bytes since the start of the file. The `log_position` is -not block aligned. - -Importing logs -++++++++++++++ - -When importing from `$GIT_DIR/logs` writers should globally order all -log records roughly by timestamp while preserving file order, and assign -unique, increasing `update_index` values for each log line. Newer log -records get higher `update_index` values. - -Although an import may write only a single reftable file, the reftable -file must span many unique `update_index`, as each log line requires its -own `update_index` to preserve semantics. - -Log index -^^^^^^^^^ - -The log index stores the log key -(`refname \0 reverse_int64(update_index)`) for the last log record of -every log block in the file, supporting bounded-time lookup. - -A log index block must be written if 2 or more log blocks are written to -the file. If present, the log index appears after the last log block. -There is no padding used to align the log index to block alignment. - -Log index format is identical to ref index, except the keys are 9 bytes -longer to include `'\0'` and the 8-byte `reverse_int64(update_index)`. -Records use `block_position` to refer to the start of a log block. - -Reading the index -+++++++++++++++++ - -Readers loading the log index must first read the footer (below) to -obtain `log_index_position`. If not present, the position will be 0. - -Footer -^^^^^^ - -After the last block of the file, a file footer is written. It begins -like the file header, but is extended with additional data. - -.... - HEADER - - uint64( ref_index_position ) - uint64( (obj_position << 5) | obj_id_len ) - uint64( obj_index_position ) - - uint64( log_position ) - uint64( log_index_position ) - - uint32( CRC-32 of above ) -.... - -If a section is missing (e.g. ref index) the corresponding position -field (e.g. `ref_index_position`) will be 0. - -* `obj_position`: byte position for the first obj block. -* `obj_id_len`: number of bytes used to abbreviate object names in -obj blocks. -* `log_position`: byte position for the first log block. -* `ref_index_position`: byte position for the start of the ref index. -* `obj_index_position`: byte position for the start of the obj index. -* `log_index_position`: byte position for the start of the log index. - -The size of the footer is 68 bytes for version 1, and 72 bytes for -version 2. - -Reading the footer -++++++++++++++++++ - -Readers must first read the file start to determine the version -number. Then they seek to `file_length - FOOTER_LENGTH` to access the -footer. A trusted external source (such as `stat(2)`) is necessary to -obtain `file_length`. When reading the footer, readers must verify: - -* 4-byte magic is correct -* 1-byte version number is recognized -* 4-byte CRC-32 matches the other 64 bytes (including magic, and -version) - -Once verified, the other fields of the footer can be accessed. - -Empty tables -++++++++++++ - -A reftable may be empty. In this case, the file starts with a header -and is immediately followed by a footer. - -Binary search -^^^^^^^^^^^^^ - -Binary search within a block is supported by the `restart_offset` fields -at the end of the block. Readers can binary search through the restart -table to locate between which two restart points the sought reference or -key should appear. - -Each record identified by a `restart_offset` stores the complete key in -the `suffix` field of the record, making the compare operation during -binary search straightforward. - -Once a restart point lexicographically before the sought reference has -been identified, readers can linearly scan through the following record -entries to locate the sought record, terminating if the current record -sorts after (and therefore the sought key is not present). - -Restart point selection -+++++++++++++++++++++++ - -Writers determine the restart points at file creation. The process is -arbitrary, but every 16 or 64 records is recommended. Every 16 may be -more suitable for smaller block sizes (4k or 8k), every 64 for larger -block sizes (64k). - -More frequent restart points reduces prefix compression and increases -space consumed by the restart table, both of which increase file size. - -Less frequent restart points makes prefix compression more effective, -decreasing overall file size, with increased penalties for readers -walking through more records after the binary search step. - -A maximum of `65535` restart points per block is supported. - -Considerations -~~~~~~~~~~~~~~ - -Lightweight refs dominate -^^^^^^^^^^^^^^^^^^^^^^^^^ - -The reftable format assumes the vast majority of references are single -object names valued with common prefixes, such as Gerrit Code Review's -`refs/changes/` namespace, GitHub's `refs/pulls/` namespace, or many -lightweight tags in the `refs/tags/` namespace. - -Annotated tags storing the peeled object cost an additional object name per -reference. - -Low overhead -^^^^^^^^^^^^ - -A reftable with very few references (e.g. git.git with 5 heads) is 269 -bytes for reftable, vs. 332 bytes for packed-refs. This supports -reftable scaling down for transaction logs (below). - -Block size -^^^^^^^^^^ - -For a Gerrit Code Review type repository with many change refs, larger -block sizes (64 KiB) and less frequent restart points (every 64) yield -better compression due to more references within the block compressing -against the prior reference. - -Larger block sizes reduce the index size, as the reftable will require -fewer blocks to store the same number of references. - -Minimal disk seeks -^^^^^^^^^^^^^^^^^^ - -Assuming the index block has been loaded into memory, binary searching -for any single reference requires exactly 1 disk seek to load the -containing block. - -Scans and lookups dominate -^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Scanning all references and lookup by name (or namespace such as -`refs/heads/`) are the most common activities performed on repositories. -Object names are stored directly with references to optimize this use case. - -Logs are infrequently read -^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Logs are infrequently accessed, but can be large. Deflating log blocks -saves disk space, with some increased penalty at read time. - -Logs are stored in an isolated section from refs, reducing the burden on -reference readers that want to ignore logs. Further, historical logs can -be isolated into log-only files. - -Logs are read backwards -^^^^^^^^^^^^^^^^^^^^^^^ - -Logs are frequently accessed backwards (most recent N records for master -to answer `master@{4}`), so log records are grouped by reference, and -sorted descending by update index. - -Repository format -~~~~~~~~~~~~~~~~~ - -Version 1 -^^^^^^^^^ - -A repository must set its `$GIT_DIR/config` to configure reftable: - -.... -[core] - repositoryformatversion = 1 -[extensions] - refStorage = reftable -.... - -Layout -^^^^^^ - -A collection of reftable files are stored in the `$GIT_DIR/reftable/` -directory: - -.... -00000001-00000001.log -00000002-00000002.ref -00000003-00000003.ref -.... - -where reftable files are named by a unique name such as produced by the -function `${min_update_index}-${max_update_index}.ref`. - -Log-only files use the `.log` extension, while ref-only and mixed ref -and log files use `.ref`. extension. - -The stack ordering file is `$GIT_DIR/reftable/tables.list` and lists the -current files, one per line, in order, from oldest (base) to newest -(most recent): - -.... -$ cat .git/reftable/tables.list -00000001-00000001.log -00000002-00000002.ref -00000003-00000003.ref -.... - -Readers must read `$GIT_DIR/reftable/tables.list` to determine which -files are relevant right now, and search through the stack in reverse -order (last reftable is examined first). - -Reftable files not listed in `tables.list` may be new (and about to be -added to the stack by the active writer), or ancient and ready to be -pruned. - -Backward compatibility -^^^^^^^^^^^^^^^^^^^^^^ - -Older clients should continue to recognize the directory as a git -repository so they don't look for an enclosing repository in parent -directories. To this end, a reftable-enabled repository must contain the -following dummy files - -* `.git/HEAD`, a regular file containing `ref: refs/heads/.invalid`. -* `.git/refs/`, a directory -* `.git/refs/heads`, a regular file - -Readers -^^^^^^^ - -Readers can obtain a consistent snapshot of the reference space by -following: - -1. Open and read the `tables.list` file. -2. Open each of the reftable files that it mentions. -3. If any of the files is missing, goto 1. -4. Read from the now-open files as long as necessary. - -Update transactions -^^^^^^^^^^^^^^^^^^^ - -Although reftables are immutable, mutations are supported by writing a -new reftable and atomically appending it to the stack: - -1. Acquire `tables.list.lock`. -2. Read `tables.list` to determine current reftables. -3. Select `update_index` to be most recent file's -`max_update_index + 1`. -4. Prepare temp reftable `tmp_XXXXXX`, including log entries. -5. Rename `tmp_XXXXXX` to `${update_index}-${update_index}.ref`. -6. Copy `tables.list` to `tables.list.lock`, appending file from (5). -7. Rename `tables.list.lock` to `tables.list`. - -During step 4 the new file's `min_update_index` and `max_update_index` -are both set to the `update_index` selected by step 3. All log records -for the transaction use the same `update_index` in their keys. This -enables later correlation of which references were updated by the same -transaction. - -Because a single `tables.list.lock` file is used to manage locking, the -repository is single-threaded for writers. Writers may have to busy-spin -(with backoff) around creating `tables.list.lock`, for up to an -acceptable wait period, aborting if the repository is too busy to -mutate. Application servers wrapped around repositories (e.g. Gerrit -Code Review) can layer their own lock/wait queue to improve fairness to -writers. - -Reference deletions -^^^^^^^^^^^^^^^^^^^ - -Deletion of any reference can be explicitly stored by setting the `type` -to `0x0` and omitting the `value` field of the `ref_record`. This serves -as a tombstone, overriding any assertions about the existence of the -reference from earlier files in the stack. - -Compaction -^^^^^^^^^^ - -A partial stack of reftables can be compacted by merging references -using a straightforward merge join across reftables, selecting the most -recent value for output, and omitting deleted references that do not -appear in remaining, lower reftables. - -A compacted reftable should set its `min_update_index` to the smallest -of the input files' `min_update_index`, and its `max_update_index` -likewise to the largest input `max_update_index`. - -For sake of illustration, assume the stack currently consists of -reftable files (from oldest to newest): A, B, C, and D. The compactor is -going to compact B and C, leaving A and D alone. - -1. Obtain lock `tables.list.lock` and read the `tables.list` file. -2. Obtain locks `B.lock` and `C.lock`. Ownership of these locks -prevents other processes from trying to compact these files. -3. Release `tables.list.lock`. -4. Compact `B` and `C` into a temp file -`${min_update_index}-${max_update_index}_XXXXXX`. -5. Reacquire lock `tables.list.lock`. -6. Verify that `B` and `C` are still in the stack, in that order. This -should always be the case, assuming that other processes are adhering to -the locking protocol. -7. Rename `${min_update_index}-${max_update_index}_XXXXXX` to -`${min_update_index}-${max_update_index}.ref`. -8. Write the new stack to `tables.list.lock`, replacing `B` and `C` -with the file from (4). -9. Rename `tables.list.lock` to `tables.list`. -10. Delete `B` and `C`, perhaps after a short sleep to avoid forcing -readers to backtrack. - -This strategy permits compactions to proceed independently of updates. - -Each reftable (compacted or not) is uniquely identified by its name, so -open reftables can be cached by their name. - -Alternatives considered -~~~~~~~~~~~~~~~~~~~~~~~ - -bzip packed-refs -^^^^^^^^^^^^^^^^ - -`bzip2` can significantly shrink a large packed-refs file (e.g. 62 MiB -compresses to 23 MiB, 37%). However the bzip format does not support -random access to a single reference. Readers must inflate and discard -while performing a linear scan. - -Breaking packed-refs into chunks (individually compressing each chunk) -would reduce the amount of data a reader must inflate, but still leaves -the problem of indexing chunks to support readers efficiently locating -the correct chunk. - -Given the compression achieved by reftable's encoding, it does not seem -necessary to add the complexity of bzip/gzip/zlib. - -Michael Haggerty's alternate format -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Michael Haggerty proposed -link:https://lore.kernel.org/git/CAMy9T_HCnyc1g8XWOOWhe7nN0aEFyyBskV2aOMb_fe%2BwGvEJ7A%40mail.gmail.com/[an -alternate] format to reftable on the Git mailing list. This format uses -smaller chunks, without the restart table, and avoids block alignment -with padding. Reflog entries immediately follow each ref, and are thus -interleaved between refs. - -Performance testing indicates reftable is faster for lookups (51% -faster, 11.2 usec vs. 5.4 usec), although reftable produces a slightly -larger file (+ ~3.2%, 28.3M vs 29.2M): - -[cols=">,>,>,>",options="header",] -|===================================== -|format |size |seek cold |seek hot -|mh-alt |28.3 M |23.4 usec |11.2 usec -|reftable |29.2 M |19.9 usec |5.4 usec -|===================================== - -JGit Ketch RefTree -^^^^^^^^^^^^^^^^^^ - -https://dev.eclipse.org/mhonarc/lists/jgit-dev/msg03073.html[JGit Ketch] -proposed -link:https://lore.kernel.org/git/CAJo%3DhJvnAPNAdDcAAwAvU9C4RVeQdoS3Ev9WTguHx4fD0V_nOg%40mail.gmail.com/[RefTree], -an encoding of references inside Git tree objects stored as part of the -repository's object database. - -The RefTree format adds additional load on the object database storage -layer (more loose objects, more objects in packs), and relies heavily on -the packer's delta compression to save space. Namespaces which are flat -(e.g. thousands of tags in refs/tags) initially create very large loose -objects, and so RefTree does not address the problem of copying many -references to modify a handful. - -Flat namespaces are not efficiently searchable in RefTree, as tree -objects in canonical formatting cannot be binary searched. This fails -the need to handle a large number of references in a single namespace, -such as GitHub's `refs/pulls`, or a project with many tags. - -LMDB -^^^^ - -David Turner proposed -https://lore.kernel.org/git/1455772670-21142-26-git-send-email-dturner@twopensource.com/[using -LMDB], as LMDB is lightweight (64k of runtime code) and GPL-compatible -license. - -A downside of LMDB is its reliance on a single C implementation. This -makes embedding inside JGit (a popular reimplementation of Git) -difficult, and hoisting onto virtual storage (for JGit DFS) virtually -impossible. - -A common format that can be supported by all major Git implementations -(git-core, JGit, libgit2) is strongly preferred. |