diff options
Diffstat (limited to 'third_party/git/Documentation/technical/commit-graph.txt')
-rw-r--r-- | third_party/git/Documentation/technical/commit-graph.txt | 350 |
1 files changed, 0 insertions, 350 deletions
diff --git a/third_party/git/Documentation/technical/commit-graph.txt b/third_party/git/Documentation/technical/commit-graph.txt deleted file mode 100644 index f14a7659aa87..000000000000 --- a/third_party/git/Documentation/technical/commit-graph.txt +++ /dev/null @@ -1,350 +0,0 @@ -Git Commit Graph Design Notes -============================= - -Git walks the commit graph for many reasons, including: - -1. Listing and filtering commit history. -2. Computing merge bases. - -These operations can become slow as the commit count grows. The merge -base calculation shows up in many user-facing commands, such as 'merge-base' -or 'status' and can take minutes to compute depending on history shape. - -There are two main costs here: - -1. Decompressing and parsing commits. -2. Walking the entire graph to satisfy topological order constraints. - -The commit-graph file is a supplemental data structure that accelerates -commit graph walks. If a user downgrades or disables the 'core.commitGraph' -config setting, then the existing ODB is sufficient. The file is stored -as "commit-graph" either in the .git/objects/info directory or in the info -directory of an alternate. - -The commit-graph file stores the commit graph structure along with some -extra metadata to speed up graph walks. By listing commit OIDs in -lexicographic order, we can identify an integer position for each commit -and refer to the parents of a commit using those integer positions. We -use binary search to find initial commits and then use the integer -positions for fast lookups during the walk. - -A consumer may load the following info for a commit from the graph: - -1. The commit OID. -2. The list of parents, along with their integer position. -3. The commit date. -4. The root tree OID. -5. The generation number (see definition below). - -Values 1-4 satisfy the requirements of parse_commit_gently(). - -Define the "generation number" of a commit recursively as follows: - - * A commit with no parents (a root commit) has generation number one. - - * A commit with at least one parent has generation number one more than - the largest generation number among its parents. - -Equivalently, the generation number of a commit A is one more than the -length of a longest path from A to a root commit. The recursive definition -is easier to use for computation and observing the following property: - - If A and B are commits with generation numbers N and M, respectively, - and N <= M, then A cannot reach B. That is, we know without searching - that B is not an ancestor of A because it is further from a root commit - than A. - - Conversely, when checking if A is an ancestor of B, then we only need - to walk commits until all commits on the walk boundary have generation - number at most N. If we walk commits using a priority queue seeded by - generation numbers, then we always expand the boundary commit with highest - generation number and can easily detect the stopping condition. - -This property can be used to significantly reduce the time it takes to -walk commits and determine topological relationships. Without generation -numbers, the general heuristic is the following: - - If A and B are commits with commit time X and Y, respectively, and - X < Y, then A _probably_ cannot reach B. - -This heuristic is currently used whenever the computation is allowed to -violate topological relationships due to clock skew (such as "git log" -with default order), but is not used when the topological order is -required (such as merge base calculations, "git log --graph"). - -In practice, we expect some commits to be created recently and not stored -in the commit graph. We can treat these commits as having "infinite" -generation number and walk until reaching commits with known generation -number. - -We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not -in the commit-graph file. If a commit-graph file was written by a version -of Git that did not compute generation numbers, then those commits will -have generation number represented by the macro GENERATION_NUMBER_ZERO = 0. - -Since the commit-graph file is closed under reachability, we can guarantee -the following weaker condition on all commits: - - If A and B are commits with generation numbers N and M, respectively, - and N < M, then A cannot reach B. - -Note how the strict inequality differs from the inequality when we have -fully-computed generation numbers. Using strict inequality may result in -walking a few extra commits, but the simplicity in dealing with commits -with generation number *_INFINITY or *_ZERO is valuable. - -We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose -generation numbers are computed to be at least this value. We limit at -this value since it is the largest value that can be stored in the -commit-graph file using the 30 bits available to generation numbers. This -presents another case where a commit can have generation number equal to -that of a parent. - -Design Details --------------- - -- The commit-graph file is stored in a file named 'commit-graph' in the - .git/objects/info directory. This could be stored in the info directory - of an alternate. - -- The core.commitGraph config setting must be on to consume graph files. - -- The file format includes parameters for the object ID hash function, - so a future change of hash algorithm does not require a change in format. - -- Commit grafts and replace objects can change the shape of the commit - history. The latter can also be enabled/disabled on the fly using - `--no-replace-objects`. This leads to difficultly storing both possible - interpretations of a commit id, especially when computing generation - numbers. The commit-graph will not be read or written when - replace-objects or grafts are present. - -- Shallow clones create grafts of commits by dropping their parents. This - leads the commit-graph to think those commits have generation number 1. - If and when those commits are made unshallow, those generation numbers - become invalid. Since shallow clones are intended to restrict the commit - history to a very small set of commits, the commit-graph feature is less - helpful for these clones, anyway. The commit-graph will not be read or - written when shallow commits are present. - -Commit Graphs Chains --------------------- - -Typically, repos grow with near-constant velocity (commits per day). Over time, -the number of commits added by a fetch operation is much smaller than the -number of commits in the full history. By creating a "chain" of commit-graphs, -we enable fast writes of new commit data without rewriting the entire commit -history -- at least, most of the time. - -## File Layout - -A commit-graph chain uses multiple files, and we use a fixed naming convention -to organize these files. Each commit-graph file has a name -`$OBJDIR/info/commit-graphs/graph-{hash}.graph` where `{hash}` is the hex- -valued hash stored in the footer of that file (which is a hash of the file's -contents before that hash). For a chain of commit-graph files, a plain-text -file at `$OBJDIR/info/commit-graphs/commit-graph-chain` contains the -hashes for the files in order from "lowest" to "highest". - -For example, if the `commit-graph-chain` file contains the lines - -``` - {hash0} - {hash1} - {hash2} -``` - -then the commit-graph chain looks like the following diagram: - - +-----------------------+ - | graph-{hash2}.graph | - +-----------------------+ - | - +-----------------------+ - | | - | graph-{hash1}.graph | - | | - +-----------------------+ - | - +-----------------------+ - | | - | | - | | - | graph-{hash0}.graph | - | | - | | - | | - +-----------------------+ - -Let X0 be the number of commits in `graph-{hash0}.graph`, X1 be the number of -commits in `graph-{hash1}.graph`, and X2 be the number of commits in -`graph-{hash2}.graph`. If a commit appears in position i in `graph-{hash2}.graph`, -then we interpret this as being the commit in position (X0 + X1 + i), and that -will be used as its "graph position". The commits in `graph-{hash2}.graph` use these -positions to refer to their parents, which may be in `graph-{hash1}.graph` or -`graph-{hash0}.graph`. We can navigate to an arbitrary commit in position j by checking -its containment in the intervals [0, X0), [X0, X0 + X1), [X0 + X1, X0 + X1 + -X2). - -Each commit-graph file (except the base, `graph-{hash0}.graph`) contains data -specifying the hashes of all files in the lower layers. In the above example, -`graph-{hash1}.graph` contains `{hash0}` while `graph-{hash2}.graph` contains -`{hash0}` and `{hash1}`. - -## Merging commit-graph files - -If we only added a new commit-graph file on every write, we would run into a -linear search problem through many commit-graph files. Instead, we use a merge -strategy to decide when the stack should collapse some number of levels. - -The diagram below shows such a collapse. As a set of new commits are added, it -is determined by the merge strategy that the files should collapse to -`graph-{hash1}`. Thus, the new commits, the commits in `graph-{hash2}` and -the commits in `graph-{hash1}` should be combined into a new `graph-{hash3}` -file. - - +---------------------+ - | | - | (new commits) | - | | - +---------------------+ - | | - +-----------------------+ +---------------------+ - | graph-{hash2} |->| | - +-----------------------+ +---------------------+ - | | | - +-----------------------+ +---------------------+ - | | | | - | graph-{hash1} |->| | - | | | | - +-----------------------+ +---------------------+ - | tmp_graphXXX - +-----------------------+ - | | - | | - | | - | graph-{hash0} | - | | - | | - | | - +-----------------------+ - -During this process, the commits to write are combined, sorted and we write the -contents to a temporary file, all while holding a `commit-graph-chain.lock` -lock-file. When the file is flushed, we rename it to `graph-{hash3}` -according to the computed `{hash3}`. Finally, we write the new chain data to -`commit-graph-chain.lock`: - -``` - {hash3} - {hash0} -``` - -We then close the lock-file. - -## Merge Strategy - -When writing a set of commits that do not exist in the commit-graph stack of -height N, we default to creating a new file at level N + 1. We then decide to -merge with the Nth level if one of two conditions hold: - - 1. `--size-multiple=<X>` is specified or X = 2, and the number of commits in - level N is less than X times the number of commits in level N + 1. - - 2. `--max-commits=<C>` is specified with non-zero C and the number of commits - in level N + 1 is more than C commits. - -This decision cascades down the levels: when we merge a level we create a new -set of commits that then compares to the next level. - -The first condition bounds the number of levels to be logarithmic in the total -number of commits. The second condition bounds the total number of commits in -a `graph-{hashN}` file and not in the `commit-graph` file, preventing -significant performance issues when the stack merges and another process only -partially reads the previous stack. - -The merge strategy values (2 for the size multiple, 64,000 for the maximum -number of commits) could be extracted into config settings for full -flexibility. - -## Deleting graph-{hash} files - -After a new tip file is written, some `graph-{hash}` files may no longer -be part of a chain. It is important to remove these files from disk, eventually. -The main reason to delay removal is that another process could read the -`commit-graph-chain` file before it is rewritten, but then look for the -`graph-{hash}` files after they are deleted. - -To allow holding old split commit-graphs for a while after they are unreferenced, -we update the modified times of the files when they become unreferenced. Then, -we scan the `$OBJDIR/info/commit-graphs/` directory for `graph-{hash}` -files whose modified times are older than a given expiry window. This window -defaults to zero, but can be changed using command-line arguments or a config -setting. - -## Chains across multiple object directories - -In a repo with alternates, we look for the `commit-graph-chain` file starting -in the local object directory and then in each alternate. The first file that -exists defines our chain. As we look for the `graph-{hash}` files for -each `{hash}` in the chain file, we follow the same pattern for the host -directories. - -This allows commit-graphs to be split across multiple forks in a fork network. -The typical case is a large "base" repo with many smaller forks. - -As the base repo advances, it will likely update and merge its commit-graph -chain more frequently than the forks. If a fork updates their commit-graph after -the base repo, then it should "reparent" the commit-graph chain onto the new -chain in the base repo. When reading each `graph-{hash}` file, we track -the object directory containing it. During a write of a new commit-graph file, -we check for any changes in the source object directory and read the -`commit-graph-chain` file for that source and create a new file based on those -files. During this "reparent" operation, we necessarily need to collapse all -levels in the fork, as all of the files are invalid against the new base file. - -It is crucial to be careful when cleaning up "unreferenced" `graph-{hash}.graph` -files in this scenario. It falls to the user to define the proper settings for -their custom environment: - - 1. When merging levels in the base repo, the unreferenced files may still be - referenced by chains from fork repos. - - 2. The expiry time should be set to a length of time such that every fork has - time to recompute their commit-graph chain to "reparent" onto the new base - file(s). - - 3. If the commit-graph chain is updated in the base, the fork will not have - access to the new chain until its chain is updated to reference those files. - (This may change in the future [5].) - -Related Links -------------- -[0] https://bugs.chromium.org/p/git/issues/detail?id=8 - Chromium work item for: Serialized Commit Graph - -[1] https://lore.kernel.org/git/20110713070517.GC18566@sigill.intra.peff.net/ - An abandoned patch that introduced generation numbers. - -[2] https://lore.kernel.org/git/20170908033403.q7e6dj7benasrjes@sigill.intra.peff.net/ - Discussion about generation numbers on commits and how they interact - with fsck. - -[3] https://lore.kernel.org/git/20170908034739.4op3w4f2ma5s65ku@sigill.intra.peff.net/ - More discussion about generation numbers and not storing them inside - commit objects. A valuable quote: - - "I think we should be moving more in the direction of keeping - repo-local caches for optimizations. Reachability bitmaps have been - a big performance win. I think we should be doing the same with our - properties of commits. Not just generation numbers, but making it - cheap to access the graph structure without zlib-inflating whole - commit objects (i.e., packv4 or something like the "metapacks" I - proposed a few years ago)." - -[4] https://lore.kernel.org/git/20180108154822.54829-1-git@jeffhostetler.com/T/#u - A patch to remove the ahead-behind calculation from 'status'. - -[5] https://lore.kernel.org/git/f27db281-abad-5043-6d71-cbb083b1c877@gmail.com/ - A discussion of a "two-dimensional graph position" that can allow reading - multiple commit-graph chains at the same time. |