Diffstat (limited to 'third_party/git/Documentation/technical')
33 files changed, 0 insertions, 9265 deletions
diff --git a/third_party/git/Documentation/technical/.gitignore b/third_party/git/Documentation/technical/.gitignore deleted file mode 100644 index 8aa891daee05..000000000000 --- a/third_party/git/Documentation/technical/.gitignore +++ /dev/null @@ -1 +0,0 @@ -api-index.txt diff --git a/third_party/git/Documentation/technical/api-error-handling.txt b/third_party/git/Documentation/technical/api-error-handling.txt deleted file mode 100644 index ceeedd485c96..000000000000 --- a/third_party/git/Documentation/technical/api-error-handling.txt +++ /dev/null @@ -1,75 +0,0 @@ -Error reporting in git -====================== - -`die`, `usage`, `error`, and `warning` report errors of various -kinds. - -- `die` is for fatal application errors. It prints a message to - the user and exits with status 128. - -- `usage` is for errors in command line usage. After printing its - message, it exits with status 129. (See also `usage_with_options` - in the link:api-parse-options.html[parse-options API].) - -- `error` is for non-fatal library errors. It prints a message - to the user and returns -1 for convenience in signaling the error - to the caller. - -- `warning` is for reporting situations that probably should not - occur but which the user (and Git) can continue to work around - without running into too many problems. Like `error`, it - returns -1 after reporting the situation to the caller. - -Customizable error handlers ---------------------------- - -The default behavior of `die` and `error` is to write a message to -stderr and then exit or return as appropriate. This behavior can be -overridden using `set_die_routine` and `set_error_routine`. For -example, "git daemon" uses set_die_routine to write the reason `die` -was called to syslog before exiting. - -Library errors --------------- - -Functions return a negative integer on error. Details beyond that -vary from function to function: - -- Some functions return -1 for all errors. 
Others return a more - specific value depending on how the caller might want to react - to the error. - -- Some functions report the error to stderr with `error`, - while others leave that for the caller to do. - -- errno is not meaningful on return from most functions (except - for thin wrappers for system calls). - -Check the function's API documentation to be sure. - -Caller-handled errors ---------------------- - -An increasing number of functions take a parameter 'struct strbuf *err'. -On error, such functions append a message about what went wrong to the -'err' strbuf. The message is meant to be complete enough to be passed -to `die` or `error` as-is. For example: - - if (ref_transaction_commit(transaction, &err)) - die("%s", err.buf); - -The 'err' parameter will be untouched if no error occurred, so multiple -function calls can be chained: - - t = ref_transaction_begin(&err); - if (!t || - ref_transaction_update(t, "HEAD", ..., &err) || - ret_transaction_commit(t, &err)) - die("%s", err.buf); - -The 'err' parameter must be a pointer to a valid strbuf. To silence -a message, pass a strbuf that is explicitly ignored: - - if (thing_that_can_fail_in_an_ignorable_way(..., &err)) - /* This failure is okay. */ - strbuf_reset(&err); diff --git a/third_party/git/Documentation/technical/api-index-skel.txt b/third_party/git/Documentation/technical/api-index-skel.txt deleted file mode 100644 index eda8c195c196..000000000000 --- a/third_party/git/Documentation/technical/api-index-skel.txt +++ /dev/null @@ -1,13 +0,0 @@ -Git API Documents -================= - -Git has grown a set of internal API over time. This collection -documents them. 
- -//////////////////////////////////////////////////////////////// -// table of contents begin -//////////////////////////////////////////////////////////////// - -//////////////////////////////////////////////////////////////// -// table of contents end -//////////////////////////////////////////////////////////////// diff --git a/third_party/git/Documentation/technical/api-index.sh b/third_party/git/Documentation/technical/api-index.sh deleted file mode 100755 index 9c3f4131b858..000000000000 --- a/third_party/git/Documentation/technical/api-index.sh +++ /dev/null @@ -1,28 +0,0 @@ -#!/bin/sh - -( - c=//////////////////////////////////////////////////////////////// - skel=api-index-skel.txt - sed -e '/^\/\/ table of contents begin/q' "$skel" - echo "$c" - - ls api-*.txt | - while read filename - do - case "$filename" in - api-index-skel.txt | api-index.txt) continue ;; - esac - title=$(sed -e 1q "$filename") - html=${filename%.txt}.html - echo "* link:$html[$title]" - done - echo "$c" - sed -n -e '/^\/\/ table of contents end/,$p' "$skel" -) >api-index.txt+ - -if test -f api-index.txt && cmp api-index.txt api-index.txt+ >/dev/null -then - rm -f api-index.txt+ -else - mv api-index.txt+ api-index.txt -fi diff --git a/third_party/git/Documentation/technical/api-merge.txt b/third_party/git/Documentation/technical/api-merge.txt deleted file mode 100644 index 487d4d83fff2..000000000000 --- a/third_party/git/Documentation/technical/api-merge.txt +++ /dev/null @@ -1,36 +0,0 @@ -merge API -========= - -The merge API helps a program to reconcile two competing sets of -improvements to some files (e.g., unregistered changes from the work -tree versus changes involved in switching to a new branch), reporting -conflicts if found. The library called through this API is -responsible for a few things. 
- - * determining which trees to merge (recursive ancestor consolidation); - - * lining up corresponding files in the trees to be merged (rename - detection, subtree shifting), reporting edge cases like add/add - and rename/rename conflicts to the user; - - * performing a three-way merge of corresponding files, taking - path-specific merge drivers (specified in `.gitattributes`) - into account. - -Data structures ---------------- - -* `mmbuffer_t`, `mmfile_t` - -These store data usable for use by the xdiff backend, for writing and -for reading, respectively. See `xdiff/xdiff.h` for the definitions -and `diff.c` for examples. - -* `struct ll_merge_options` - -Check ll-merge.h for details. - -Low-level (single file) merge ------------------------------ - -Check ll-merge.h for details. diff --git a/third_party/git/Documentation/technical/api-parse-options.txt b/third_party/git/Documentation/technical/api-parse-options.txt deleted file mode 100644 index 5a60bbfa7f41..000000000000 --- a/third_party/git/Documentation/technical/api-parse-options.txt +++ /dev/null @@ -1,313 +0,0 @@ -parse-options API -================= - -The parse-options API is used to parse and massage options in Git -and to provide a usage help with consistent look. - -Basics ------- - -The argument vector `argv[]` may usually contain mandatory or optional -'non-option arguments', e.g. a filename or a branch, and 'options'. -Options are optional arguments that start with a dash and -that allow to change the behavior of a command. - -* There are basically three types of options: - 'boolean' options, - options with (mandatory) 'arguments' and - options with 'optional arguments' - (i.e. a boolean option that can be adjusted). - -* There are basically two forms of options: - 'Short options' consist of one dash (`-`) and one alphanumeric - character. - 'Long options' begin with two dashes (`--`) and some - alphanumeric characters. - -* Options are case-sensitive. 
- Please define 'lower-case long options' only. - -The parse-options API allows: - -* 'stuck' and 'separate form' of options with arguments. - `-oArg` is stuck, `-o Arg` is separate form. - `--option=Arg` is stuck, `--option Arg` is separate form. - -* Long options may be 'abbreviated', as long as the abbreviation - is unambiguous. - -* Short options may be bundled, e.g. `-a -b` can be specified as `-ab`. - -* Boolean long options can be 'negated' (or 'unset') by prepending - `no-`, e.g. `--no-abbrev` instead of `--abbrev`. Conversely, - options that begin with `no-` can be 'negated' by removing it. - Other long options can be unset (e.g., set string to NULL, set - integer to 0) by prepending `no-`. - -* Options and non-option arguments can clearly be separated using the `--` - option, e.g. `-a -b --option -- --this-is-a-file` indicates that - `--this-is-a-file` must not be processed as an option. - -Steps to parse options ----------------------- - -. `#include "parse-options.h"` - -. define a NULL-terminated - `static const char * const builtin_foo_usage[]` array - containing alternative usage strings - -. define `builtin_foo_options` array as described below - in section 'Data Structure'. - -. in `cmd_foo(int argc, const char **argv, const char *prefix)` - call - - argc = parse_options(argc, argv, prefix, builtin_foo_options, builtin_foo_usage, flags); -+ -`parse_options()` will filter out the processed options of `argv[]` and leave the -non-option arguments in `argv[]`. -`argc` is updated appropriately because of the assignment. -+ -You can also pass NULL instead of a usage array as the fifth parameter of -parse_options(), to avoid displaying a help screen with usage info and -option list. This should only be done if necessary, e.g. to implement -a limited parser for only a subset of the options that needs to be run -before the full parser, which in turn shows the full help message. 
-+ -Flags are the bitwise-or of: - -`PARSE_OPT_KEEP_DASHDASH`:: - Keep the `--` that usually separates options from - non-option arguments. - -`PARSE_OPT_STOP_AT_NON_OPTION`:: - Usually the whole argument vector is massaged and reordered. - Using this flag, processing is stopped at the first non-option - argument. - -`PARSE_OPT_KEEP_ARGV0`:: - Keep the first argument, which contains the program name. It's - removed from argv[] by default. - -`PARSE_OPT_KEEP_UNKNOWN`:: - Keep unknown arguments instead of erroring out. This doesn't - work for all combinations of arguments as users might expect - it to do. E.g. if the first argument in `--unknown --known` - takes a value (which we can't know), the second one is - mistakenly interpreted as a known option. Similarly, if - `PARSE_OPT_STOP_AT_NON_OPTION` is set, the second argument in - `--unknown value` will be mistakenly interpreted as a - non-option, not as a value belonging to the unknown option, - the parser early. That's why parse_options() errors out if - both options are set. - -`PARSE_OPT_NO_INTERNAL_HELP`:: - By default, parse_options() handles `-h`, `--help` and - `--help-all` internally, by showing a help screen. This option - turns it off and allows one to add custom handlers for these - options, or to just leave them unknown. - -Data Structure --------------- - -The main data structure is an array of the `option` struct, -say `static struct option builtin_add_options[]`. -There are some macros to easily define options: - -`OPT__ABBREV(&int_var)`:: - Add `--abbrev[=<n>]`. - -`OPT__COLOR(&int_var, description)`:: - Add `--color[=<when>]` and `--no-color`. - -`OPT__DRY_RUN(&int_var, description)`:: - Add `-n, --dry-run`. - -`OPT__FORCE(&int_var, description)`:: - Add `-f, --force`. - -`OPT__QUIET(&int_var, description)`:: - Add `-q, --quiet`. - -`OPT__VERBOSE(&int_var, description)`:: - Add `-v, --verbose`. - -`OPT_GROUP(description)`:: - Start an option group. 
`description` is a short string that - describes the group or an empty string. - Start the description with an upper-case letter. - -`OPT_BOOL(short, long, &int_var, description)`:: - Introduce a boolean option. `int_var` is set to one with - `--option` and set to zero with `--no-option`. - -`OPT_COUNTUP(short, long, &int_var, description)`:: - Introduce a count-up option. - Each use of `--option` increments `int_var`, starting from zero - (even if initially negative), and `--no-option` resets it to - zero. To determine if `--option` or `--no-option` was encountered at - all, initialize `int_var` to a negative value, and if it is still - negative after parse_options(), then neither `--option` nor - `--no-option` was seen. - -`OPT_BIT(short, long, &int_var, description, mask)`:: - Introduce a boolean option. - If used, `int_var` is bitwise-ored with `mask`. - -`OPT_NEGBIT(short, long, &int_var, description, mask)`:: - Introduce a boolean option. - If used, `int_var` is bitwise-anded with the inverted `mask`. - -`OPT_SET_INT(short, long, &int_var, description, integer)`:: - Introduce an integer option. - `int_var` is set to `integer` with `--option`, and - reset to zero with `--no-option`. - -`OPT_STRING(short, long, &str_var, arg_str, description)`:: - Introduce an option with string argument. - The string argument is put into `str_var`. - -`OPT_STRING_LIST(short, long, &struct string_list, arg_str, description)`:: - Introduce an option with string argument. - The string argument is stored as an element in `string_list`. - Use of `--no-option` will clear the list of preceding values. - -`OPT_INTEGER(short, long, &int_var, description)`:: - Introduce an option with integer argument. - The integer is put into `int_var`. - -`OPT_MAGNITUDE(short, long, &unsigned_long_var, description)`:: - Introduce an option with a size argument. 
The argument must be a - non-negative integer and may include a suffix of 'k', 'm' or 'g' to - scale the provided value by 1024, 1024^2 or 1024^3 respectively. - The scaled value is put into `unsigned_long_var`. - -`OPT_EXPIRY_DATE(short, long, &timestamp_t_var, description)`:: - Introduce an option with expiry date argument, see `parse_expiry_date()`. - The timestamp is put into `timestamp_t_var`. - -`OPT_CALLBACK(short, long, &var, arg_str, description, func_ptr)`:: - Introduce an option with argument. - The argument will be fed into the function given by `func_ptr` - and the result will be put into `var`. - See 'Option Callbacks' below for a more elaborate description. - -`OPT_FILENAME(short, long, &var, description)`:: - Introduce an option with a filename argument. - The filename will be prefixed by passing the filename along with - the prefix argument of `parse_options()` to `prefix_filename()`. - -`OPT_ARGUMENT(long, &int_var, description)`:: - Introduce a long-option argument that will be kept in `argv[]`. - If this option was seen, `int_var` will be set to one (except - if a `NULL` pointer was passed). - -`OPT_NUMBER_CALLBACK(&var, description, func_ptr)`:: - Recognize numerical options like -123 and feed the integer as - if it was an argument to the function given by `func_ptr`. - The result will be put into `var`. There can be only one such - option definition. It cannot be negated and it takes no - arguments. Short options that happen to be digits take - precedence over it. - -`OPT_COLOR_FLAG(short, long, &int_var, description)`:: - Introduce an option that takes an optional argument that can - have one of three values: "always", "never", or "auto". If the - argument is not given, it defaults to "always". The `--no-` form - works like `--long=never`; it cannot take an argument. If - "always", set `int_var` to 1; if "never", set `int_var` to 0; if - "auto", set `int_var` to 1 if stdout is a tty or a pager, - 0 otherwise. 
- -`OPT_NOOP_NOARG(short, long)`:: - Introduce an option that has no effect and takes no arguments. - Use it to hide deprecated options that are still to be recognized - and ignored silently. - -`OPT_PASSTHRU(short, long, &char_var, arg_str, description, flags)`:: - Introduce an option that will be reconstructed into a char* string, - which must be initialized to NULL. This is useful when you need to - pass the command-line option to another command. Any previous value - will be overwritten, so this should only be used for options where - the last one specified on the command line wins. - -`OPT_PASSTHRU_ARGV(short, long, &strvec_var, arg_str, description, flags)`:: - Introduce an option where all instances of it on the command-line will - be reconstructed into a strvec. This is useful when you need to - pass the command-line option, which can be specified multiple times, - to another command. - -`OPT_CMDMODE(short, long, &int_var, description, enum_val)`:: - Define an "operation mode" option, only one of which in the same - group of "operating mode" options that share the same `int_var` - can be given by the user. `enum_val` is set to `int_var` when the - option is used, but an error is reported if other "operating mode" - option has already set its value to the same `int_var`. - - -The last element of the array must be `OPT_END()`. - -If not stated otherwise, interpret the arguments as follows: - -* `short` is a character for the short option - (e.g. `'e'` for `-e`, use `0` to omit), - -* `long` is a string for the long option - (e.g. `"example"` for `--example`, use `NULL` to omit), - -* `int_var` is an integer variable, - -* `str_var` is a string variable (`char *`), - -* `arg_str` is the string that is shown as argument - (e.g. `"branch"` will result in `<branch>`). - If set to `NULL`, three dots (`...`) will be displayed. - -* `description` is a short string to describe the effect of the option. 
- It shall begin with a lower-case letter and a full stop (`.`) shall be - omitted at the end. - -Option Callbacks ----------------- - -The function must be defined in this form: - - int func(const struct option *opt, const char *arg, int unset) - -The callback mechanism is as follows: - -* Inside `func`, the only interesting member of the structure - given by `opt` is the void pointer `opt->value`. - `*opt->value` will be the value that is saved into `var`, if you - use `OPT_CALLBACK()`. - For example, do `*(unsigned long *)opt->value = 42;` to get 42 - into an `unsigned long` variable. - -* Return value `0` indicates success and non-zero return - value will invoke `usage_with_options()` and, thus, die. - -* If the user negates the option, `arg` is `NULL` and `unset` is 1. - -Sophisticated option parsing ----------------------------- - -If you need, for example, option callbacks with optional arguments -or without arguments at all, or if you need other special cases, -that are not handled by the macros above, you need to specify the -members of the `option` structure manually. - -This is not covered in this document, but well documented -in `parse-options.h` itself. - -Examples --------- - -See `test-parse-options.c` and -`builtin/add.c`, -`builtin/clone.c`, -`builtin/commit.c`, -`builtin/fetch.c`, -`builtin/fsck.c`, -`builtin/rm.c` -for real-world examples. diff --git a/third_party/git/Documentation/technical/api-trace2.txt b/third_party/git/Documentation/technical/api-trace2.txt deleted file mode 100644 index 6b6085585d56..000000000000 --- a/third_party/git/Documentation/technical/api-trace2.txt +++ /dev/null @@ -1,1171 +0,0 @@ -= Trace2 API - -The Trace2 API can be used to print debug, performance, and telemetry -information to stderr or a file. The Trace2 feature is inactive unless -explicitly enabled by enabling one or more Trace2 Targets. 
- -The Trace2 API is intended to replace the existing (Trace1) -printf-style tracing provided by the existing `GIT_TRACE` and -`GIT_TRACE_PERFORMANCE` facilities. During initial implementation, -Trace2 and Trace1 may operate in parallel. - -The Trace2 API defines a set of high-level messages with known fields, -such as (`start`: `argv`) and (`exit`: {`exit-code`, `elapsed-time`}). - -Trace2 instrumentation throughout the Git code base sends Trace2 -messages to the enabled Trace2 Targets. Targets transform these -messages content into purpose-specific formats and write events to -their data streams. In this manner, the Trace2 API can drive -many different types of analysis. - -Targets are defined using a VTable allowing easy extension to other -formats in the future. This might be used to define a binary format, -for example. - -Trace2 is controlled using `trace2.*` config values in the system and -global config files and `GIT_TRACE2*` environment variables. Trace2 does -not read from repo local or worktree config files or respect `-c` -command line config settings. - -== Trace2 Targets - -Trace2 defines the following set of Trace2 Targets. -Format details are given in a later section. - -=== The Normal Format Target - -The normal format target is a tradition printf format and similar -to GIT_TRACE format. This format is enabled with the `GIT_TRACE2` -environment variable or the `trace2.normalTarget` system or global -config setting. 
- -For example - ------------- -$ export GIT_TRACE2=~/log.normal -$ git version -git version 2.20.1.155.g426c96fcdb ------------- - -or - ------------- -$ git config --global trace2.normalTarget ~/log.normal -$ git version -git version 2.20.1.155.g426c96fcdb ------------- - -yields - ------------- -$ cat ~/log.normal -12:28:42.620009 common-main.c:38 version 2.20.1.155.g426c96fcdb -12:28:42.620989 common-main.c:39 start git version -12:28:42.621101 git.c:432 cmd_name version (version) -12:28:42.621215 git.c:662 exit elapsed:0.001227 code:0 -12:28:42.621250 trace2/tr2_tgt_normal.c:124 atexit elapsed:0.001265 code:0 ------------- - -=== The Performance Format Target - -The performance format target (PERF) is a column-based format to -replace GIT_TRACE_PERFORMANCE and is suitable for development and -testing, possibly to complement tools like gprof. This format is -enabled with the `GIT_TRACE2_PERF` environment variable or the -`trace2.perfTarget` system or global config setting. - -For example - ------------- -$ export GIT_TRACE2_PERF=~/log.perf -$ git version -git version 2.20.1.155.g426c96fcdb ------------- - -or - ------------- -$ git config --global trace2.perfTarget ~/log.perf -$ git version -git version 2.20.1.155.g426c96fcdb ------------- - -yields - ------------- -$ cat ~/log.perf -12:28:42.620675 common-main.c:38 | d0 | main | version | | | | | 2.20.1.155.g426c96fcdb -12:28:42.621001 common-main.c:39 | d0 | main | start | | 0.001173 | | | git version -12:28:42.621111 git.c:432 | d0 | main | cmd_name | | | | | version (version) -12:28:42.621225 git.c:662 | d0 | main | exit | | 0.001227 | | | code:0 -12:28:42.621259 trace2/tr2_tgt_perf.c:211 | d0 | main | atexit | | 0.001265 | | | code:0 ------------- - -=== The Event Format Target - -The event format target is a JSON-based format of event data suitable -for telemetry analysis. This format is enabled with the `GIT_TRACE2_EVENT` -environment variable or the `trace2.eventTarget` system or global config -setting. 
- -For example - ------------- -$ export GIT_TRACE2_EVENT=~/log.event -$ git version -git version 2.20.1.155.g426c96fcdb ------------- - -or - ------------- -$ git config --global trace2.eventTarget ~/log.event -$ git version -git version 2.20.1.155.g426c96fcdb ------------- - -yields - ------------- -$ cat ~/log.event -{"event":"version","sid":"sid":"20190408T191610.507018Z-H9b68c35f-P000059a8","thread":"main","time":"2019-01-16T17:28:42.620713Z","file":"common-main.c","line":38,"evt":"2","exe":"2.20.1.155.g426c96fcdb"} -{"event":"start","sid":"20190408T191610.507018Z-H9b68c35f-P000059a8","thread":"main","time":"2019-01-16T17:28:42.621027Z","file":"common-main.c","line":39,"t_abs":0.001173,"argv":["git","version"]} -{"event":"cmd_name","sid":"20190408T191610.507018Z-H9b68c35f-P000059a8","thread":"main","time":"2019-01-16T17:28:42.621122Z","file":"git.c","line":432,"name":"version","hierarchy":"version"} -{"event":"exit","sid":"20190408T191610.507018Z-H9b68c35f-P000059a8","thread":"main","time":"2019-01-16T17:28:42.621236Z","file":"git.c","line":662,"t_abs":0.001227,"code":0} -{"event":"atexit","sid":"20190408T191610.507018Z-H9b68c35f-P000059a8","thread":"main","time":"2019-01-16T17:28:42.621268Z","file":"trace2/tr2_tgt_event.c","line":163,"t_abs":0.001265,"code":0} ------------- - -=== Enabling a Target - -To enable a target, set the corresponding environment variable or -system or global config value to one of the following: - -include::../trace2-target-values.txt[] - -When trace files are written to a target directory, they will be named according -to the last component of the SID (optionally followed by a counter to avoid -filename collisions). - -== Trace2 API - -All public Trace2 functions and macros are defined in `trace2.h` and -`trace2.c`. All public symbols are prefixed with `trace2_`. - -There are no public Trace2 data structures. - -The Trace2 code also defines a set of private functions and data types -in the `trace2/` directory. 
These symbols are prefixed with `tr2_` -and should only be used by functions in `trace2.c`. - -== Conventions for Public Functions and Macros - -The functions defined by the Trace2 API are declared and documented -in `trace2.h`. It defines the API functions and wrapper macros for -Trace2. - -Some functions have a `_fl()` suffix to indicate that they take `file` -and `line-number` arguments. - -Some functions have a `_va_fl()` suffix to indicate that they also -take a `va_list` argument. - -Some functions have a `_printf_fl()` suffix to indicate that they also -take a varargs argument. - -There are CPP wrapper macros and ifdefs to hide most of these details. -See `trace2.h` for more details. The following discussion will only -describe the simplified forms. - -== Public API - -All Trace2 API functions send a message to all of the active -Trace2 Targets. This section describes the set of available -messages. - -It helps to divide these functions into groups for discussion -purposes. - -=== Basic Command Messages - -These are concerned with the lifetime of the overall git process. -e.g: `void trace2_initialize_clock()`, `void trace2_initialize()`, -`int trace2_is_enabled()`, `void trace2_cmd_start(int argc, const char **argv)`. - -=== Command Detail Messages - -These are concerned with describing the specific Git command -after the command line, config, and environment are inspected. -e.g: `void trace2_cmd_name(const char *name)`, -`void trace2_cmd_mode(const char *mode)`. - -=== Child Process Messages - -These are concerned with the various spawned child processes, -including shell scripts, git commands, editors, pagers, and hooks. - -e.g: `void trace2_child_start(struct child_process *cmd)`. - -=== Git Thread Messages - -These messages are concerned with Git thread usage. - -e.g: `void trace2_thread_start(const char *thread_name)`. - -=== Region and Data Messages - -These are concerned with recording performance data -over regions or spans of code. 
e.g: -`void trace2_region_enter(const char *category, const char *label, const struct repository *repo)`. - -Refer to trace2.h for details about all trace2 functions. - -== Trace2 Target Formats - -=== NORMAL Format - -Events are written as lines of the form: - ------------- -[<time> SP <filename>:<line> SP+] <event-name> [[SP] <event-message>] LF ------------- - -`<event-name>`:: - - is the event name. - -`<event-message>`:: - is a free-form printf message intended for human consumption. -+ -Note that this may contain embedded LF or CRLF characters that are -not escaped, so the event may spill across multiple lines. - -If `GIT_TRACE2_BRIEF` or `trace2.normalBrief` is true, the `time`, `filename`, -and `line` fields are omitted. - -This target is intended to be more of a summary (like GIT_TRACE) and -less detailed than the other targets. It ignores thread, region, and -data messages, for example. - -=== PERF Format - -Events are written as lines of the form: - ------------- -[<time> SP <filename>:<line> SP+ - BAR SP] d<depth> SP - BAR SP <thread-name> SP+ - BAR SP <event-name> SP+ - BAR SP [r<repo-id>] SP+ - BAR SP [<t_abs>] SP+ - BAR SP [<t_rel>] SP+ - BAR SP [<category>] SP+ - BAR SP DOTS* <perf-event-message> - LF ------------- - -`<depth>`:: - is the git process depth. This is the number of parent - git processes. A top-level git command has depth value "d0". - A child of it has depth value "d1". A second level child - has depth value "d2" and so on. - -`<thread-name>`:: - is a unique name for the thread. The primary thread - is called "main". Other thread names are of the form "th%d:%s" - and include a unique number and the name of the thread-proc. - -`<event-name>`:: - is the event name. - -`<repo-id>`:: - when present, is a number indicating the repository - in use. A `def_repo` event is emitted when a repository is - opened. This defines the repo-id and associated worktree. - Subsequent repo-specific events will reference this repo-id. 
-+ -Currently, this is always "r1" for the main repository. -This field is in anticipation of in-proc submodules in the future. - -`<t_abs>`:: - when present, is the absolute time in seconds since the - program started. - -`<t_rel>`:: - when present, is time in seconds relative to the start of - the current region. For a thread-exit event, it is the elapsed - time of the thread. - -`<category>`:: - is present on region and data events and is used to - indicate a broad category, such as "index" or "status". - -`<perf-event-message>`:: - is a free-form printf message intended for human consumption. - ------------- -15:33:33.532712 wt-status.c:2310 | d0 | main | region_enter | r1 | 0.126064 | | status | label:print -15:33:33.532712 wt-status.c:2331 | d0 | main | region_leave | r1 | 0.127568 | 0.001504 | status | label:print ------------- - -If `GIT_TRACE2_PERF_BRIEF` or `trace2.perfBrief` is true, the `time`, `file`, -and `line` fields are omitted. - ------------- -d0 | main | region_leave | r1 | 0.011717 | 0.009122 | index | label:preload ------------- - -The PERF target is intended for interactive performance analysis -during development and is quite noisy. - -=== EVENT Format - -Each event is a JSON-object containing multiple key/value pairs -written as a single line and followed by a LF. - ------------- -'{' <key> ':' <value> [',' <key> ':' <value>]* '}' LF ------------- - -Some key/value pairs are common to all events and some are -event-specific. - -==== Common Key/Value Pairs - -The following key/value pairs are common to all events: - ------------- -{ - "event":"version", - "sid":"20190408T191827.272759Z-H9b68c35f-P00003510", - "thread":"main", - "time":"2019-04-08T19:18:27.282761Z", - "file":"common-main.c", - "line":42, - ... -} ------------- - -`"event":<event>`:: - is the event name. - -`"sid":<sid>`:: - is the session-id. This is a unique string to identify the - process instance to allow all events emitted by a process to - be identified. 
A session-id is used instead of a PID because - PIDs are recycled by the OS. For child git processes, the - session-id is prepended with the session-id of the parent git - process to allow parent-child relationships to be identified - during post-processing. - -`"thread":<thread>`:: - is the thread name. - -`"time":<time>`:: - is the UTC time of the event. - -`"file":<filename>`:: - is source file generating the event. - -`"line":<line-number>`:: - is the integer source line number generating the event. - -`"repo":<repo-id>`:: - when present, is the integer repo-id as described previously. - -If `GIT_TRACE2_EVENT_BRIEF` or `trace2.eventBrief` is true, the `file` -and `line` fields are omitted from all events and the `time` field is -only present on the "start" and "atexit" events. - -==== Event-Specific Key/Value Pairs - -`"version"`:: - This event gives the version of the executable and the EVENT format. It - should always be the first event in a trace session. The EVENT format - version will be incremented if new event types are added, if existing - fields are removed, or if there are significant changes in - interpretation of existing events or fields. Smaller changes, such as - adding a new field to an existing event, will not require an increment - to the EVENT format version. -+ ------------- -{ - "event":"version", - ... - "evt":"2", # EVENT format version - "exe":"2.20.1.155.g426c96fcdb" # git version -} ------------- - -`"discard"`:: - This event is written to the git-trace2-discard sentinel file if there - are too many files in the target trace directory (see the - trace2.maxFiles config option). -+ ------------- -{ - "event":"discard", - ... -} ------------- - -`"start"`:: - This event contains the complete argv received by main(). -+ ------------- -{ - "event":"start", - ... - "t_abs":0.001227, # elapsed time in seconds - "argv":["git","version"] -} ------------- - -`"exit"`:: - This event is emitted when git calls `exit()`. 
-+ ------------- -{ - "event":"exit", - ... - "t_abs":0.001227, # elapsed time in seconds - "code":0 # exit code -} ------------- - -`"atexit"`:: - This event is emitted by the Trace2 `atexit` routine during - final shutdown. It should be the last event emitted by the - process. -+ -(The elapsed time reported here is greater than the time reported in -the "exit" event because it runs after all other atexit tasks have -completed.) -+ ------------- -{ - "event":"atexit", - ... - "t_abs":0.001227, # elapsed time in seconds - "code":0 # exit code -} ------------- - -`"signal"`:: - This event is emitted when the program is terminated by a user - signal. Depending on the platform, the signal event may - prevent the "atexit" event from being generated. -+ ------------- -{ - "event":"signal", - ... - "t_abs":0.001227, # elapsed time in seconds - "signo":13 # SIGTERM, SIGINT, etc. -} ------------- - -`"error"`:: - This event is emitted when one of the `error()`, `die()`, - or `usage()` functions are called. -+ ------------- -{ - "event":"error", - ... - "msg":"invalid option: --cahced", # formatted error message - "fmt":"invalid option: %s" # error format string -} ------------- -+ -The error event may be emitted more than once. The format string -allows post-processors to group errors by type without worrying -about specific error arguments. - -`"cmd_path"`:: - This event contains the discovered full path of the git - executable (on platforms that are configured to resolve it). -+ ------------- -{ - "event":"cmd_path", - ... - "path":"C:/work/gfw/git.exe" -} ------------- - -`"cmd_name"`:: - This event contains the command name for this git process - and the hierarchy of commands from parent git processes. -+ ------------- -{ - "event":"cmd_name", - ... - "name":"pack-objects", - "hierarchy":"push/pack-objects" -} ------------- -+ -Normally, the "name" field contains the canonical name of the -command. 
When a canonical name is not available, one of -these special values is used: -+ ------------- -"_query_" # "git --html-path" -"_run_dashed_" # when "git foo" tries to run "git-foo" -"_run_shell_alias_" # alias expansion to a shell command -"_run_git_alias_" # alias expansion to a git command -"_usage_" # usage error ------------- - -`"cmd_mode"`:: - This event, when present, describes the command variant. This - event may be emitted more than once. -+ ------------- -{ - "event":"cmd_mode", - ... - "name":"branch" -} ------------- -+ -The "name" field is an arbitrary string to describe the command mode. -For example, checkout can check out a branch or an individual file. -These variations typically have different performance -characteristics that are not comparable. - -`"alias"`:: - This event is present when an alias is expanded. -+ ------------- -{ - "event":"alias", - ... - "alias":"l", # registered alias - "argv":["log","--graph"] # alias expansion -} ------------- - -`"child_start"`:: - This event describes a child process that is about to be - spawned. -+ ------------- -{ - "event":"child_start", - ... - "child_id":2, - "child_class":"?", - "use_shell":false, - "argv":["git","rev-list","--objects","--stdin","--not","--all","--quiet"] - - "hook_name":"<hook_name>" # present when child_class is "hook" - "cd":"<path>" # present when cd is required -} ------------- -+ -The "child_id" field can be used to match this child_start with the -corresponding child_exit event. -+ -The "child_class" field is a rough classification, such as "editor", -"pager", "transport/*", and "hook". Unclassified children are classified -with "?". - -`"child_exit"`:: - This event is generated after the current process has returned - from the waitpid() and collected the exit information from the - child. -+ ------------- -{ - "event":"child_exit", - ... 
- "child_id":2, - "pid":14708, # child PID - "code":0, # child exit-code - "t_rel":0.110605 # observed run-time of child process -} ------------- -+ -Note that the session-id of the child process is not available to -the current/spawning process, so the child's PID is reported here as -a hint for post-processing. (But it is only a hint because the child -process may be a shell script which doesn't have a session-id.) -+ -Note that the `t_rel` field contains the observed run time in seconds -for the child process (starting before the fork/exec/spawn and -stopping after the waitpid() and includes OS process creation overhead). -So this time will be slightly larger than the atexit time reported by -the child process itself. - -`"exec"`:: - This event is generated before git attempts to `exec()` - another command rather than starting a child process. -+ ------------- -{ - "event":"exec", - ... - "exec_id":0, - "exe":"git", - "argv":["foo", "bar"] -} ------------- -+ -The "exec_id" field is a command-unique id and is only useful if the -`exec()` fails and a corresponding exec_result event is generated. - -`"exec_result"`:: - This event is generated if the `exec()` fails and control - returns to the current git command. -+ ------------- -{ - "event":"exec_result", - ... - "exec_id":0, - "code":1 # error code (errno) from exec() -} ------------- - -`"thread_start"`:: - This event is generated when a thread is started. It is - generated from *within* the new thread's thread-proc (for TLS - reasons). -+ ------------- -{ - "event":"thread_start", - ... - "thread":"th02:preload_thread" # thread name -} ------------- - -`"thread_exit"`:: - This event is generated when a thread exits. It is generated - from *within* the thread's thread-proc (for TLS reasons). -+ ------------- -{ - "event":"thread_exit", - ... 
- "thread":"th02:preload_thread", # thread name - "t_rel":0.007328 # thread elapsed time -} ------------- - -`"def_param"`:: - This event is generated to log a global parameter, such as a config - setting, command-line flag, or environment variable. -+ ------------- -{ - "event":"def_param", - ... - "param":"core.abbrev", - "value":"7" -} ------------- - -`"def_repo"`:: - This event defines a repo-id and associates it with the root - of the worktree. -+ ------------- -{ - "event":"def_repo", - ... - "repo":1, - "worktree":"/Users/jeffhost/work/gfw" -} ------------- -+ -As stated earlier, the repo-id is currently always 1, so there will -only be one def_repo event. Later, if in-proc submodules are -supported, a def_repo event should be emitted for each submodule -visited. - -`"region_enter"`:: - This event is generated when entering a region. -+ ------------- -{ - "event":"region_enter", - ... - "repo":1, # optional - "nesting":1, # current region stack depth - "category":"index", # optional - "label":"do_read_index", # optional - "msg":".git/index" # optional -} ------------- -+ -The `category` field may be used in a future enhancement to -do category-based filtering. -+ -`GIT_TRACE2_EVENT_NESTING` or `trace2.eventNesting` can be used to -filter deeply nested regions and data events. It defaults to "2". - -`"region_leave"`:: - This event is generated when leaving a region. -+ ------------- -{ - "event":"region_leave", - ... - "repo":1, # optional - "t_rel":0.002876, # time spent in region in seconds - "nesting":1, # region stack depth - "category":"index", # optional - "label":"do_read_index", # optional - "msg":".git/index" # optional -} ------------- - -`"data"`:: - This event is generated to log a thread- and region-local - key/value pair. -+ ------------- -{ - "event":"data", - ... 
- "repo":1, # optional - "t_abs":0.024107, # absolute elapsed time - "t_rel":0.001031, # elapsed time in region/thread - "nesting":2, # region stack depth - "category":"index", - "key":"read/cache_nr", - "value":"3552" -} ------------- -+ -The "value" field may be an integer or a string. - -`"data-json"`:: - This event is generated to log a pre-formatted JSON string - containing structured data. -+ ------------- -{ - "event":"data_json", - ... - "repo":1, # optional - "t_abs":0.015905, - "t_rel":0.015905, - "nesting":1, - "category":"process", - "key":"windows/ancestry", - "value":["bash.exe","bash.exe"] -} ------------- - -== Example Trace2 API Usage - -Here is a hypothetical usage of the Trace2 API showing the intended -usage (without worrying about the actual Git details). - -Initialization:: - - Initialization happens in `main()`. Behind the scenes, an - `atexit` and `signal` handler are registered. -+ ----------------- -int main(int argc, const char **argv) -{ - int exit_code; - - trace2_initialize(); - trace2_cmd_start(argv); - - exit_code = cmd_main(argc, argv); - - trace2_cmd_exit(exit_code); - - return exit_code; -} ----------------- - -Command Details:: - - After the basics are established, additional command - information can be sent to Trace2 as it is discovered. -+ ----------------- -int cmd_checkout(int argc, const char **argv) -{ - trace2_cmd_name("checkout"); - trace2_cmd_mode("branch"); - trace2_def_repo(the_repository); - - // emit "def_param" messages for "interesting" config settings. - trace2_cmd_list_config(); - - if (do_something()) - trace2_cmd_error("Path '%s': cannot do something", path); - - return 0; -} ----------------- - -Child Processes:: - - Wrap code spawning child processes. -+ ----------------- -void run_child(...) -{ - int child_exit_code; - struct child_process cmd = CHILD_PROCESS_INIT; - ... 
- cmd.trace2_child_class = "editor"; - - trace2_child_start(&cmd); - child_exit_code = spawn_child_and_wait_for_it(); - trace2_child_exit(&cmd, child_exit_code); -} ----------------- -+ -For example, the following fetch command spawned ssh, index-pack, -rev-list, and gc. This example also shows that fetch took -5.199 seconds, of which 4.932 seconds were spent in ssh. -+ ----------------- -$ export GIT_TRACE2_BRIEF=1 -$ export GIT_TRACE2=~/log.normal -$ git fetch origin -... ----------------- -+ ----------------- -$ cat ~/log.normal -version 2.20.1.vfs.1.1.47.g534dbe1ad1 -start git fetch origin -worktree /Users/jeffhost/work/gfw -cmd_name fetch (fetch) -child_start[0] ssh git@github.com ... -child_start[1] git index-pack ... -... (Trace2 events from child processes omitted) -child_exit[1] pid:14707 code:0 elapsed:0.076353 -child_exit[0] pid:14706 code:0 elapsed:4.931869 -child_start[2] git rev-list ... -... (Trace2 events from child process omitted) -child_exit[2] pid:14708 code:0 elapsed:0.110605 -child_start[3] git gc --auto -... (Trace2 events from child process omitted) -child_exit[3] pid:14709 code:0 elapsed:0.006240 -exit elapsed:5.198503 code:0 -atexit elapsed:5.198541 code:0 ----------------- -+ -When a git process is a (direct or indirect) child of another -git process, it inherits Trace2 context information. This -allows the child to print the command hierarchy. This example -shows gc as child[3] of fetch. When the gc process reports -its name as "gc", it also reports the hierarchy as "fetch/gc". -(In this example, trace2 messages from the child process are -indented for clarity.) -+ ----------------- -$ export GIT_TRACE2_BRIEF=1 -$ export GIT_TRACE2=~/log.normal -$ git fetch origin -... ----------------- -+ ----------------- -$ cat ~/log.normal -version 2.20.1.160.g5676107ecd.dirty -start git fetch official -worktree /Users/jeffhost/work/gfw -cmd_name fetch (fetch) -... 
-child_start[3] git gc --auto - version 2.20.1.160.g5676107ecd.dirty - start /Users/jeffhost/work/gfw/git gc --auto - worktree /Users/jeffhost/work/gfw - cmd_name gc (fetch/gc) - exit elapsed:0.001959 code:0 - atexit elapsed:0.001997 code:0 -child_exit[3] pid:20303 code:0 elapsed:0.007564 -exit elapsed:3.868938 code:0 -atexit elapsed:3.868970 code:0 ----------------- - -Regions:: - - Regions can be used to time an interesting section of code. -+ ----------------- -void wt_status_collect(struct wt_status *s) -{ - trace2_region_enter("status", "worktrees", s->repo); - wt_status_collect_changes_worktree(s); - trace2_region_leave("status", "worktrees", s->repo); - - trace2_region_enter("status", "index", s->repo); - wt_status_collect_changes_index(s); - trace2_region_leave("status", "index", s->repo); - - trace2_region_enter("status", "untracked", s->repo); - wt_status_collect_untracked(s); - trace2_region_leave("status", "untracked", s->repo); -} - -void wt_status_print(struct wt_status *s) -{ - trace2_region_enter("status", "print", s->repo); - switch (s->status_format) { - ... - } - trace2_region_leave("status", "print", s->repo); -} ----------------- -+ -In this example, scanning for untracked files ran from +0.012568 to -+0.027149 (since the process started) and took 0.014581 seconds. -+ ----------------- -$ export GIT_TRACE2_PERF_BRIEF=1 -$ export GIT_TRACE2_PERF=~/log.perf -$ git status -... - -$ cat ~/log.perf -d0 | main | version | | | | | 2.20.1.160.g5676107ecd.dirty -d0 | main | start | | 0.001173 | | | git status -d0 | main | def_repo | r1 | | | | worktree:/Users/jeffhost/work/gfw -d0 | main | cmd_name | | | | | status (status) -... 
-d0 | main | region_enter | r1 | 0.010988 | | status | label:worktrees -d0 | main | region_leave | r1 | 0.011236 | 0.000248 | status | label:worktrees -d0 | main | region_enter | r1 | 0.011260 | | status | label:index -d0 | main | region_leave | r1 | 0.012542 | 0.001282 | status | label:index -d0 | main | region_enter | r1 | 0.012568 | | status | label:untracked -d0 | main | region_leave | r1 | 0.027149 | 0.014581 | status | label:untracked -d0 | main | region_enter | r1 | 0.027411 | | status | label:print -d0 | main | region_leave | r1 | 0.028741 | 0.001330 | status | label:print -d0 | main | exit | | 0.028778 | | | code:0 -d0 | main | atexit | | 0.028809 | | | code:0 ----------------- -+ -Regions may be nested. This causes messages to be indented in the -PERF target, for example. -Elapsed times are relative to the start of the corresponding nesting -level as expected. For example, if we add region message to: -+ ----------------- -static enum path_treatment read_directory_recursive(struct dir_struct *dir, - struct index_state *istate, const char *base, int baselen, - struct untracked_cache_dir *untracked, int check_only, - int stop_at_first_file, const struct pathspec *pathspec) -{ - enum path_treatment state, subdir_state, dir_state = path_none; - - trace2_region_enter_printf("dir", "read_recursive", NULL, "%.*s", baselen, base); - ... - trace2_region_leave_printf("dir", "read_recursive", NULL, "%.*s", baselen, base); - return dir_state; -} ----------------- -+ -We can further investigate the time spent scanning for untracked files. -+ ----------------- -$ export GIT_TRACE2_PERF_BRIEF=1 -$ export GIT_TRACE2_PERF=~/log.perf -$ git status -... -$ cat ~/log.perf -d0 | main | version | | | | | 2.20.1.162.gb4ccea44db.dirty -d0 | main | start | | 0.001173 | | | git status -d0 | main | def_repo | r1 | | | | worktree:/Users/jeffhost/work/gfw -d0 | main | cmd_name | | | | | status (status) -... 
-d0 | main | region_enter | r1 | 0.015047 | | status | label:untracked -d0 | main | region_enter | | 0.015132 | | dir | ..label:read_recursive -d0 | main | region_enter | | 0.016341 | | dir | ....label:read_recursive vcs-svn/ -d0 | main | region_leave | | 0.016422 | 0.000081 | dir | ....label:read_recursive vcs-svn/ -d0 | main | region_enter | | 0.016446 | | dir | ....label:read_recursive xdiff/ -d0 | main | region_leave | | 0.016522 | 0.000076 | dir | ....label:read_recursive xdiff/ -d0 | main | region_enter | | 0.016612 | | dir | ....label:read_recursive git-gui/ -d0 | main | region_enter | | 0.016698 | | dir | ......label:read_recursive git-gui/po/ -d0 | main | region_enter | | 0.016810 | | dir | ........label:read_recursive git-gui/po/glossary/ -d0 | main | region_leave | | 0.016863 | 0.000053 | dir | ........label:read_recursive git-gui/po/glossary/ -... -d0 | main | region_enter | | 0.031876 | | dir | ....label:read_recursive builtin/ -d0 | main | region_leave | | 0.032270 | 0.000394 | dir | ....label:read_recursive builtin/ -d0 | main | region_leave | | 0.032414 | 0.017282 | dir | ..label:read_recursive -d0 | main | region_leave | r1 | 0.032454 | 0.017407 | status | label:untracked -... -d0 | main | exit | | 0.034279 | | | code:0 -d0 | main | atexit | | 0.034322 | | | code:0 ----------------- -+ -Trace2 regions are similar to the existing trace_performance_enter() -and trace_performance_leave() routines, but are thread safe and -maintain per-thread stacks of timers. - -Data Messages:: - - Data messages added to a region. -+ ----------------- -int read_index_from(struct index_state *istate, const char *path, - const char *gitdir) -{ - trace2_region_enter_printf("index", "do_read_index", the_repository, "%s", path); - - ... 
- - trace2_data_intmax("index", the_repository, "read/version", istate->version); - trace2_data_intmax("index", the_repository, "read/cache_nr", istate->cache_nr); - - trace2_region_leave_printf("index", "do_read_index", the_repository, "%s", path); -} ----------------- -+ -This example shows that the index contained 3552 entries. -+ ----------------- -$ export GIT_TRACE2_PERF_BRIEF=1 -$ export GIT_TRACE2_PERF=~/log.perf -$ git status -... -$ cat ~/log.perf -d0 | main | version | | | | | 2.20.1.156.gf9916ae094.dirty -d0 | main | start | | 0.001173 | | | git status -d0 | main | def_repo | r1 | | | | worktree:/Users/jeffhost/work/gfw -d0 | main | cmd_name | | | | | status (status) -d0 | main | region_enter | r1 | 0.001791 | | index | label:do_read_index .git/index -d0 | main | data | r1 | 0.002494 | 0.000703 | index | ..read/version:2 -d0 | main | data | r1 | 0.002520 | 0.000729 | index | ..read/cache_nr:3552 -d0 | main | region_leave | r1 | 0.002539 | 0.000748 | index | label:do_read_index .git/index -... ----------------- - -Thread Events:: - - Thread messages added to a thread-proc. -+ -For example, the multithreaded preload-index code can be -instrumented with a region around the thread pool and then -per-thread start and exit events within the threadproc. -+ ----------------- -static void *preload_thread(void *_data) -{ - // start the per-thread clock and emit a message. - trace2_thread_start("preload_thread"); - - // report which chunk of the array this thread was assigned. - trace2_data_intmax("index", the_repository, "offset", p->offset); - trace2_data_intmax("index", the_repository, "count", nr); - - do { - ... - } while (--nr > 0); - ... - - // report elapsed time taken by this thread. - trace2_thread_exit(); - return NULL; -} - -void preload_index(struct index_state *index, - const struct pathspec *pathspec, - unsigned int refresh_flags) -{ - trace2_region_enter("index", "preload", the_repository); - - for (i = 0; i < threads; i++) { - ... 
/* create thread */ - } - - for (i = 0; i < threads; i++) { - ... /* join thread */ - } - - trace2_region_leave("index", "preload", the_repository); -} ----------------- -+ -In this example preload_index() was executed by the `main` thread -and started the `preload` region. Seven threads, named -`th01:preload_thread` through `th07:preload_thread`, were started. -Events from each thread are atomically appended to the shared target -stream as they occur, so they may appear in random order with respect -to other threads. Finally, the main thread waits for the threads to -finish and leaves the region. -+ -Data events are tagged with the active thread name. They are used -to report the per-thread parameters. -+ ----------------- -$ export GIT_TRACE2_PERF_BRIEF=1 -$ export GIT_TRACE2_PERF=~/log.perf -$ git status -... -$ cat ~/log.perf -... -d0 | main | region_enter | r1 | 0.002595 | | index | label:preload -d0 | th01:preload_thread | thread_start | | 0.002699 | | | -d0 | th02:preload_thread | thread_start | | 0.002721 | | | -d0 | th01:preload_thread | data | r1 | 0.002736 | 0.000037 | index | offset:0 -d0 | th02:preload_thread | data | r1 | 0.002751 | 0.000030 | index | offset:2032 -d0 | th03:preload_thread | thread_start | | 0.002711 | | | -d0 | th06:preload_thread | thread_start | | 0.002739 | | | -d0 | th01:preload_thread | data | r1 | 0.002766 | 0.000067 | index | count:508 -d0 | th06:preload_thread | data | r1 | 0.002856 | 0.000117 | index | offset:2540 -d0 | th03:preload_thread | data | r1 | 0.002824 | 0.000113 | index | offset:1016 -d0 | th04:preload_thread | thread_start | | 0.002710 | | | -d0 | th02:preload_thread | data | r1 | 0.002779 | 0.000058 | index | count:508 -d0 | th06:preload_thread | data | r1 | 0.002966 | 0.000227 | index | count:508 -d0 | th07:preload_thread | thread_start | | 0.002741 | | | -d0 | th07:preload_thread | data | r1 | 0.003017 | 0.000276 | index | offset:3048 -d0 | th05:preload_thread | thread_start | | 0.002712 | | | -d0 | 
th05:preload_thread | data | r1 | 0.003067 | 0.000355 | index | offset:1524 -d0 | th05:preload_thread | data | r1 | 0.003090 | 0.000378 | index | count:508 -d0 | th07:preload_thread | data | r1 | 0.003037 | 0.000296 | index | count:504 -d0 | th03:preload_thread | data | r1 | 0.002971 | 0.000260 | index | count:508 -d0 | th04:preload_thread | data | r1 | 0.002983 | 0.000273 | index | offset:508 -d0 | th04:preload_thread | data | r1 | 0.007311 | 0.004601 | index | count:508 -d0 | th05:preload_thread | thread_exit | | 0.008781 | 0.006069 | | -d0 | th01:preload_thread | thread_exit | | 0.009561 | 0.006862 | | -d0 | th03:preload_thread | thread_exit | | 0.009742 | 0.007031 | | -d0 | th06:preload_thread | thread_exit | | 0.009820 | 0.007081 | | -d0 | th02:preload_thread | thread_exit | | 0.010274 | 0.007553 | | -d0 | th07:preload_thread | thread_exit | | 0.010477 | 0.007736 | | -d0 | th04:preload_thread | thread_exit | | 0.011657 | 0.008947 | | -d0 | main | region_leave | r1 | 0.011717 | 0.009122 | index | label:preload -... -d0 | main | exit | | 0.029996 | | | code:0 -d0 | main | atexit | | 0.030027 | | | code:0 ----------------- -+ -In this example, the preload region took 0.009122 seconds. The 7 threads -took between 0.006069 and 0.008947 seconds to work on their portion of -the index. Thread "th01" worked on 508 items at offset 0. Thread "th02" -worked on 508 items at offset 2032. Thread "th04" worked on 508 items -at offset 508. -+ -This example also shows that thread names are assigned in a racy manner -as each thread starts and allocates TLS storage. - -== Future Work - -=== Relationship to the Existing Trace Api (api-trace.txt) - -There are a few issues to resolve before we can completely -switch to Trace2. - -* Updating existing tests that assume GIT_TRACE format messages. - -* How to best handle custom GIT_TRACE_<key> messages? - -** The GIT_TRACE_<key> mechanism allows each <key> to write to a -different file (in addition to just stderr). 
- -** Do we want to maintain that ability or simply write to the existing -Trace2 targets (and convert <key> to a "category"). diff --git a/third_party/git/Documentation/technical/bitmap-format.txt b/third_party/git/Documentation/technical/bitmap-format.txt deleted file mode 100644 index f8c18a0f7aec..000000000000 --- a/third_party/git/Documentation/technical/bitmap-format.txt +++ /dev/null @@ -1,164 +0,0 @@ -GIT bitmap v1 format -==================== - - - A header appears at the beginning: - - 4-byte signature: {'B', 'I', 'T', 'M'} - - 2-byte version number (network byte order) - The current implementation only supports version 1 - of the bitmap index (the same one as JGit). - - 2-byte flags (network byte order) - - The following flags are supported: - - - BITMAP_OPT_FULL_DAG (0x1) REQUIRED - This flag must always be present. It implies that the bitmap - index has been generated for a packfile with full closure - (i.e. where every single object in the packfile can find - its parent links inside the same packfile). This is a - requirement for the bitmap index format, also present in JGit, - that greatly reduces the complexity of the implementation. - - - BITMAP_OPT_HASH_CACHE (0x4) - If present, the end of the bitmap file contains - `N` 32-bit name-hash values, one per object in the - pack. The format and meaning of the name-hash is - described below. - - 4-byte entry count (network byte order) - - The total count of entries (bitmapped commits) in this bitmap index. - - 20-byte checksum - - The SHA1 checksum of the pack this bitmap index belongs to. - - - 4 EWAH bitmaps that act as type indexes - - Type indexes are serialized after the hash cache in the shape - of four EWAH bitmaps stored consecutively (see Appendix A for - the serialization format of an EWAH bitmap). 
- - There is a bitmap for each Git object type, stored in the following - order: - - - Commits - - Trees - - Blobs - - Tags - - In each bitmap, the `n`th bit is set to true if the `n`th object - in the packfile is of that type. - - The obvious consequence is that the OR of all 4 bitmaps will result - in a full set (all bits set), and the AND of all 4 bitmaps will - result in an empty bitmap (no bits set). - - - N entries with compressed bitmaps, one for each indexed commit - - Where `N` is the total amount of entries in this bitmap index. - Each entry contains the following: - - - 4-byte object position (network byte order) - The position **in the index for the packfile** where the - bitmap for this commit is found. - - - 1-byte XOR-offset - The xor offset used to compress this bitmap. For an entry - in position `x`, a XOR offset of `y` means that the actual - bitmap representing this commit is composed by XORing the - bitmap for this entry with the bitmap in entry `x-y` (i.e. - the bitmap `y` entries before this one). - - Note that this compression can be recursive. In order to - XOR this entry with a previous one, the previous entry needs - to be decompressed first, and so on. - - The hard-limit for this offset is 160 (an entry can only be - xor'ed against one of the 160 entries preceding it). This - number is always positive, and hence entries are always xor'ed - with **previous** bitmaps, not bitmaps that will come afterwards - in the index. - - - 1-byte flags for this bitmap - At the moment the only available flag is `0x1`, which hints - that this bitmap can be re-used when rebuilding bitmap indexes - for the repository. - - - The compressed bitmap itself, see Appendix A. 
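The XOR-offset rule above can be sketched in C. This is a minimal illustration under stated assumptions, not Git's or JGit's implementation: it assumes every bitmap has already been EWAH-decoded into a fixed array of 64-bit words, and it treats an offset of 0 as "stored without XOR", which the text above does not state. The names `struct bitmap_entry`, `resolve_bitmap`, and `NWORDS` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NWORDS 1 /* hypothetical: 64-bit words per uncompressed bitmap */

/* Hypothetical in-memory form of one entry after EWAH decoding. */
struct bitmap_entry {
	uint8_t xor_offset;      /* assumption: 0 means "not XOR-compressed" */
	uint64_t stored[NWORDS]; /* bits exactly as stored in the file */
};

/*
 * Recover the real bitmap for entry `pos` by XORing its stored bits
 * with the (recursively resolved) bitmap of entry `pos - xor_offset`.
 */
static void resolve_bitmap(const struct bitmap_entry *entries, size_t pos,
			   uint64_t out[NWORDS])
{
	const struct bitmap_entry *e = &entries[pos];
	uint64_t base[NWORDS];
	size_t i;

	memcpy(out, e->stored, sizeof(e->stored));
	if (!e->xor_offset)
		return;

	/* Offsets only point backwards and are capped at 160. */
	assert(e->xor_offset <= 160 && e->xor_offset <= pos);

	resolve_bitmap(entries, pos - e->xor_offset, base);
	for (i = 0; i < NWORDS; i++)
		out[i] ^= base[i];
}
```

Because resolving one entry may walk a chain of earlier entries, a real reader would probably keep recently resolved bitmaps around instead of recomputing them on every lookup.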
- -== Appendix A: Serialization format for an EWAH bitmap - -Ewah bitmaps are serialized in the same protocol as the JAVAEWAH -library, making them backwards compatible with the JGit -implementation: - - - 4-byte number of bits of the resulting UNCOMPRESSED bitmap - - - 4-byte number of words of the COMPRESSED bitmap, when stored - - - N x 8-byte words, as specified by the previous field - - This is the actual content of the compressed bitmap. - - - 4-byte position of the current RLW for the compressed - bitmap - -All words are stored in network byte order for their corresponding -sizes. - -The compressed bitmap is stored in a form of run-length encoding, as -follows. It consists of a concatenation of an arbitrary number of -chunks. Each chunk consists of one or more 64-bit words - - H L_1 L_2 L_3 .... L_M - -H is called RLW (run length word). It consists of (from lower to higher -order bits): - - - 1 bit: the repeated bit B - - - 32 bits: repetition count K (unsigned) - - - 31 bits: literal word count M (unsigned) - -The bitstream represented by the above chunk is then: - - - K repetitions of B - - - The bits stored in `L_1` through `L_M`. Within a word, bits at - lower order come earlier in the stream than those at higher - order. - -The next word after `L_M` (if any) must again be a RLW, for the next -chunk. For efficient appending to the bitstream, the EWAH stores a -pointer to the last RLW in the stream. - - -== Appendix B: Optional Bitmap Sections - -These sections may or may not be present in the `.bitmap` file; their -presence is indicated by the header flags section described above. - -Name-hash cache ---------------- - -If the BITMAP_OPT_HASH_CACHE flag is set, the end of the bitmap contains -a cache of 32-bit values, one per object in the pack. The value at -position `i` is the hash of the pathname at which the `i`th object -(counting in index order) in the pack can be found. 
This can be fed -into the delta heuristics to compare objects with similar pathnames. - -The hash algorithm used is: - - hash = 0; - while ((c = *name++)) - if (!isspace(c)) - hash = (hash >> 2) + (c << 24); - -Note that this hashing scheme is tied to the BITMAP_OPT_HASH_CACHE flag. -If implementations want to choose a different hashing scheme, they are -free to do so, but MUST allocate a new header flag (because comparing -hashes made under two different schemes would be pointless). diff --git a/third_party/git/Documentation/technical/bundle-format.txt b/third_party/git/Documentation/technical/bundle-format.txt deleted file mode 100644 index bac558d049a3..000000000000 --- a/third_party/git/Documentation/technical/bundle-format.txt +++ /dev/null @@ -1,76 +0,0 @@ -= Git bundle v2 format - -The Git bundle format is a format that represents both refs and Git objects. - -== Format - -We will use ABNF notation to define the Git bundle format. See -protocol-common.txt for the details. - -A v2 bundle looks like this: - ----- -bundle = signature *prerequisite *reference LF pack -signature = "# v2 git bundle" LF - -prerequisite = "-" obj-id SP comment LF -comment = *CHAR -reference = obj-id SP refname LF - -pack = ... ; packfile ----- - -A v3 bundle looks like this: - ----- -bundle = signature *capability *prerequisite *reference LF pack -signature = "# v3 git bundle" LF - -capability = "@" key ["=" value] LF -prerequisite = "-" obj-id SP comment LF -comment = *CHAR -reference = obj-id SP refname LF -key = 1*(ALPHA / DIGIT / "-") -value = *(%01-09 / %0b-FF) - -pack = ... ; packfile ----- - -== Semantics - -A Git bundle consists of several parts. - -* "Capabilities", which are only in the v3 format, indicate functionality that - the bundle requires to be read properly. - -* "Prerequisites" lists the objects that are NOT included in the bundle and the - reader of the bundle MUST already have, in order to use the data in the - bundle. 
The objects stored in the bundle may refer to prerequisite objects and - anything reachable from them (e.g. a tree object in the bundle can reference - a blob that is reachable from a prerequisite) and/or be expressed as deltas - against prerequisite objects. - -* "References" record the tips of the history graph, in other words, what the - reader of the bundle CAN "git fetch" from it. - -* "Pack" is the pack data stream "git fetch" would send, if you fetch from a - repository that has the references recorded in the "References" above into a - repository that has references pointing at the objects listed in - "Prerequisites" above. - -In the bundle format, there can be a comment following a prerequisite obj-id. -The comment has no specific meaning. The writer of the bundle MAY -put any string here. The reader of the bundle MUST ignore the comment. - -=== Note on the shallow clone and a Git bundle - -Note that the prerequisites do not represent a shallow-clone boundary. The -semantics of the prerequisites and the shallow-clone boundaries are different, -and the Git bundle v2 format cannot represent a shallow clone repository. - -== Capabilities - -Because there is no opportunity for negotiation, unknown capabilities cause 'git -bundle' to abort. The only known capability is `object-format`, which specifies -the hash algorithm in use, and can take the same values as the -`extensions.objectFormat` configuration value. diff --git a/third_party/git/Documentation/technical/commit-graph-format.txt b/third_party/git/Documentation/technical/commit-graph-format.txt deleted file mode 100644 index b3b58880b926..000000000000 --- a/third_party/git/Documentation/technical/commit-graph-format.txt +++ /dev/null @@ -1,139 +0,0 @@ -Git commit graph format -======================= - -The Git commit graph stores a list of commit OIDs and some associated -metadata, including: - -- The generation number of the commit. 
Commits with no parents have - generation number 1; a commit with parents has a generation number - one more than the maximum generation number of its parents. Zero - is reserved as a special value and can be used to mark a generation - number invalid or as "not computed". - -- The root tree OID. - -- The commit date. - -- The parents of the commit, stored using positional references within - the graph file. - -- The Bloom filter of the commit carrying the paths that were changed between - the commit and its first parent, if requested. - -These positional references are stored as unsigned 32-bit integers -corresponding to the array position within the list of commit OIDs. Due -to some special constants we use to track parents, we can store at most -(1 << 30) + (1 << 29) + (1 << 28) - 1 (around 1.8 billion) commits. - -== Commit graph files have the following format: - -In order to allow extensions that add extra data to the graph, we organize -the body into "chunks" and provide a binary lookup table at the beginning -of the body. The header includes certain values, such as number of chunks -and hash type. - -All multi-byte numbers are in network byte order. - -HEADER: - - 4-byte signature: - The signature is: {'C', 'G', 'P', 'H'} - - 1-byte version number: - Currently, the only valid version is 1. - - 1-byte Hash Version - We infer the hash length (H) from this value: - 1 => SHA-1 - 2 => SHA-256 - If the hash type does not match the repository's hash algorithm, the - commit-graph file should be ignored with a warning presented to the - user. - - 1-byte number (C) of "chunks" - - 1-byte number (B) of base commit-graphs - We infer the length (H*B) of the Base Graphs chunk - from this value. - -CHUNK LOOKUP: - - (C + 1) * 12 bytes listing the table of contents for the chunks: - The first 4 bytes describe the chunk id. Value 0 is a terminating label. - The other 8 bytes provide the byte-offset in the current file at which - the chunk starts. 
(Chunks are ordered contiguously in the file, so you can infer - the length using the next chunk position if necessary.) Each chunk - ID appears at most once. - - The remaining data in the body is described one chunk at a time, and - these chunks may be given in any order. Chunks are required unless - otherwise specified. - -CHUNK DATA: - - OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes) - The ith entry, F[i], stores the number of OIDs with first - byte at most i. Thus F[255] stores the total - number of commits (N). - - OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes) - The OIDs for all commits in the graph, sorted in ascending order. - - Commit Data (ID: {'C', 'D', 'A', 'T' }) (N * (H + 16) bytes) - * The first H bytes are for the OID of the root tree. - * The next 8 bytes are for the positions of the first two parents - of the ith commit. Stores value 0x70000000 if no parent in that - position. If there are more than two parents, the second value - has its most-significant bit on and the other bits store an array - position into the Extra Edge List chunk. - * The next 8 bytes store the generation number of the commit and - the commit time in seconds since EPOCH. The generation number - uses the higher 30 bits of the first 4 bytes, while the commit - time uses the 32 bits of the second 4 bytes, along with the lowest - 2 bits of the lowest byte, storing the 33rd and 34th bit of the - commit time. - - Extra Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional] - This list of 4-byte values store the second through nth parents for - all octopus merges. The second parent value in the commit data stores - an array position within this list along with the most-significant bit - on. Starting at that array position, iterate through this list of commit - positions for the parents until reaching a value with the most-significant - bit on. The other bits correspond to the position of the last parent. 
- - Bloom Filter Index (ID: {'B', 'I', 'D', 'X'}) (N * 4 bytes) [Optional] - * The ith entry, BIDX[i], stores the number of bytes in all Bloom filters - from commit 0 to commit i (inclusive) in lexicographic order. The Bloom - filter for the i-th commit spans from BIDX[i-1] to BIDX[i] (plus header - length), where BIDX[-1] is 0. - * The BIDX chunk is ignored if the BDAT chunk is not present. - - Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional] - * It starts with header consisting of three unsigned 32-bit integers: - - Version of the hash algorithm being used. We currently only support - value 1 which corresponds to the 32-bit version of the murmur3 hash - implemented exactly as described in - https://en.wikipedia.org/wiki/MurmurHash#Algorithm and the double - hashing technique using seed values 0x293ae76f and 0x7e646e2 as - described in https://doi.org/10.1007/978-3-540-30494-4_26 "Bloom Filters - in Probabilistic Verification" - - The number of times a path is hashed and hence the number of bit positions - that cumulatively determine whether a file is present in the commit. - - The minimum number of bits 'b' per entry in the Bloom filter. If the filter - contains 'n' entries, then the filter size is the minimum number of 64-bit - words that contain n*b bits. - * The rest of the chunk is the concatenation of all the computed Bloom - filters for the commits in lexicographic order. - * Note: Commits with no changes or more than 512 changes have Bloom filters - of length one, with either all bits set to zero or one respectively. - * The BDAT chunk is present if and only if BIDX is present. - - Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional] - This list of H-byte hashes describe a set of B commit-graph files that - form a commit-graph chain. The graph position for the ith commit in this - file's OID Lookup chunk is equal to i plus the number of commits in all - base graphs. If B is non-zero, this chunk must exist. 
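The Commit Data layout described above can be illustrated with a small decoder. This is a minimal sketch in Python, not Git's implementation: the function name `decode_commit_data`, the constant names and the returned dictionary keys are made up for illustration, and the entry is assumed to be exactly H + 16 bytes in network byte order as the text states.

```python
import struct

# Values taken from the chunk description; the constant names are illustrative.
PARENT_NONE = 0x70000000       # stored when there is no parent in that slot
EXTRA_EDGES_FLAG = 0x80000000  # most-significant bit of the second parent value


def decode_commit_data(entry, hash_len=20):
    """Decode one (H + 16)-byte Commit Data entry into a dict."""
    if len(entry) != hash_len + 16:
        raise ValueError("entry must be H + 16 bytes long")
    tree_oid = entry[:hash_len]
    # Four big-endian 32-bit words follow the root tree OID.
    parent1, parent2, gen_word, time_low = struct.unpack(">IIII", entry[hash_len:])

    first_parent = None if parent1 == PARENT_NONE else parent1
    second_parent = None
    extra_edge_index = None
    if parent2 == PARENT_NONE:
        pass  # zero or one parent
    elif parent2 & EXTRA_EDGES_FLAG:
        # More than two parents: remaining bits index the Extra Edge List.
        extra_edge_index = parent2 & 0x7FFFFFFF
    else:
        second_parent = parent2

    return {
        "tree": tree_oid.hex(),
        "first_parent": first_parent,
        "second_parent": second_parent,
        "extra_edge_index": extra_edge_index,
        # Generation number: upper 30 bits of the first word.
        "generation": gen_word >> 2,
        # Commit time: lowest 2 bits of the first word are bits 33 and 34.
        "commit_time": ((gen_word & 0x3) << 32) | time_low,
    }
```

For an octopus merge, a reader would then follow `extra_edge_index` into the Extra Edge List and keep reading 4-byte values until one has its most-significant bit set.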
- -TRAILER: - - H-byte HASH-checksum of all of the above. diff --git a/third_party/git/Documentation/technical/commit-graph.txt b/third_party/git/Documentation/technical/commit-graph.txt deleted file mode 100644 index f14a7659aa87..000000000000 --- a/third_party/git/Documentation/technical/commit-graph.txt +++ /dev/null @@ -1,350 +0,0 @@ -Git Commit Graph Design Notes -============================= - -Git walks the commit graph for many reasons, including: - -1. Listing and filtering commit history. -2. Computing merge bases. - -These operations can become slow as the commit count grows. The merge -base calculation shows up in many user-facing commands, such as 'merge-base' -or 'status' and can take minutes to compute depending on history shape. - -There are two main costs here: - -1. Decompressing and parsing commits. -2. Walking the entire graph to satisfy topological order constraints. - -The commit-graph file is a supplemental data structure that accelerates -commit graph walks. If a user downgrades or disables the 'core.commitGraph' -config setting, then the existing ODB is sufficient. The file is stored -as "commit-graph" either in the .git/objects/info directory or in the info -directory of an alternate. - -The commit-graph file stores the commit graph structure along with some -extra metadata to speed up graph walks. By listing commit OIDs in -lexicographic order, we can identify an integer position for each commit -and refer to the parents of a commit using those integer positions. We -use binary search to find initial commits and then use the integer -positions for fast lookups during the walk. - -A consumer may load the following info for a commit from the graph: - -1. The commit OID. -2. The list of parents, along with their integer position. -3. The commit date. -4. The root tree OID. -5. The generation number (see definition below). - -Values 1-4 satisfy the requirements of parse_commit_gently(). 
- -Define the "generation number" of a commit recursively as follows: - - * A commit with no parents (a root commit) has generation number one. - - * A commit with at least one parent has generation number one more than - the largest generation number among its parents. - -Equivalently, the generation number of a commit A is one more than the -length of a longest path from A to a root commit. The recursive definition -is easier to use for computation and observing the following property: - - If A and B are commits with generation numbers N and M, respectively, - and N <= M, then A cannot reach B. That is, we know without searching - that B is not an ancestor of A because it is further from a root commit - than A. - - Conversely, when checking if A is an ancestor of B, then we only need - to walk commits until all commits on the walk boundary have generation - number at most N. If we walk commits using a priority queue seeded by - generation numbers, then we always expand the boundary commit with highest - generation number and can easily detect the stopping condition. - -This property can be used to significantly reduce the time it takes to -walk commits and determine topological relationships. Without generation -numbers, the general heuristic is the following: - - If A and B are commits with commit time X and Y, respectively, and - X < Y, then A _probably_ cannot reach B. - -This heuristic is currently used whenever the computation is allowed to -violate topological relationships due to clock skew (such as "git log" -with default order), but is not used when the topological order is -required (such as merge base calculations, "git log --graph"). - -In practice, we expect some commits to be created recently and not stored -in the commit graph. We can treat these commits as having "infinite" -generation number and walk until reaching commits with known generation -number. 
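The recursive definition and the pruning property above can be sketched as follows. This is an illustrative Python sketch that assumes every commit has a fully computed generation number (no `*_INFINITY` or `*_ZERO` values); the history is modeled as a hypothetical dict mapping each commit name to its list of parents, and the function names are not Git's.

```python
import heapq


def generation_numbers(parents):
    """Compute generation numbers for a DAG given as {commit: [parents]}."""
    gen = {}
    for start in parents:
        stack = [start]
        while stack:
            c = stack[-1]
            if c in gen:
                stack.pop()
                continue
            missing = [p for p in parents[c] if p not in gen]
            if missing:
                stack.extend(missing)  # compute parents first
            else:
                # Root commits get 1; others get 1 + max over parents.
                gen[c] = 1 + max((gen[p] for p in parents[c]), default=0)
                stack.pop()
    return gen


def is_ancestor(a, b, parents, gen):
    """Return (answer, expanded): is `a` reachable from `b` via parent links?

    `expanded` counts how many commits were expanded during the walk.
    """
    if a == b:
        return True, 0
    heap = [(-gen[b], b)]  # max-heap on generation number
    seen = {b}
    expanded = 0
    while heap:
        neg_gen, c = heapq.heappop(heap)
        if -neg_gen <= gen[a]:
            # Every boundary commit now has generation <= gen[a], so none
            # of them can reach `a`: stop without walking further.
            return False, expanded
        expanded += 1
        for p in parents[c]:
            if p == a:
                return True, expanded
            if p not in seen:
                seen.add(p)
                heapq.heappush(heap, (-gen[p], p))
    return False, expanded
```

With a linear history r, a, b, c and a side branch r, x, y, asking whether `b` is an ancestor of `y` returns immediately, because both have generation number 3 and no commit needs to be expanded.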
- -We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not -in the commit-graph file. If a commit-graph file was written by a version -of Git that did not compute generation numbers, then those commits will -have generation number represented by the macro GENERATION_NUMBER_ZERO = 0. - -Since the commit-graph file is closed under reachability, we can guarantee -the following weaker condition on all commits: - - If A and B are commits with generation numbers N and M, respectively, - and N < M, then A cannot reach B. - -Note how the strict inequality differs from the inequality when we have -fully-computed generation numbers. Using strict inequality may result in -walking a few extra commits, but the simplicity in dealing with commits -with generation number *_INFINITY or *_ZERO is valuable. - -We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF for commits whose -generation numbers are computed to be at least this value. We limit at -this value since it is the largest value that can be stored in the -commit-graph file using the 30 bits available to generation numbers. This -presents another case where a commit can have generation number equal to -that of a parent. - -Design Details --------------- - -- The commit-graph file is stored in a file named 'commit-graph' in the - .git/objects/info directory. This could be stored in the info directory - of an alternate. - -- The core.commitGraph config setting must be on to consume graph files. - -- The file format includes parameters for the object ID hash function, - so a future change of hash algorithm does not require a change in format. - -- Commit grafts and replace objects can change the shape of the commit - history. The latter can also be enabled/disabled on the fly using - `--no-replace-objects`. This leads to difficulty storing both possible - interpretations of a commit id, especially when computing generation - numbers.
The commit-graph will not be read or written when - replace-objects or grafts are present. - -- Shallow clones create grafts of commits by dropping their parents. This - leads the commit-graph to think those commits have generation number 1. - If and when those commits are made unshallow, those generation numbers - become invalid. Since shallow clones are intended to restrict the commit - history to a very small set of commits, the commit-graph feature is less - helpful for these clones, anyway. The commit-graph will not be read or - written when shallow commits are present. - -Commit Graphs Chains --------------------- - -Typically, repos grow with near-constant velocity (commits per day). Over time, -the number of commits added by a fetch operation is much smaller than the -number of commits in the full history. By creating a "chain" of commit-graphs, -we enable fast writes of new commit data without rewriting the entire commit -history -- at least, most of the time. - -## File Layout - -A commit-graph chain uses multiple files, and we use a fixed naming convention -to organize these files. Each commit-graph file has a name -`$OBJDIR/info/commit-graphs/graph-{hash}.graph` where `{hash}` is the hex- -valued hash stored in the footer of that file (which is a hash of the file's -contents before that hash). For a chain of commit-graph files, a plain-text -file at `$OBJDIR/info/commit-graphs/commit-graph-chain` contains the -hashes for the files in order from "lowest" to "highest". 
- -For example, if the `commit-graph-chain` file contains the lines - -``` - {hash0} - {hash1} - {hash2} -``` - -then the commit-graph chain looks like the following diagram: - - +-----------------------+ - | graph-{hash2}.graph | - +-----------------------+ - | - +-----------------------+ - | | - | graph-{hash1}.graph | - | | - +-----------------------+ - | - +-----------------------+ - | | - | | - | | - | graph-{hash0}.graph | - | | - | | - | | - +-----------------------+ - -Let X0 be the number of commits in `graph-{hash0}.graph`, X1 be the number of -commits in `graph-{hash1}.graph`, and X2 be the number of commits in -`graph-{hash2}.graph`. If a commit appears in position i in `graph-{hash2}.graph`, -then we interpret this as being the commit in position (X0 + X1 + i), and that -will be used as its "graph position". The commits in `graph-{hash2}.graph` use these -positions to refer to their parents, which may be in `graph-{hash1}.graph` or -`graph-{hash0}.graph`. We can navigate to an arbitrary commit in position j by checking -its containment in the intervals [0, X0), [X0, X0 + X1), [X0 + X1, X0 + X1 + -X2). - -Each commit-graph file (except the base, `graph-{hash0}.graph`) contains data -specifying the hashes of all files in the lower layers. In the above example, -`graph-{hash1}.graph` contains `{hash0}` while `graph-{hash2}.graph` contains -`{hash0}` and `{hash1}`. - -## Merging commit-graph files - -If we only added a new commit-graph file on every write, we would run into a -linear search problem through many commit-graph files. Instead, we use a merge -strategy to decide when the stack should collapse some number of levels. - -The diagram below shows such a collapse. As a set of new commits are added, it -is determined by the merge strategy that the files should collapse to -`graph-{hash1}`. Thus, the new commits, the commits in `graph-{hash2}` and -the commits in `graph-{hash1}` should be combined into a new `graph-{hash3}` -file. 
- - +---------------------+ - | | - | (new commits) | - | | - +---------------------+ - | | - +-----------------------+ +---------------------+ - | graph-{hash2} |->| | - +-----------------------+ +---------------------+ - | | | - +-----------------------+ +---------------------+ - | | | | - | graph-{hash1} |->| | - | | | | - +-----------------------+ +---------------------+ - | tmp_graphXXX - +-----------------------+ - | | - | | - | | - | graph-{hash0} | - | | - | | - | | - +-----------------------+ - -During this process, the commits to write are combined, sorted and we write the -contents to a temporary file, all while holding a `commit-graph-chain.lock` -lock-file. When the file is flushed, we rename it to `graph-{hash3}` -according to the computed `{hash3}`. Finally, we write the new chain data to -`commit-graph-chain.lock`: - -``` - {hash3} - {hash0} -``` - -We then close the lock-file. - -## Merge Strategy - -When writing a set of commits that do not exist in the commit-graph stack of -height N, we default to creating a new file at level N + 1. We then decide to -merge with the Nth level if one of two conditions hold: - - 1. `--size-multiple=<X>` is specified or X = 2, and the number of commits in - level N is less than X times the number of commits in level N + 1. - - 2. `--max-commits=<C>` is specified with non-zero C and the number of commits - in level N + 1 is more than C commits. - -This decision cascades down the levels: when we merge a level we create a new -set of commits that then compares to the next level. - -The first condition bounds the number of levels to be logarithmic in the total -number of commits. The second condition bounds the total number of commits in -a `graph-{hashN}` file and not in the `commit-graph` file, preventing -significant performance issues when the stack merges and another process only -partially reads the previous stack. 
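The cascading merge decision described above can be sketched in a few lines. This is a simplified Python model, not Git's code: the function name `plan_write` is hypothetical, levels are represented only by their commit counts (bottom level first), the merged size is assumed to be the plain sum of the merged levels, and the defaults of 2 and 64,000 are the values the document gives for the size multiple and maximum commit count.

```python
def plan_write(levels, new_commits, size_multiple=2, max_commits=64000):
    """Return the commit counts per level after writing `new_commits`.

    `levels` lists commit counts from the lowest level to the highest.
    """
    remaining = list(levels)
    top = new_commits  # tentative new file at level N + 1
    while remaining:
        below = remaining[-1]
        # Condition 1: level N is smaller than X times level N + 1.
        too_small = below < size_multiple * top
        # Condition 2: level N + 1 holds more than C commits (C non-zero).
        too_big = bool(max_commits) and top > max_commits
        if not (too_small or too_big):
            break
        # Merge with the level below, then compare against the next one.
        top += remaining.pop()
    return remaining + [top]
```

For example, writing 60 commits on top of levels of 1000 and 100 commits merges only the top level, because 100 is less than 2 * 60 while 1000 is not less than 2 * 160.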
- -The merge strategy values (2 for the size multiple, 64,000 for the maximum -number of commits) could be extracted into config settings for full -flexibility. - -## Deleting graph-{hash} files - -After a new tip file is written, some `graph-{hash}` files may no longer -be part of a chain. It is important to remove these files from disk, eventually. -The main reason to delay removal is that another process could read the -`commit-graph-chain` file before it is rewritten, but then look for the -`graph-{hash}` files after they are deleted. - -To allow holding old split commit-graphs for a while after they are unreferenced, -we update the modified times of the files when they become unreferenced. Then, -we scan the `$OBJDIR/info/commit-graphs/` directory for `graph-{hash}` -files whose modified times are older than a given expiry window. This window -defaults to zero, but can be changed using command-line arguments or a config -setting. - -## Chains across multiple object directories - -In a repo with alternates, we look for the `commit-graph-chain` file starting -in the local object directory and then in each alternate. The first file that -exists defines our chain. As we look for the `graph-{hash}` files for -each `{hash}` in the chain file, we follow the same pattern for the host -directories. - -This allows commit-graphs to be split across multiple forks in a fork network. -The typical case is a large "base" repo with many smaller forks. - -As the base repo advances, it will likely update and merge its commit-graph -chain more frequently than the forks. If a fork updates their commit-graph after -the base repo, then it should "reparent" the commit-graph chain onto the new -chain in the base repo. When reading each `graph-{hash}` file, we track -the object directory containing it. 
During a write of a new commit-graph file, -we check for any changes in the source object directory and read the -`commit-graph-chain` file for that source and create a new file based on those -files. During this "reparent" operation, we necessarily need to collapse all -levels in the fork, as all of the files are invalid against the new base file. - -It is crucial to be careful when cleaning up "unreferenced" `graph-{hash}.graph` -files in this scenario. It falls to the user to define the proper settings for -their custom environment: - - 1. When merging levels in the base repo, the unreferenced files may still be - referenced by chains from fork repos. - - 2. The expiry time should be set to a length of time such that every fork has - time to recompute their commit-graph chain to "reparent" onto the new base - file(s). - - 3. If the commit-graph chain is updated in the base, the fork will not have - access to the new chain until its chain is updated to reference those files. - (This may change in the future [5].) - -Related Links -------------- -[0] https://bugs.chromium.org/p/git/issues/detail?id=8 - Chromium work item for: Serialized Commit Graph - -[1] https://lore.kernel.org/git/20110713070517.GC18566@sigill.intra.peff.net/ - An abandoned patch that introduced generation numbers. - -[2] https://lore.kernel.org/git/20170908033403.q7e6dj7benasrjes@sigill.intra.peff.net/ - Discussion about generation numbers on commits and how they interact - with fsck. - -[3] https://lore.kernel.org/git/20170908034739.4op3w4f2ma5s65ku@sigill.intra.peff.net/ - More discussion about generation numbers and not storing them inside - commit objects. A valuable quote: - - "I think we should be moving more in the direction of keeping - repo-local caches for optimizations. Reachability bitmaps have been - a big performance win. I think we should be doing the same with our - properties of commits. 
Not just generation numbers, but making it - cheap to access the graph structure without zlib-inflating whole - commit objects (i.e., packv4 or something like the "metapacks" I - proposed a few years ago)." - -[4] https://lore.kernel.org/git/20180108154822.54829-1-git@jeffhostetler.com/T/#u - A patch to remove the ahead-behind calculation from 'status'. - -[5] https://lore.kernel.org/git/f27db281-abad-5043-6d71-cbb083b1c877@gmail.com/ - A discussion of a "two-dimensional graph position" that can allow reading - multiple commit-graph chains at the same time. diff --git a/third_party/git/Documentation/technical/directory-rename-detection.txt b/third_party/git/Documentation/technical/directory-rename-detection.txt deleted file mode 100644 index 844629c8c441..000000000000 --- a/third_party/git/Documentation/technical/directory-rename-detection.txt +++ /dev/null @@ -1,115 +0,0 @@ -Directory rename detection -========================== - -Rename detection logic in diffcore-rename that checks for renames of -individual files is aggregated and analyzed in merge-recursive for cases -where combinations of renames indicate that a full directory has been -renamed. - -Scope of abilities ------------------- - -It is perhaps easiest to start with an example: - - * When all of x/a, x/b and x/c have moved to z/a, z/b and z/c, it is - likely that x/d added in the meantime would also want to move to z/d by - taking the hint that the entire directory 'x' moved to 'z'. - -More interesting possibilities exist, though, such as: - - * one side of history renames x -> z, and the other renames some file to - x/e, causing the need for the merge to do a transitive rename. - - * one side of history renames x -> z, but also renames all files within x. - For example, x/a -> z/alpha, x/b -> z/bravo, etc. - - * both 'x' and 'y' being merged into a single directory 'z', with a - directory rename being detected for both x->z and y->z. 
- - * not all files in a directory being renamed to the same location; - i.e. perhaps most the files in 'x' are now found under 'z', but a few - are found under 'w'. - - * a directory being renamed, which also contained a subdirectory that was - renamed to some entirely different location. (And perhaps the inner - directory itself contained inner directories that were renamed to yet - other locations). - - * combinations of the above; see t/t6043-merge-rename-directories.sh for - various interesting cases. - -Limitations -- applicability of directory renames -------------------------------------------------- - -In order to prevent edge and corner cases resulting in either conflicts -that cannot be represented in the index or which might be too complex for -users to try to understand and resolve, a couple basic rules limit when -directory rename detection applies: - - 1) If a given directory still exists on both sides of a merge, we do - not consider it to have been renamed. - - 2) If a subset of to-be-renamed files have a file or directory in the - way (or would be in the way of each other), "turn off" the directory - rename for those specific sub-paths and report the conflict to the - user. - - 3) If the other side of history did a directory rename to a path that - your side of history renamed away, then ignore that particular - rename from the other side of history for any implicit directory - renames (but warn the user). - -Limitations -- detailed rules and testcases -------------------------------------------- - -t/t6043-merge-rename-directories.sh contains extensive tests and commentary -which generate and explore the rules listed above. It also lists a few -additional rules: - - a) If renames split a directory into two or more others, the directory - with the most renames, "wins". - - b) Avoid directory-rename-detection for a path, if that path is the - source of a rename on either side of a merge. 
- - c) Only apply implicit directory renames to directories if the other side - of history is the one doing the renaming. - -Limitations -- support in different commands --------------------------------------------- - -Directory rename detection is supported by 'merge' and 'cherry-pick'. -Other git commands which users might be surprised to see limited or no -directory rename detection support in: - - * diff - - Folks have requested in the past that `git diff` detect directory - renames and somehow simplify its output. It is not clear whether this - would be desirable or how the output should be simplified, so this was - simply not implemented. Further, to implement this, directory rename - detection logic would need to move from merge-recursive to - diffcore-rename. - - * am - - git-am tries to avoid a full three way merge, instead calling - git-apply. That prevents us from detecting renames at all, which may - defeat the directory rename detection. There is a fallback, though; if - the initial git-apply fails and the user has specified the -3 option, - git-am will fall back to a three way merge. However, git-am lacks the - necessary information to do a "real" three way merge. Instead, it has - to use build_fake_ancestor() to get a merge base that is missing files - whose rename may have been important to detect for directory rename - detection to function. - - * rebase - - Since am-based rebases work by first generating a bunch of patches - (which no longer record what the original commits were and thus don't - have the necessary info from which we can find a real merge-base), and - then calling git-am, this implies that am-based rebases will not always - successfully detect directory renames either (see the 'am' section - above). Merge-based rebases (rebase -m) and cherry-pick-based rebases - (rebase -i) are not affected by this shortcoming, and fully support - directory rename detection.
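The basic idea from the opening example (x/a, x/b and x/c moving to z/ implies x/d should follow) together with rule (a), where the directory with the most renames wins, can be sketched as below. This is a deliberately simplified Python illustration, not the merge-recursive logic: the function names are hypothetical, only the immediate parent directory of each file is considered, the other numbered and lettered rules are ignored, and treating a tie as "no directory rename" is an assumption of this sketch.

```python
import posixpath
from collections import Counter, defaultdict


def infer_directory_renames(file_renames):
    """Map old directory -> new directory from {old_path: new_path}."""
    votes = defaultdict(Counter)
    for old, new in file_renames.items():
        old_dir, new_dir = posixpath.dirname(old), posixpath.dirname(new)
        if old_dir != new_dir:
            votes[old_dir][new_dir] += 1
    result = {}
    for old_dir, counter in votes.items():
        (best, count), *rest = counter.most_common(2)
        if rest and rest[0][1] == count:
            continue  # tie: assumed ambiguous, no implicit rename
        result[old_dir] = best  # the target with the most renames wins
    return result


def apply_directory_rename(path, dir_renames):
    """Move a newly added file along with its renamed directory, if any."""
    directory, base = posixpath.split(path)
    if directory in dir_renames:
        return posixpath.join(dir_renames[directory], base)
    return path
```

Given renames x/a to z/a, x/b to z/b and x/c to w/c, the sketch infers x to z, so a file added as x/d on the other side of history would be placed at z/d.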
diff --git a/third_party/git/Documentation/technical/hash-function-transition.txt b/third_party/git/Documentation/technical/hash-function-transition.txt deleted file mode 100644 index 6fd20ebbc254..000000000000 --- a/third_party/git/Documentation/technical/hash-function-transition.txt +++ /dev/null @@ -1,827 +0,0 @@ -Git hash function transition -============================ - -Objective ---------- -Migrate Git from SHA-1 to a stronger hash function. - -Background ----------- -At its core, the Git version control system is a content addressable -filesystem. It uses the SHA-1 hash function to name content. For -example, files, directories, and revisions are referred to by hash -values unlike in other traditional version control systems where files -or versions are referred to via sequential numbers. The use of a hash -function to address its content delivers a few advantages: - -* Integrity checking is easy. Bit flips, for example, are easily - detected, as the hash of corrupted content does not match its name. -* Lookup of objects is fast. - -Using a cryptographically secure hash function brings additional -advantages: - -* Object names can be signed and third parties can trust the hash to - address the signed object and all objects it references. -* Communication using Git protocol and out of band communication - methods have a short reliable string that can be used to reliably - address stored content. - -Over time some flaws in SHA-1 have been discovered by security -researchers. On 23 February 2017 the SHAttered attack -(https://shattered.io) demonstrated a practical SHA-1 hash collision. - -Git v2.13.0 and later subsequently moved to a hardened SHA-1 -implementation by default, which isn't vulnerable to the SHAttered -attack. 
- -Thus Git has in effect already migrated to a new hash that isn't SHA-1 -and doesn't share its vulnerabilities, its new hash function just -happens to produce exactly the same output for all known inputs, -except two PDFs published by the SHAttered researchers, and the new -implementation (written by those researchers) claims to detect future -cryptanalytic collision attacks. - -Regardless, it's considered prudent to move past any variant of SHA-1 -to a new hash. There's no guarantee that future attacks on SHA-1 won't -be published in the future, and those attacks may not have viable -mitigations. - -If SHA-1 and its variants were to be truly broken, Git's hash function -could not be considered cryptographically secure any more. This would -impact the communication of hash values because we could not trust -that a given hash value represented the known good version of content -that the speaker intended. - -SHA-1 still possesses the other properties such as fast object lookup -and safe error checking, but other hash functions are equally suitable -that are believed to be cryptographically secure. - -Goals ------ -1. The transition to SHA-256 can be done one local repository at a time. - a. Requiring no action by any other party. - b. A SHA-256 repository can communicate with SHA-1 Git servers - (push/fetch). - c. Users can use SHA-1 and SHA-256 identifiers for objects - interchangeably (see "Object names on the command line", below). - d. New signed objects make use of a stronger hash function than - SHA-1 for their security guarantees. -2. Allow a complete transition away from SHA-1. - a. Local metadata for SHA-1 compatibility can be removed from a - repository if compatibility with SHA-1 is no longer needed. -3. Maintainability throughout the process. - a. The object format is kept simple and consistent. - b. Creation of a generalized repository conversion tool. - -Non-Goals ---------- -1. Add SHA-256 support to Git protocol. 
This is valuable and the - logical next step but it is out of scope for this initial design. -2. Transparently improving the security of existing SHA-1 signed - objects. -3. Intermixing objects using multiple hash functions in a single - repository. -4. Taking the opportunity to fix other bugs in Git's formats and - protocols. -5. Shallow clones and fetches into a SHA-256 repository. (This will - change when we add SHA-256 support to Git protocol.) -6. Skip fetching some submodules of a project into a SHA-256 - repository. (This also depends on SHA-256 support in Git - protocol.) - -Overview --------- -We introduce a new repository format extension. Repositories with this -extension enabled use SHA-256 instead of SHA-1 to name their objects. -This affects both object names and object content --- both the names -of objects and all references to other objects within an object are -switched to the new hash function. - -SHA-256 repositories cannot be read by older versions of Git. - -Alongside the packfile, a SHA-256 repository stores a bidirectional -mapping between SHA-256 and SHA-1 object names. The mapping is generated -locally and can be verified using "git fsck". Object lookups use this -mapping to allow naming objects using either their SHA-1 and SHA-256 names -interchangeably. - -"git cat-file" and "git hash-object" gain options to display an object -in its sha1 form and write an object given its sha1 form. This -requires all objects referenced by that object to be present in the -object database so that they can be named using the appropriate name -(using the bidirectional hash mapping). - -Fetches from a SHA-1 based server convert the fetched objects into -SHA-256 form and record the mapping in the bidirectional mapping table -(see below for details). Pushes to a SHA-1 based server convert the -objects being pushed into sha1 form so the server does not have to be -aware of the hash function the client is using. 
- -Detailed Design ---------------- -Repository format extension -~~~~~~~~~~~~~~~~~~~~~~~~~~~ -A SHA-256 repository uses repository format version `1` (see -Documentation/technical/repository-version.txt) with extensions -`objectFormat` and `compatObjectFormat`: - - [core] - repositoryFormatVersion = 1 - [extensions] - objectFormat = sha256 - compatObjectFormat = sha1 - -The combination of setting `core.repositoryFormatVersion=1` and -populating `extensions.*` ensures that all versions of Git later than -`v0.99.9l` will die instead of trying to operate on the SHA-256 -repository, instead producing an error message. - - # Between v0.99.9l and v2.7.0 - $ git status - fatal: Expected git repo version <= 0, found 1 - # After v2.7.0 - $ git status - fatal: unknown repository extensions found: - objectformat - compatobjectformat - -See the "Transition plan" section below for more details on these -repository extensions. - -Object names -~~~~~~~~~~~~ -Objects can be named by their 40 hexadecimal digit sha1-name or 64 -hexadecimal digit sha256-name, plus names derived from those (see -gitrevisions(7)). - -The sha1-name of an object is the SHA-1 of the concatenation of its -type, length, a nul byte, and the object's sha1-content. This is the -traditional <sha1> used in Git to name objects. - -The sha256-name of an object is the SHA-256 of the concatenation of its -type, length, a nul byte, and the object's sha256-content. - -Object format -~~~~~~~~~~~~~ -The content as a byte sequence of a tag, commit, or tree object named -by sha1 and sha256 differ because an object named by sha256-name refers to -other objects by their sha256-names and an object named by sha1-name -refers to other objects by their sha1-names. - -The sha256-content of an object is the same as its sha1-content, except -that objects referenced by the object are named using their sha256-names -instead of sha1-names. 
Because a blob object does not refer to any -other object, its sha1-content and sha256-content are the same. - -The format allows round-trip conversion between sha256-content and -sha1-content. - -Object storage -~~~~~~~~~~~~~~ -Loose objects use zlib compression and packed objects use the packed -format described in Documentation/technical/pack-format.txt, just like -today. The content that is compressed and stored uses sha256-content -instead of sha1-content. - -Pack index -~~~~~~~~~~ -Pack index (.idx) files use a new v3 format that supports multiple -hash functions. They have the following format (all integers are in -network byte order): - -- A header appears at the beginning and consists of the following: - - The 4-byte pack index signature: '\377tOc' - - 4-byte version number: 3 - - 4-byte length of the header section, including the signature and - version number - - 4-byte number of objects contained in the pack - - 4-byte number of object formats in this pack index: 2 - - For each object format: - - 4-byte format identifier (e.g., 'sha1' for SHA-1) - - 4-byte length in bytes of shortened object names. This is the - shortest possible length needed to make names in the shortened - object name table unambiguous. - - 4-byte integer, recording where tables relating to this format - are stored in this index file, as an offset from the beginning. - - 4-byte offset to the trailer from the beginning of this file. - - Zero or more additional key/value pairs (4-byte key, 4-byte - value). Only one key is supported: 'PSRC'. See the "Loose objects - and unreachable objects" section for supported values and how this - is used. All other keys are reserved. Readers must ignore - unrecognized keys. -- Zero or more NUL bytes. This can optionally be used to improve the - alignment of the full object name table below. -- Tables for the first object format: - - A sorted table of shortened object names. 
These are prefixes of - the names of all objects in this pack file, packed together - without offset values to reduce the cache footprint of the binary - search for a specific object name. - - - A table of full object names in pack order. This allows resolving - a reference to "the nth object in the pack file" (from a - reachability bitmap or from the next table of another object - format) to its object name. - - - A table of 4-byte values mapping object name order to pack order. - For an object in the table of sorted shortened object names, the - value at the corresponding index in this table is the index in the - previous table for that same object. - - This can be used to look up the object in reachability bitmaps or - to look up its name in another object format. - - - A table of 4-byte CRC32 values of the packed object data, in the - order that the objects appear in the pack file. This is to allow - compressed data to be copied directly from pack to pack during - repacking without undetected data corruption. - - - A table of 4-byte offset values. For an object in the table of - sorted shortened object names, the value at the corresponding - index in this table indicates where that object can be found in - the pack file. These are usually 31-bit pack file offsets, but - large offsets are encoded as an index into the next table with the - most significant bit set. - - - A table of 8-byte offset entries (empty for pack files less than - 2 GiB). Pack files are organized with heavily used objects toward - the front, so most object references should not need to refer to - this table. -- Zero or more NUL bytes. -- Tables for the second object format, with the same layout as above, - up to and not including the table of CRC32 values. -- Zero or more NUL bytes. -- The trailer consists of the following: - - A copy of the 32-byte SHA-256 checksum at the end of the - corresponding packfile. - - - 32-byte SHA-256 checksum of all of the above. 
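The header layout listed above can be sketched with Python's `struct` module. This is only an illustration under stated assumptions: the magic is taken to be the four bytes that existing .idx files use, the header length is assumed to count its own 4-byte field, the `s256` identifier and the `XTRA` key are hypothetical, and the offsets in the example are arbitrary.

```python
import struct

SIGNATURE = b"\xfftOc"  # assumption: same magic as existing .idx files
FMT_SHA1 = b"sha1"      # identifier given as an example in the text
FMT_SHA256 = b"s256"    # hypothetical identifier, not taken from the text


def build_header(num_objects, formats, trailer_offset, pairs=()):
    """Serialize a header; formats holds (id, short_len, tables_offset)."""
    body = struct.pack(">II", num_objects, len(formats))
    for fmt_id, short_len, tables_offset in formats:
        body += struct.pack(">4sII", fmt_id, short_len, tables_offset)
    body += struct.pack(">I", trailer_offset)
    for key, value in pairs:
        body += struct.pack(">4sI", key, value)
    # Assumed: the length covers signature, version and the length field.
    header_len = 12 + len(body)
    return SIGNATURE + struct.pack(">II", 3, header_len) + body


def parse_header(data):
    """Parse the fixed fields, the per-format records and the PSRC pair."""
    sig, version, header_len, num_objects, num_formats = struct.unpack_from(
        ">4sIIII", data, 0)
    if sig != SIGNATURE or version != 3:
        raise ValueError("not a v3 pack index")
    pos = 20
    formats = []
    for _ in range(num_formats):
        formats.append(struct.unpack_from(">4sII", data, pos))
        pos += 12
    (trailer_offset,) = struct.unpack_from(">I", data, pos)
    pos += 4
    pack_source = None
    while pos + 8 <= header_len:
        key, value = struct.unpack_from(">4sI", data, pos)
        if key == b"PSRC":  # unrecognized keys are skipped
            pack_source = value
        pos += 8
    return {
        "num_objects": num_objects,
        "formats": formats,
        "trailer_offset": trailer_offset,
        "pack_source": pack_source,
    }


header = build_header(
    num_objects=3,
    formats=[(FMT_SHA256, 4, 64), (FMT_SHA1, 4, 200)],
    trailer_offset=300,
    pairs=[(b"PSRC", 4), (b"XTRA", 9)],
)
print(parse_header(header))
```

All fields use big-endian (`>`) packing to match the network byte order required above.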
- -Loose object index -~~~~~~~~~~~~~~~~~~ -A new file $GIT_OBJECT_DIR/loose-object-idx contains information about -all loose objects. Its format is - - # loose-object-idx - (sha256-name SP sha1-name LF)* - -where the object names are in hexadecimal format. The file is not -sorted. - -The loose object index is protected against concurrent writes by a -lock file $GIT_OBJECT_DIR/loose-object-idx.lock. To add a new loose -object: - -1. Write the loose object to a temporary file, like today. -2. Open loose-object-idx.lock with O_CREAT | O_EXCL to acquire the lock. -3. Rename the loose object into place. -4. Open loose-object-idx with O_APPEND and write the new object -5. Unlink loose-object-idx.lock to release the lock. - -To remove entries (e.g. in "git pack-refs" or "git-prune"): - -1. Open loose-object-idx.lock with O_CREAT | O_EXCL to acquire the - lock. -2. Write the new content to loose-object-idx.lock. -3. Unlink any loose objects being removed. -4. Rename to replace loose-object-idx, releasing the lock. - -Translation table -~~~~~~~~~~~~~~~~~ -The index files support a bidirectional mapping between sha1-names -and sha256-names. The lookup proceeds similarly to ordinary object -lookups. For example, to convert a sha1-name to a sha256-name: - - 1. Look for the object in idx files. If a match is present in the - idx's sorted list of truncated sha1-names, then: - a. Read the corresponding entry in the sha1-name order to pack - name order mapping. - b. Read the corresponding entry in the full sha1-name table to - verify we found the right object. If it is, then - c. Read the corresponding entry in the full sha256-name table. - That is the object's sha256-name. - 2. Check for a loose object. Read lines from loose-object-idx until - we find a match. - -Step (1) takes the same amount of time as an ordinary object lookup: -O(number of packs * log(objects per pack)). Step (2) takes O(number of -loose objects) time. 
To maintain good performance it will be necessary -to keep the number of loose objects low. See the "Loose objects and -unreachable objects" section below for more details. - -Since all operations that make new objects (e.g., "git commit") add -the new objects to the corresponding index, this mapping is possible -for all objects in the object store. - -Reading an object's sha1-content -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -The sha1-content of an object can be read by converting all sha256-names -its sha256-content references to sha1-names using the translation table. - -Fetch -~~~~~ -Fetching from a SHA-1 based server requires translating between SHA-1 -and SHA-256 based representations on the fly. - -SHA-1s named in the ref advertisement that are present on the client -can be translated to SHA-256 and looked up as local objects using the -translation table. - -Negotiation proceeds as today. Any "have"s generated locally are -converted to SHA-1 before being sent to the server, and SHA-1s -mentioned by the server are converted to SHA-256 when looking them up -locally. - -After negotiation, the server sends a packfile containing the -requested objects. We convert the packfile to SHA-256 format using -the following steps: - -1. index-pack: inflate each object in the packfile and compute its - SHA-1. Objects can contain deltas in OBJ_REF_DELTA format against - objects the client has locally. These objects can be looked up - using the translation table and their sha1-content read as - described above to resolve the deltas. -2. topological sort: starting at the "want"s from the negotiation - phase, walk through objects in the pack and emit a list of them, - excluding blobs, in reverse topologically sorted order, with each - object coming later in the list than all objects it references. - (This list only contains objects reachable from the "wants". If the - pack from the server contained additional extraneous objects, then - they will be discarded.) -3. 
convert to sha256: open a new (sha256) packfile. Read the topologically - sorted list just generated. For each object, inflate its - sha1-content, convert to sha256-content, and write it to the sha256 - pack. Record the new sha1<->sha256 mapping entry for use in the idx. -4. sort: reorder entries in the new pack to match the order of objects - in the pack the server generated and include blobs. Write a sha256 idx - file -5. clean up: remove the SHA-1 based pack file, index, and - topologically sorted list obtained from the server in steps 1 - and 2. - -Step 3 requires every object referenced by the new object to be in the -translation table. This is why the topological sort step is necessary. - -As an optimization, step 1 could write a file describing what non-blob -objects each object it has inflated from the packfile references. This -makes the topological sort in step 2 possible without inflating the -objects in the packfile for a second time. The objects need to be -inflated again in step 3, for a total of two inflations. - -Step 4 is probably necessary for good read-time performance. "git -pack-objects" on the server optimizes the pack file for good data -locality (see Documentation/technical/pack-heuristics.txt). - -Details of this process are likely to change. It will take some -experimenting to get this to perform well. - -Push -~~~~ -Push is simpler than fetch because the objects referenced by the -pushed objects are already in the translation table. The sha1-content -of each object being pushed can be read as described in the "Reading -an object's sha1-content" section to generate the pack written by git -send-pack. - -Signed Commits -~~~~~~~~~~~~~~ -We add a new field "gpgsig-sha256" to the commit object format to allow -signing commits without relying on SHA-1. It is similar to the -existing "gpgsig" field. Its signed payload is the sha256-content of the -commit object with any "gpgsig" and "gpgsig-sha256" fields removed. 
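As a rough illustration of how such a signed payload could be derived, the Python sketch below drops both signature fields from a commit's header block. It assumes that multi-line header values are continued on lines beginning with a space, as the existing "gpgsig" field is; the function name `signed_payload` and the sample commit are hypothetical.

```python
SIGNATURE_FIELDS = (b"gpgsig", b"gpgsig-sha256")


def signed_payload(commit: bytes) -> bytes:
    """Return the commit content with both signature fields removed."""
    headers, sep, body = commit.partition(b"\n\n")
    kept = []
    skipping = False
    for line in headers.split(b"\n"):
        if line.startswith(b" "):
            # Continuation line: shares the fate of the header it extends.
            if not skipping:
                kept.append(line)
            continue
        name = line.split(b" ", 1)[0]
        skipping = name in SIGNATURE_FIELDS
        if not skipping:
            kept.append(line)
    return b"\n".join(kept) + sep + body


commit = (
    b"tree 1111\n"
    b"author A <a@example.com> 1 +0000\n"
    b"committer A <a@example.com> 1 +0000\n"
    b"gpgsig -----BEGIN PGP SIGNATURE-----\n"
    b" abc\n"
    b" -----END PGP SIGNATURE-----\n"
    b"gpgsig-sha256 -----BEGIN PGP SIGNATURE-----\n"
    b" def\n"
    b" -----END PGP SIGNATURE-----\n"
    b"\n"
    b"Subject line\n"
)
print(signed_payload(commit).decode())
```

The same function works on either the sha1-content or the sha256-content of a commit, since it only looks at field names.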
- -This means commits can be signed -1. using SHA-1 only, as in existing signed commit objects -2. using both SHA-1 and SHA-256, by using both gpgsig-sha256 and gpgsig - fields. -3. using only SHA-256, by only using the gpgsig-sha256 field. - -Old versions of "git verify-commit" can verify the gpgsig signature in -cases (1) and (2) without modifications and view case (3) as an -ordinary unsigned commit. - -Signed Tags -~~~~~~~~~~~ -We add a new field "gpgsig-sha256" to the tag object format to allow -signing tags without relying on SHA-1. Its signed payload is the -sha256-content of the tag with its gpgsig-sha256 field and "-----BEGIN PGP -SIGNATURE-----" delimited in-body signature removed. - -This means tags can be signed -1. using SHA-1 only, as in existing signed tag objects -2. using both SHA-1 and SHA-256, by using gpgsig-sha256 and an in-body - signature. -3. using only SHA-256, by only using the gpgsig-sha256 field. - -Mergetag embedding -~~~~~~~~~~~~~~~~~~ -The mergetag field in the sha1-content of a commit contains the -sha1-content of a tag that was merged by that commit. - -The mergetag field in the sha256-content of the same commit contains the -sha256-content of the same tag. - -Submodules -~~~~~~~~~~ -To convert recorded submodule pointers, you need to have the converted -submodule repository in place. The translation table of the submodule -can be used to look up the new hash. - -Loose objects and unreachable objects -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Fast lookups in the loose-object-idx require that the number of loose -objects not grow too high. - -"git gc --auto" currently waits for there to be 6700 loose objects -present before consolidating them into a packfile. We will need to -measure to find a more appropriate threshold for it to use. - -"git gc --auto" currently waits for there to be 50 packs present -before combining packfiles. Packing loose objects more aggressively -may cause the number of pack files to grow too quickly. 
This can be -mitigated by using a strategy similar to Martin Fick's exponential -rolling garbage collection script: -https://gerrit-review.googlesource.com/c/gerrit/+/35215 - -"git gc" currently expels any unreachable objects it encounters in -pack files to loose objects in an attempt to prevent a race when -pruning them (in case another process is simultaneously writing a new -object that refers to the about-to-be-deleted object). This leads to -an explosion in the number of loose objects present and disk space -usage due to the objects in delta form being replaced with independent -loose objects. Worse, the race is still present for loose objects. - -Instead, "git gc" will need to move unreachable objects to a new -packfile marked as UNREACHABLE_GARBAGE (using the PSRC field; see -below). To avoid the race when writing new objects referring to an -about-to-be-deleted object, code paths that write new objects will -need to copy any objects from UNREACHABLE_GARBAGE packs that they -refer to into new, non-UNREACHABLE_GARBAGE packs (or loose objects). -UNREACHABLE_GARBAGE packs are then safe to delete if their creation -time (as indicated by the file's mtime) is long enough ago. - -To avoid a proliferation of UNREACHABLE_GARBAGE packs, they can be -combined under certain circumstances. If "gc.garbageTtl" is set to -greater than one day, then packs created within a single calendar day, -UTC, can be coalesced together. The resulting packfile would have an -mtime before midnight on that day, so this makes the effective maximum -ttl the garbageTtl + 1 day. If "gc.garbageTtl" is less than one day, -then we divide the calendar day into intervals one-third of that ttl -in duration. Packs created within the same interval can be coalesced -together. The resulting packfile would have an mtime before the end of -the interval, so this makes the effective maximum ttl equal to the -garbageTtl * 4/3. 
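The arithmetic of this coalescing rule can be checked with a short Python sketch. The function name `coalesce_window` is hypothetical, and the text does not say what happens when the ttl is exactly one day; the sketch arbitrarily treats that case like the shorter ttls.

```python
from datetime import timedelta

ONE_DAY = timedelta(days=1)


def coalesce_window(garbage_ttl: timedelta):
    """Return (coalescing interval, effective maximum ttl)."""
    if garbage_ttl > ONE_DAY:
        interval = ONE_DAY          # packs from one calendar day merge
    else:
        interval = garbage_ttl / 3  # assumption: also used for exactly 1 day
    # A merged pack can be up to one interval newer than its oldest member.
    return interval, garbage_ttl + interval


for ttl in (timedelta(days=14), timedelta(hours=6)):
    interval, effective = coalesce_window(ttl)
    print(ttl, interval, effective)
```

With a 14-day ttl the effective maximum is 15 days; with a 6-hour ttl the interval is 2 hours and the effective maximum is 8 hours, which is 4/3 of the ttl.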
- -This rule comes from Thirumala Reddy Mutchukota's JGit change -https://git.eclipse.org/r/90465. - -The UNREACHABLE_GARBAGE setting goes in the PSRC field of the pack -index. More generally, that field indicates where a pack came from: - - - 1 (PACK_SOURCE_RECEIVE) for a pack received over the network - - 2 (PACK_SOURCE_AUTO) for a pack created by a lightweight - "gc --auto" operation - - 3 (PACK_SOURCE_GC) for a pack created by a full gc - - 4 (PACK_SOURCE_UNREACHABLE_GARBAGE) for potential garbage - discovered by gc - - 5 (PACK_SOURCE_INSERT) for locally created objects that were - written directly to a pack file, e.g. from "git add ." - -This information can be useful for debugging and for "gc --auto" to -make appropriate choices about which packs to coalesce. - -Caveats -------- -Invalid objects -~~~~~~~~~~~~~~~ -The conversion from sha1-content to sha256-content retains any -brokenness in the original object (e.g., tree entry modes encoded with -leading 0, tree objects whose paths are not sorted correctly, and -commit objects without an author or committer). This is a deliberate -feature of the design to allow the conversion to round-trip. - -More profoundly broken objects (e.g., a commit with a truncated "tree" -header line) cannot be converted but were not usable by current Git -anyway. - -Shallow clone and submodules -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Because it requires all referenced objects to be available in the -locally generated translation table, this design does not support -shallow clone or unfetched submodules. Protocol improvements might -allow lifting this restriction. - -Alternates -~~~~~~~~~~ -For the same reason, a sha256 repository cannot borrow objects from a -sha1 repository using objects/info/alternates or -$GIT_ALTERNATE_OBJECT_REPOSITORIES. - -git notes -~~~~~~~~~ -The "git notes" tool annotates objects using their sha1-name as key. -This design does not describe a way to migrate notes trees to use -sha256-names. 
That migration is expected to happen separately (for -example using a file at the root of the notes tree to describe which -hash it uses). - -Server-side cost -~~~~~~~~~~~~~~~~ -Until Git protocol gains SHA-256 support, using SHA-256 based storage -on public-facing Git servers is strongly discouraged. Once Git -protocol gains SHA-256 support, SHA-256 based servers are likely not -to support SHA-1 compatibility, to avoid what may be a very expensive -hash re-encode during clone and to encourage peers to modernize. - -The design described here allows fetches by SHA-1 clients of a -personal SHA-256 repository because it's not much more difficult than -allowing pushes from that repository. This support needs to be guarded -by a configuration option --- servers like git.kernel.org that serve a -large number of clients would not be expected to bear that cost. - -Meaning of signatures -~~~~~~~~~~~~~~~~~~~~~ -The signed payload for signed commits and tags does not explicitly -name the hash used to identify objects. If some day Git adopts a new -hash function with the same length as the current SHA-1 (40 -hexadecimal digit) or SHA-256 (64 hexadecimal digit) objects then the -intent behind the PGP signed payload in an object signature is -unclear: - - object e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7 - type commit - tag v2.12.0 - tagger Junio C Hamano <gitster@pobox.com> 1487962205 -0800 - - Git 2.12 - -Does this mean Git v2.12.0 is the commit with sha1-name -e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7 or the commit with -new-40-digit-hash-name e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7? - -Fortunately SHA-256 and SHA-1 have different lengths. If Git starts -using another hash with the same length to name objects, then it will -need to change the format of signed payloads using that hash to -address this issue. 
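Because the two hashes produce names of different lengths, a full object name can be attributed to a hash function by length alone. A minimal Python sketch of that check follows; the function name is hypothetical, and abbreviated names are deliberately rejected because their length says nothing about the hash.

```python
FULL_NAME_LENGTHS = {40: "sha1", 64: "sha256"}


def hash_for_full_name(name: str) -> str:
    """Map a full hexadecimal object name to the hash that produced it."""
    if not name or any(c not in "0123456789abcdef" for c in name):
        raise ValueError("not a lowercase hexadecimal object name")
    try:
        return FULL_NAME_LENGTHS[len(name)]
    except KeyError:
        raise ValueError("not a full sha1-name or sha256-name") from None


print(hash_for_full_name("e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7"))
```

A future hash with a 40 or 64 digit name would break this mapping, which is exactly the ambiguity described above.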
- -Object names on the command line -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -To support the transition (see Transition plan below), this design -supports four different modes of operation: - - 1. ("dark launch") Treat object names input by the user as SHA-1 and - convert any object names written to output to SHA-1, but store - objects using SHA-256. This allows users to test the code with no - visible behavior change except for performance. It also - allows running even tests that assume the SHA-1 hash function, to - sanity-check the behavior of the new mode. - - 2. ("early transition") Allow both SHA-1 and SHA-256 object names in - input. Any object names written to output use SHA-1. This allows - users to continue to make use of SHA-1 to communicate with peers - (e.g. by email) that have not migrated yet and prepares for mode 3. - - 3. ("late transition") Allow both SHA-1 and SHA-256 object names in - input. Any object names written to output use SHA-256. In this - mode, users are using a more secure object naming method by - default. The disruption is minimal as long as most of their peers - are in mode 2 or mode 3. - - 4. ("post-transition") Treat object names input by the user as - SHA-256 and write output using SHA-256. This is safer than mode 3 - because there is less risk that input is incorrectly interpreted - using the wrong hash function. - -The mode is specified in configuration. - -The user can also explicitly specify which format to use for a -particular revision specifier and for output, overriding the mode. For -example: - -git --output-format=sha1 log abac87a^{sha1}..f787cac^{sha256} - -Choice of Hash --------------- -In early 2005, around the time that Git was written, Xiaoyun Wang, -Yiqun Lisa Yin, and Hongbo Yu announced an attack finding SHA-1 -collisions in 2^69 operations. In August they published details. -Luckily, no practical demonstrations of a collision in full SHA-1 were -published until more than 10 years later, in 2017. 
- -Git v2.13.0 and later subsequently moved to a hardened SHA-1 -implementation by default that mitigates the SHAttered attack, but -SHA-1 is still believed to be weak. - -The hash to replace this hardened SHA-1 should be stronger than SHA-1 -was: we would like it to be trustworthy and useful in practice for at -least 10 years. - -Some other relevant properties: - -1. A 256-bit hash (long enough to match common security practice; not - excessively long to hurt performance and disk usage). - -2. High quality implementations should be widely available (e.g., in - OpenSSL and Apple CommonCrypto). - -3. The hash function's properties should match Git's needs (e.g. Git - requires collision and 2nd preimage resistance and does not require - length extension resistance). - -4. As a tiebreaker, the hash should be fast to compute (fortunately - many contenders are faster than SHA-1). - -We choose SHA-256. - -Transition plan ---------------- -Some initial steps can be implemented independently of one another: -- adding a hash function API (vtable) -- teaching fsck to tolerate the gpgsig-sha256 field -- excluding gpgsig-* from the fields copied by "git commit --amend" -- annotating tests that depend on SHA-1 values with a SHA1 test - prerequisite -- using "struct object_id", GIT_MAX_RAWSZ, and GIT_MAX_HEXSZ - consistently instead of "unsigned char *" and the hardcoded - constants 20 and 40. -- introducing index v3 -- adding support for the PSRC field and safer object pruning - - -The first user-visible change is the introduction of the objectFormat -extension (without compatObjectFormat). 
This requires: -- teaching fsck about this mode of operation -- using the hash function API (vtable) when computing object names -- signing objects and verifying signatures -- rejecting attempts to fetch from or push to an incompatible - repository - -Next comes introduction of compatObjectFormat: -- implementing the loose-object-idx -- translating object names between object formats -- translating object content between object formats -- generating and verifying signatures in the compat format -- adding appropriate index entries when adding a new object to the - object store -- --output-format option -- ^{sha1} and ^{sha256} revision notation -- configuration to specify default input and output format (see - "Object names on the command line" above) - -The next step is supporting fetches and pushes to SHA-1 repositories: -- allow pushes to a repository using the compat format -- generate a topologically sorted list of the SHA-1 names of fetched - objects -- convert the fetched packfile to sha256 format and generate an idx - file -- re-sort to match the order of objects in the fetched packfile - -The infrastructure supporting fetch also allows converting an existing -repository. In converted repositories and new clones, end users can -gain support for the new hash function without any visible change in -behavior (see "dark launch" in the "Object names on the command line" -section). In particular this allows users to verify SHA-256 signatures -on objects in the repository, and it should ensure the transition code -is stable in production in preparation for using it more widely. - -Over time projects would encourage their users to adopt the "early -transition" and then "late transition" modes to take advantage of the -new, more futureproof SHA-256 object names. - -When objectFormat and compatObjectFormat are both set, commands -generating signatures would generate both SHA-1 and SHA-256 signatures -by default to support both new and old users. 
- -In projects using SHA-256 heavily, users could be encouraged to adopt -the "post-transition" mode to avoid accidentally making implicit use -of SHA-1 object names. - -Once a critical mass of users have upgraded to a version of Git that -can verify SHA-256 signatures and have converted their existing -repositories to support verifying them, we can add support for a -setting to generate only SHA-256 signatures. This is expected to be at -least a year later. - -That is also a good moment to advertise the ability to convert -repositories to use SHA-256 only, stripping out all SHA-1 related -metadata. This improves performance by eliminating translation -overhead and security by avoiding the possibility of accidentally -relying on the safety of SHA-1. - -Updating Git's protocols to allow a server to specify which hash -functions it supports is also an important part of this transition. It -is not discussed in detail in this document but this transition plan -assumes it happens. :) - -Alternatives considered ------------------------ -Upgrading everyone working on a particular project on a flag day -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Projects like the Linux kernel are large and complex enough that -flipping the switch for all projects based on the repository at once -is infeasible. - -Not only would all developers and server operators supporting -developers have to switch on the same flag day, but supporting tooling -(continuous integration, code review, bug trackers, etc) would have to -be adapted as well. This also makes it difficult to get early feedback -from some project participants testing before it is time for mass -adoption. - -Using hash functions in parallel -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -(e.g. https://lore.kernel.org/git/22708.8913.864049.452252@chiark.greenend.org.uk/ ) -Objects newly created would be addressed by the new hash, but inside -such an object (e.g. 
commit) it is still possible to address objects -using the old hash function. -* You cannot trust its history (needed for bisectability) in the - future without further work -* Maintenance burden as the number of supported hash functions grows - (they will never go away, so they accumulate). In this proposal, by - comparison, converted objects lose all references to SHA-1. - -Signed objects with multiple hashes -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Instead of introducing the gpgsig-sha256 field in commit and tag objects -for sha256-content based signatures, an earlier version of this design -added "hash sha256 <sha256-name>" fields to strengthen the existing -sha1-content based signatures. - -In other words, a single signature was used to attest to the object -content using both hash functions. This had some advantages: -* Using one signature instead of two speeds up the signing process. -* Having one signed payload with both hashes allows the signer to - attest to the sha1-name and sha256-name referring to the same object. -* All users consume the same signature. Broken signatures are likely - to be detected quickly using current versions of git. - -However, it also came with disadvantages: -* Verifying a signed object requires access to the sha1-names of all - objects it references, even after the transition is complete and - translation table is no longer needed for anything else. To support - this, the design added fields such as "hash sha1 tree <sha1-name>" - and "hash sha1 parent <sha1-name>" to the sha256-content of a signed - commit, complicating the conversion process. -* Allowing signed objects without a sha1 (for after the transition is - complete) complicated the design further, requiring a "nohash sha1" - field to suppress including "hash sha1" fields in the sha256-content - and signed payload. 
- -Lazily populated translation table -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Some of the work of building the translation table could be deferred to -push time, but that would significantly complicate and slow down pushes. -Calculating the sha1-name at object creation time at the same time it is -being streamed to disk and having its sha256-name calculated should be -an acceptable cost. - -Document History ----------------- - -2017-03-03 -bmwill@google.com, jonathantanmy@google.com, jrnieder@gmail.com, -sbeller@google.com - -Initial version sent to -http://lore.kernel.org/git/20170304011251.GA26789@aiede.mtv.corp.google.com - -2017-03-03 jrnieder@gmail.com -Incorporated suggestions from jonathantanmy and sbeller: -* describe purpose of signed objects with each hash type -* redefine signed object verification using object content under the - first hash function - -2017-03-06 jrnieder@gmail.com -* Use SHA3-256 instead of SHA2 (thanks, Linus and brian m. carlson).[1][2] -* Make sha3-based signatures a separate field, avoiding the need for - "hash" and "nohash" fields (thanks to peff[3]). -* Add a sorting phase to fetch (thanks to Junio for noticing the need - for this). -* Omit blobs from the topological sort during fetch (thanks to peff). -* Discuss alternates, git notes, and git servers in the caveats - section (thanks to Junio Hamano, brian m. carlson[4], and Shawn - Pearce). -* Clarify language throughout (thanks to various commenters, - especially Junio). - -2017-09-27 jrnieder@gmail.com, sbeller@google.com -* use placeholder NewHash instead of SHA3-256 -* describe criteria for picking a hash function. 
-* include a transition plan (thanks especially to Brandon Williams - for fleshing these ideas out) -* define the translation table (thanks, Shawn Pearce[5], Jonathan - Tan, and Masaya Suzuki) -* avoid loose object overhead by packing more aggressively in - "git gc --auto" - -Later history: - - See the history of this file in git.git for the history of subsequent - edits. This document history is no longer being maintained as it - would now be superfluous to the commit log - -[1] http://lore.kernel.org/git/CA+55aFzJtejiCjV0e43+9oR3QuJK2PiFiLQemytoLpyJWe6P9w@mail.gmail.com/ -[2] http://lore.kernel.org/git/CA+55aFz+gkAsDZ24zmePQuEs1XPS9BP_s8O7Q4wQ7LV7X5-oDA@mail.gmail.com/ -[3] http://lore.kernel.org/git/20170306084353.nrns455dvkdsfgo5@sigill.intra.peff.net/ -[4] http://lore.kernel.org/git/20170304224936.rqqtkdvfjgyezsht@genre.crustytoothpaste.net -[5] https://lore.kernel.org/git/CAJo=hJtoX9=AyLHHpUJS7fueV9ciZ_MNpnEPHUz8Whui6g9F0A@mail.gmail.com/ diff --git a/third_party/git/Documentation/technical/http-protocol.txt b/third_party/git/Documentation/technical/http-protocol.txt deleted file mode 100644 index 96d89ea9b226..000000000000 --- a/third_party/git/Documentation/technical/http-protocol.txt +++ /dev/null @@ -1,519 +0,0 @@ -HTTP transfer protocols -======================= - -Git supports two HTTP based transfer protocols. A "dumb" protocol -which requires only a standard HTTP server on the server end of the -connection, and a "smart" protocol which requires a Git aware CGI -(or server module). This document describes both protocols. - -As a design feature smart clients can automatically upgrade "dumb" -protocol URLs to smart URLs. This permits all users to have the -same published URL, and the peers automatically select the most -efficient transport available to them. 
- - -URL Format ----------- - -URLs for Git repositories accessed by HTTP use the standard HTTP -URL syntax documented by RFC 1738, so they are of the form: - - http://<host>:<port>/<path>?<searchpart> - -Within this documentation the placeholder `$GIT_URL` will stand for -the http:// repository URL entered by the end-user. - -Servers SHOULD handle all requests to locations matching `$GIT_URL`, as -both the "smart" and "dumb" HTTP protocols used by Git operate -by appending additional path components onto the end of the user -supplied `$GIT_URL` string. - -An example of a dumb client requesting for a loose object: - - $GIT_URL: http://example.com:8080/git/repo.git - URL request: http://example.com:8080/git/repo.git/objects/d0/49f6c27a2244e12041955e262a404c7faba355 - -An example of a smart request to a catch-all gateway: - - $GIT_URL: http://example.com/daemon.cgi?svc=git&q= - URL request: http://example.com/daemon.cgi?svc=git&q=/info/refs&service=git-receive-pack - -An example of a request to a submodule: - - $GIT_URL: http://example.com/git/repo.git/path/submodule.git - URL request: http://example.com/git/repo.git/path/submodule.git/info/refs - -Clients MUST strip a trailing `/`, if present, from the user supplied -`$GIT_URL` string to prevent empty path tokens (`//`) from appearing -in any URL sent to a server. Compatible clients MUST expand -`$GIT_URL/info/refs` as `foo/info/refs` and not `foo//info/refs`. - - -Authentication --------------- - -Standard HTTP authentication is used if authentication is required -to access a repository, and MAY be configured and enforced by the -HTTP server software. - -Because Git repositories are accessed by standard path components -server administrators MAY use directory based permissions within -their HTTP server to control repository access. - -Clients SHOULD support Basic authentication as described by RFC 2617. 
-Servers SHOULD support Basic authentication by relying upon the -HTTP server placed in front of the Git server software. - -Servers SHOULD NOT require HTTP cookies for the purposes of -authentication or access control. - -Clients and servers MAY support other common forms of HTTP based -authentication, such as Digest authentication. - - -SSL ---- - -Clients and servers SHOULD support SSL, particularly to protect -passwords when relying on Basic HTTP authentication. - - -Session State -------------- - -The Git over HTTP protocol (much like HTTP itself) is stateless -from the perspective of the HTTP server side. All state MUST be -retained and managed by the client process. This permits simple -round-robin load-balancing on the server side, without needing to -worry about state management. - -Clients MUST NOT require state management on the server side in -order to function correctly. - -Servers MUST NOT require HTTP cookies in order to function correctly. -Clients MAY store and forward HTTP cookies during request processing -as described by RFC 2616 (HTTP/1.1). Servers SHOULD ignore any -cookies sent by a client. - - -General Request Processing --------------------------- - -Except where noted, all standard HTTP behavior SHOULD be assumed -by both client and server. This includes (but is not necessarily -limited to): - -If there is no repository at `$GIT_URL`, or the resource pointed to by a -location matching `$GIT_URL` does not exist, the server MUST NOT respond -with `200 OK` response. A server SHOULD respond with -`404 Not Found`, `410 Gone`, or any other suitable HTTP status code -which does not imply the resource exists as requested. - -If there is a repository at `$GIT_URL`, but access is not currently -permitted, the server MUST respond with the `403 Forbidden` HTTP -status code. - -Servers SHOULD support both HTTP 1.0 and HTTP 1.1. -Servers SHOULD support chunked encoding for both request and response -bodies. 
- -Clients SHOULD support both HTTP 1.0 and HTTP 1.1. -Clients SHOULD support chunked encoding for both request and response -bodies. - -Servers MAY return ETag and/or Last-Modified headers. - -Clients MAY revalidate cached entities by including If-Modified-Since -and/or If-None-Match request headers. - -Servers MAY return `304 Not Modified` if the relevant headers appear -in the request and the entity has not changed. Clients MUST treat -`304 Not Modified` identical to `200 OK` by reusing the cached entity. - -Clients MAY reuse a cached entity without revalidation if the -Cache-Control and/or Expires header permits caching. Clients and -servers MUST follow RFC 2616 for cache controls. - - -Discovering References ----------------------- - -All HTTP clients MUST begin either a fetch or a push exchange by -discovering the references available on the remote repository. - -Dumb Clients -~~~~~~~~~~~~ - -HTTP clients that only support the "dumb" protocol MUST discover -references by making a request for the special info/refs file of -the repository. - -Dumb HTTP clients MUST make a `GET` request to `$GIT_URL/info/refs`, -without any search/query parameters. - - C: GET $GIT_URL/info/refs HTTP/1.0 - - S: 200 OK - S: - S: 95dcfa3633004da0049d3d0fa03f80589cbcaf31 refs/heads/maint - S: d049f6c27a2244e12041955e262a404c7faba355 refs/heads/master - S: 2cb58b79488a98d2721cea644875a8dd0026b115 refs/tags/v1.0 - S: a3c2e2402b99163d1d59756e5f207ae21cccba4c refs/tags/v1.0^{} - -The Content-Type of the returned info/refs entity SHOULD be -`text/plain; charset=utf-8`, but MAY be any content type. -Clients MUST NOT attempt to validate the returned Content-Type. -Dumb servers MUST NOT return a return type starting with -`application/x-git-`. - -Cache-Control headers MAY be returned to disable caching of the -returned entity. - -When examining the response clients SHOULD only examine the HTTP -status code. Valid responses are `200 OK`, or `304 Not Modified`. 
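As an illustration only, a dumb client could turn the info/refs entity into a mapping of ref names to object ids roughly as follows. The sketch assumes the record layout given by the grammar for this file (object id, a tab, the ref name, then LF); the function name `parse_info_refs` is hypothetical and is not part of Git.

```python
def parse_info_refs(body: bytes) -> dict:
    """Parse a dumb-protocol info/refs body into {refname: obj-id}.

    Assumes each record is "<obj-id> TAB <refname> LF".
    """
    refs = {}
    for line in body.decode("utf-8").split("\n"):
        if not line:
            continue  # skip the empty piece after the final LF
        obj_id, refname = line.split("\t", 1)
        refs[refname] = obj_id
    return refs


# The example response shown above, with tabs restored between fields.
sample = (
    b"95dcfa3633004da0049d3d0fa03f80589cbcaf31\trefs/heads/maint\n"
    b"d049f6c27a2244e12041955e262a404c7faba355\trefs/heads/master\n"
    b"2cb58b79488a98d2721cea644875a8dd0026b115\trefs/tags/v1.0\n"
    b"a3c2e2402b99163d1d59756e5f207ae21cccba4c\trefs/tags/v1.0^{}\n"
)
refs = parse_info_refs(sample)
```

A peeled record (`refs/tags/v1.0^{}`) is kept under its own key here; a real client would more likely attach it to the tag it peels.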
- -The returned content is a UNIX formatted text file describing -each ref and its known value. The file SHOULD be sorted by name -according to the C locale ordering. The file SHOULD NOT include -the default ref named `HEAD`. - - info_refs = *( ref_record ) - ref_record = any_ref / peeled_ref - - any_ref = obj-id HTAB refname LF - peeled_ref = obj-id HTAB refname LF - obj-id HTAB refname "^{}" LF - -Smart Clients -~~~~~~~~~~~~~ - -HTTP clients that support the "smart" protocol (or both the -"smart" and "dumb" protocols) MUST discover references by making -a parameterized request for the info/refs file of the repository. - -The request MUST contain exactly one query parameter, -`service=$servicename`, where `$servicename` MUST be the service -name the client wishes to contact to complete the operation. -The request MUST NOT contain additional query parameters. - - C: GET $GIT_URL/info/refs?service=git-upload-pack HTTP/1.0 - -dumb server reply: - - S: 200 OK - S: - S: 95dcfa3633004da0049d3d0fa03f80589cbcaf31 refs/heads/maint - S: d049f6c27a2244e12041955e262a404c7faba355 refs/heads/master - S: 2cb58b79488a98d2721cea644875a8dd0026b115 refs/tags/v1.0 - S: a3c2e2402b99163d1d59756e5f207ae21cccba4c refs/tags/v1.0^{} - -smart server reply: - - S: 200 OK - S: Content-Type: application/x-git-upload-pack-advertisement - S: Cache-Control: no-cache - S: - S: 001e# service=git-upload-pack\n - S: 0000 - S: 004895dcfa3633004da0049d3d0fa03f80589cbcaf31 refs/heads/maint\0multi_ack\n - S: 003fd049f6c27a2244e12041955e262a404c7faba355 refs/heads/master\n - S: 003c2cb58b79488a98d2721cea644875a8dd0026b115 refs/tags/v1.0\n - S: 003fa3c2e2402b99163d1d59756e5f207ae21cccba4c refs/tags/v1.0^{}\n - S: 0000 - -The client may send Extra Parameters (see -Documentation/technical/pack-protocol.txt) as a colon-separated string -in the Git-Protocol HTTP header. - -Dumb Server Response -^^^^^^^^^^^^^^^^^^^^ -Dumb servers MUST respond with the dumb server reply format. 
- -See the prior section under dumb clients for a more detailed -description of the dumb server response. - -Smart Server Response -^^^^^^^^^^^^^^^^^^^^^ -If the server does not recognize the requested service name, or the -requested service name has been disabled by the server administrator, -the server MUST respond with the `403 Forbidden` HTTP status code. - -Otherwise, smart servers MUST respond with the smart server reply -format for the requested service name. - -Cache-Control headers SHOULD be used to disable caching of the -returned entity. - -The Content-Type MUST be `application/x-$servicename-advertisement`. -Clients SHOULD fall back to the dumb protocol if another content -type is returned. When falling back to the dumb protocol clients -SHOULD NOT make an additional request to `$GIT_URL/info/refs`, but -instead SHOULD use the response already in hand. Clients MUST NOT -continue if they do not support the dumb protocol. - -Clients MUST validate the status code is either `200 OK` or -`304 Not Modified`. - -Clients MUST validate the first five bytes of the response entity -matches the regex `^[0-9a-f]{4}#`. If this test fails, clients -MUST NOT continue. - -Clients MUST parse the entire response as a sequence of pkt-line -records. - -Clients MUST verify the first pkt-line is `# service=$servicename`. -Servers MUST set $servicename to be the request parameter value. -Servers SHOULD include an LF at the end of this line. -Clients MUST ignore an LF at the end of the line. - -Servers MUST terminate the response with the magic `0000` end -pkt-line marker. - -The returned response is a pkt-line stream describing each ref and -its known value. The stream SHOULD be sorted by name according to -the C locale ordering. The stream SHOULD include the default ref -named `HEAD` as the first ref. The stream MUST include capability -declarations behind a NUL on the first ref. - -The returned response contains "version 1" if "version=1" was sent as an -Extra Parameter. 
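The client-side checks listed above can be sketched as follows. This is a minimal illustration, not Git's implementation: the helper names are hypothetical, and it relies on the pkt-line framing defined in protocol-common.txt (four hex digits giving the total length including the four length bytes, with `0000` as the flush packet). The HTTP status check is assumed to have happened already.

```python
import re


def read_pkt_lines(data: bytes):
    """Yield each pkt-line payload; a flush packet ("0000") yields None."""
    pos = 0
    while pos < len(data):
        length = int(data[pos:pos + 4], 16)
        if length == 0:
            yield None
            pos += 4
            continue
        if length < 4 or pos + length > len(data):
            raise ValueError("malformed pkt-line")
        yield data[pos + 4:pos + length]
        pos += length


def check_smart_reply(body: bytes, service: str):
    """Validate a smart info/refs reply and return the packets after the header."""
    # The first five bytes must look like a pkt-line length followed by '#'.
    if not re.match(rb"^[0-9a-f]{4}#", body):
        raise ValueError("not a smart reply")
    pkts = list(read_pkt_lines(body))
    first = pkts[0]
    # A trailing LF on the service line is ignored.
    if first is None or first.rstrip(b"\n") != b"# service=" + service.encode():
        raise ValueError("unexpected service line")
    if pkts[-1] is not None:
        raise ValueError("reply not terminated by a flush packet")
    return pkts[1:]


sample = (
    b"001e# service=git-upload-pack\n"
    b"0000"
    b"004895dcfa3633004da0049d3d0fa03f80589cbcaf31 refs/heads/maint\0multi_ack\n"
    b"003fd049f6c27a2244e12041955e262a404c7faba355 refs/heads/master\n"
    b"0000"
)
packets = check_smart_reply(sample, "git-upload-pack")
```

A client that also supports the dumb protocol would catch the first `ValueError` and reuse the same response body as a dumb reply instead of failing.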
- - smart_reply = PKT-LINE("# service=$servicename" LF) - "0000" - *1("version 1") - ref_list - "0000" - ref_list = empty_list / non_empty_list - - empty_list = PKT-LINE(zero-id SP "capabilities^{}" NUL cap-list LF) - - non_empty_list = PKT-LINE(obj-id SP name NUL cap_list LF) - *ref_record - - cap-list = capability *(SP capability) - capability = 1*(LC_ALPHA / DIGIT / "-" / "_") - LC_ALPHA = %x61-7A - - ref_record = any_ref / peeled_ref - any_ref = PKT-LINE(obj-id SP name LF) - peeled_ref = PKT-LINE(obj-id SP name LF) - PKT-LINE(obj-id SP name "^{}" LF - - -Smart Service git-upload-pack ------------------------------- -This service reads from the repository pointed to by `$GIT_URL`. - -Clients MUST first perform ref discovery with -`$GIT_URL/info/refs?service=git-upload-pack`. - - C: POST $GIT_URL/git-upload-pack HTTP/1.0 - C: Content-Type: application/x-git-upload-pack-request - C: - C: 0032want 0a53e9ddeaddad63ad106860237bbf53411d11a7\n - C: 0032have 441b40d833fdfa93eb2908e52742248faf0ee993\n - C: 0000 - - S: 200 OK - S: Content-Type: application/x-git-upload-pack-result - S: Cache-Control: no-cache - S: - S: ....ACK %s, continue - S: ....NAK - -Clients MUST NOT reuse or revalidate a cached response. -Servers MUST include sufficient Cache-Control headers -to prevent caching of the response. - -Servers SHOULD support all capabilities defined here. - -Clients MUST send at least one "want" command in the request body. -Clients MUST NOT reference an id in a "want" command which did not -appear in the response obtained through ref discovery unless the -server advertises capability `allow-tip-sha1-in-want` or -`allow-reachable-sha1-in-want`. - - compute_request = want_list - have_list - request_end - request_end = "0000" / "done" - - want_list = PKT-LINE(want SP cap_list LF) - *(want_pkt) - want_pkt = PKT-LINE(want LF) - want = "want" SP id - cap_list = capability *(SP capability) - - have_list = *PKT-LINE("have" SP id LF) - -TODO: Document this further. 
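To make the request grammar above concrete, the following sketch assembles a request body from lists of "want" and "have" ids. The function names are hypothetical; the pkt-line length prefix follows protocol-common.txt (four hex digits counting the payload plus the four length bytes). Capabilities, if any, are appended to the first "want" line only.

```python
def pkt_line(payload: bytes) -> bytes:
    """Frame a payload as a pkt-line: 4 hex digits of total length, then payload."""
    return b"%04x" % (len(payload) + 4) + payload


def build_upload_pack_request(wants, haves, caps=()):
    """Build a git-upload-pack request body ending in a flush packet."""
    if not wants:
        raise ValueError("at least one want is required")
    first = b"want " + wants[0].encode()
    if caps:
        first += b" " + " ".join(caps).encode()
    parts = [pkt_line(first + b"\n")]
    for obj_id in wants[1:]:
        parts.append(pkt_line(b"want " + obj_id.encode() + b"\n"))
    for obj_id in haves:
        parts.append(pkt_line(b"have " + obj_id.encode() + b"\n"))
    parts.append(b"0000")  # flush packet terminates the stream
    return b"".join(parts)


body = build_upload_pack_request(
    ["0a53e9ddeaddad63ad106860237bbf53411d11a7"],
    ["441b40d833fdfa93eb2908e52742248faf0ee993"],
)
```

With these inputs the result matches the request body shown in the example exchange above.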
- -The Negotiation Algorithm -~~~~~~~~~~~~~~~~~~~~~~~~~ -The computation to select the minimal pack proceeds as follows -(C = client, S = server): - -'init step:' - -C: Use ref discovery to obtain the advertised refs. - -C: Place any object seen into set `advertised`. - -C: Build an empty set, `common`, to hold the objects that are later - determined to be on both ends. - -C: Build a set, `want`, of the objects from `advertised` the client - wants to fetch, based on what it saw during ref discovery. - -C: Start a queue, `c_pending`, ordered by commit time (popping newest - first). Add all client refs. When a commit is popped from - the queue its parents SHOULD be automatically inserted back. - Commits MUST only enter the queue once. - -'one compute step:' - -C: Send one `$GIT_URL/git-upload-pack` request: - - C: 0032want <want #1>............................... - C: 0032want <want #2>............................... - .... - C: 0032have <common #1>............................. - C: 0032have <common #2>............................. - .... - C: 0032have <have #1>............................... - C: 0032have <have #2>............................... - .... - C: 0000 - -The stream is organized into "commands", with each command -appearing by itself in a pkt-line. Within a command line, -the text leading up to the first space is the command name, -and the remainder of the line to the first LF is the value. -Command lines are terminated with an LF as the last byte of -the pkt-line value. - -Commands MUST appear in the following order, if they appear -at all in the request stream: - -* "want" -* "have" - -The stream is terminated by a pkt-line flush (`0000`). - -A single "want" or "have" command MUST have one hex formatted -object name as its value. Multiple object names MUST be sent by sending -multiple commands. Object names MUST be given using the object format -negotiated through the `object-format` capability (default SHA-1). 
- -The `have` list is created by popping the first 32 commits -from `c_pending`. Less can be supplied if `c_pending` empties. - -If the client has sent 256 "have" commits and has not yet -received one of those back from `s_common`, or the client has -emptied `c_pending` it SHOULD include a "done" command to let -the server know it won't proceed: - - C: 0009done - -S: Parse the git-upload-pack request: - -Verify all objects in `want` are directly reachable from refs. - -The server MAY walk backwards through history or through -the reflog to permit slightly stale requests. - -If no "want" objects are received, send an error: -TODO: Define error if no "want" lines are requested. - -If any "want" object is not reachable, send an error: -TODO: Define error if an invalid "want" is requested. - -Create an empty list, `s_common`. - -If "have" was sent: - -Loop through the objects in the order supplied by the client. - -For each object, if the server has the object reachable from -a ref, add it to `s_common`. If a commit is added to `s_common`, -do not add any ancestors, even if they also appear in `have`. - -S: Send the git-upload-pack response: - -If the server has found a closed set of objects to pack or the -request ends with "done", it replies with the pack. -TODO: Document the pack based response - - S: PACK... - -The returned stream is the side-band-64k protocol supported -by the git-upload-pack service, and the pack is embedded into -stream 1. Progress messages from the server side MAY appear -in stream 2. - -Here a "closed set of objects" is defined to have at least -one path from every "want" to at least one "common" object. - -If the server needs more information, it replies with a -status continue response: -TODO: Document the non-pack response - -C: Parse the upload-pack response: - TODO: Document parsing response - -'Do another compute step.' 
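The client-side bookkeeping for `c_pending` can be sketched as below. This is only an illustration under stated assumptions: the commit graph is a hypothetical in-memory dictionary mapping a commit id to its commit time and parent ids, and the class and method names are invented for the example. It shows the three properties described above: newest commit popped first, parents inserted on pop, and each commit entering the queue once.

```python
import heapq


class PendingQueue:
    """Queue of commits ordered by commit time, newest first."""

    def __init__(self, commits):
        # commits: {commit_id: (commit_time, [parent_ids])}, a stand-in
        # for the client's object database.
        self.commits = commits
        self.heap = []
        self.seen = set()

    def push(self, commit_id):
        if commit_id in self.seen:
            return  # a commit enters the queue only once
        self.seen.add(commit_id)
        # heapq is a min-heap, so negate the time to pop the newest first.
        heapq.heappush(self.heap, (-self.commits[commit_id][0], commit_id))

    def pop(self):
        _, commit_id = heapq.heappop(self.heap)
        for parent in self.commits[commit_id][1]:
            self.push(parent)  # parents go back into the queue
        return commit_id

    def next_haves(self, limit=32):
        """Pop up to `limit` commits for the next "have" list."""
        batch = []
        while self.heap and len(batch) < limit:
            batch.append(self.pop())
        return batch


# A linear history c0 <- c1 <- ... <- c39, with commit time equal to the index.
graph = {"c%d" % i: (i, ["c%d" % (i - 1)] if i else []) for i in range(40)}
queue = PendingQueue(graph)
queue.push("c39")  # the only client ref in this example
first = queue.next_haves()
second = queue.next_haves()
```

Here `first` holds the 32 newest commits and `second` holds the remaining 8, at which point `c_pending` is empty and the client would send "done".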
- - -Smart Service git-receive-pack ------------------------------- -This service reads from the repository pointed to by `$GIT_URL`. - -Clients MUST first perform ref discovery with -`$GIT_URL/info/refs?service=git-receive-pack`. - - C: POST $GIT_URL/git-receive-pack HTTP/1.0 - C: Content-Type: application/x-git-receive-pack-request - C: - C: ....0a53e9ddeaddad63ad106860237bbf53411d11a7 441b40d833fdfa93eb2908e52742248faf0ee993 refs/heads/maint\0 report-status - C: 0000 - C: PACK.... - - S: 200 OK - S: Content-Type: application/x-git-receive-pack-result - S: Cache-Control: no-cache - S: - S: .... - -Clients MUST NOT reuse or revalidate a cached response. -Servers MUST include sufficient Cache-Control headers -to prevent caching of the response. - -Servers SHOULD support all capabilities defined here. - -Clients MUST send at least one command in the request body. -Within the command portion of the request body clients SHOULD send -the id obtained through ref discovery as old_id. - - update_request = command_list - "PACK" <binary data> - - command_list = PKT-LINE(command NUL cap_list LF) - *(command_pkt) - command_pkt = PKT-LINE(command LF) - cap_list = *(SP capability) SP - - command = create / delete / update - create = zero-id SP new_id SP name - delete = old_id SP zero-id SP name - update = old_id SP new_id SP name - -TODO: Document this further. 
- - -References ----------- - -http://www.ietf.org/rfc/rfc1738.txt[RFC 1738: Uniform Resource Locators (URL)] -http://www.ietf.org/rfc/rfc2616.txt[RFC 2616: Hypertext Transfer Protocol -- HTTP/1.1] -link:technical/pack-protocol.html -link:technical/protocol-capabilities.html diff --git a/third_party/git/Documentation/technical/index-format.txt b/third_party/git/Documentation/technical/index-format.txt deleted file mode 100644 index f9a3644711b9..000000000000 --- a/third_party/git/Documentation/technical/index-format.txt +++ /dev/null @@ -1,359 +0,0 @@ -Git index format -================ - -== The Git index file has the following format - - All binary numbers are in network byte order. - In a repository using the traditional SHA-1, checksums and object IDs - (object names) mentioned below are all computed using SHA-1. Similarly, - in SHA-256 repositories, these values are computed using SHA-256. - Version 2 is described here unless stated otherwise. - - - A 12-byte header consisting of - - 4-byte signature: - The signature is { 'D', 'I', 'R', 'C' } (stands for "dircache") - - 4-byte version number: - The current supported versions are 2, 3 and 4. - - 32-bit number of index entries. - - - A number of sorted index entries (see below). - - - Extensions - - Extensions are identified by signature. Optional extensions can - be ignored if Git does not understand them. - - Git currently supports cached tree and resolve undo extensions. - - 4-byte extension signature. If the first byte is 'A'..'Z' the - extension is optional and can be ignored. - - 32-bit size of the extension - - Extension data - - - Hash checksum over the content of the index file before this checksum. - -== Index entry - - Index entries are sorted in ascending order on the name field, - interpreted as a string of unsigned bytes (i.e. memcmp() order, no - localization, no special casing of directory separator '/'). Entries - with the same name are sorted by their stage field. 
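As a small illustration of the index layout described above, the 12-byte header can be read with a fixed big-endian (network byte order) layout, and the entry ordering corresponds to a plain byte-wise comparison of names followed by stage. The function name is hypothetical and the sketch does not read entries or extensions.

```python
import struct


def parse_index_header(data: bytes):
    """Return (version, entry_count) from the first 12 bytes of an index file."""
    signature, version, count = struct.unpack(">4sII", data[:12])
    if signature != b"DIRC":
        raise ValueError("not a Git index file")
    if version not in (2, 3, 4):
        raise ValueError("unsupported index version %d" % version)
    return version, count


header = struct.pack(">4sII", b"DIRC", 2, 3)
version, count = parse_index_header(header)

# Entries sort by name as unsigned bytes, then by stage. Python compares
# bytes objects byte by byte, so a (name, stage) tuple gives this order.
entries = [(b"a/b", 0), (b"a.b", 2), (b"a.b", 1), (b"a", 0)]
ordered = sorted(entries)
```

Note that `a.b` sorts before `a/b` because `.` (0x2e) is less than `/` (0x2f); the directory separator gets no special treatment.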
- - 32-bit ctime seconds, the last time a file's metadata changed - this is stat(2) data - - 32-bit ctime nanosecond fractions - this is stat(2) data - - 32-bit mtime seconds, the last time a file's data changed - this is stat(2) data - - 32-bit mtime nanosecond fractions - this is stat(2) data - - 32-bit dev - this is stat(2) data - - 32-bit ino - this is stat(2) data - - 32-bit mode, split into (high to low bits) - - 4-bit object type - valid values in binary are 1000 (regular file), 1010 (symbolic link) - and 1110 (gitlink) - - 3-bit unused - - 9-bit unix permission. Only 0755 and 0644 are valid for regular files. - Symbolic links and gitlinks have value 0 in this field. - - 32-bit uid - this is stat(2) data - - 32-bit gid - this is stat(2) data - - 32-bit file size - This is the on-disk size from stat(2), truncated to 32-bit. - - Object name for the represented object - - A 16-bit 'flags' field split into (high to low bits) - - 1-bit assume-valid flag - - 1-bit extended flag (must be zero in version 2) - - 2-bit stage (during merge) - - 12-bit name length if the length is less than 0xFFF; otherwise 0xFFF - is stored in this field. - - (Version 3 or later) A 16-bit field, only applicable if the - "extended flag" above is 1, split into (high to low bits). - - 1-bit reserved for future - - 1-bit skip-worktree flag (used by sparse checkout) - - 1-bit intent-to-add flag (used by "git add -N") - - 13-bit unused, must be zero - - Entry path name (variable length) relative to top level directory - (without leading slash). '/' is used as path separator. The special - path components ".", ".." and ".git" (without quotes) are disallowed. - Trailing slash is also disallowed. - - The exact encoding is undefined, but the '.' and '/' characters - are encoded in 7-bit ASCII and the encoding cannot contain a NUL - byte (iow, this is a UNIX pathname). 
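The bit layouts of the mode and flags fields above can be unpacked with shifts and masks. The sketch below is an illustration with hypothetical function names; it assumes the 16 described mode bits occupy the low half of the 32-bit mode field, which matches the usual octal modes such as 100644.

```python
OBJECT_TYPES = {
    0b1000: "regular file",
    0b1010: "symbolic link",
    0b1110: "gitlink",
}


def decode_mode(mode: int):
    """Split a mode into (object type name, 9-bit permission)."""
    obj_type = (mode >> 12) & 0xF  # 4-bit object type
    perms = mode & 0x1FF           # 9-bit unix permission
    return OBJECT_TYPES[obj_type], perms


def decode_flags(flags: int):
    """Split the 16-bit flags field, high to low bits."""
    return {
        "assume_valid": (flags >> 15) & 1,
        "extended": (flags >> 14) & 1,
        "stage": (flags >> 12) & 0x3,
        # 0xFFF means the real name length is 0xFFF or longer.
        "name_length": flags & 0xFFF,
    }


regular = decode_mode(0o100644)
symlink = decode_mode(0o120000)
flags = decode_flags(0x2005)
```

In this example `0x2005` describes a stage-2 entry whose path name is five bytes long.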
- - (Version 4) In version 4, the entry path name is prefix-compressed - relative to the path name for the previous entry (the very first - entry is encoded as if the path name for the previous entry is an - empty string). At the beginning of an entry, an integer N in the - variable width encoding (the same encoding as the offset is encoded - for OFS_DELTA pack entries; see pack-format.txt) is stored, followed - by a NUL-terminated string S. Removing N bytes from the end of the - path name for the previous entry, and replacing it with the string S - yields the path name for this entry. - - 1-8 nul bytes as necessary to pad the entry to a multiple of eight bytes - while keeping the name NUL-terminated. - - (Version 4) In version 4, the padding after the pathname does not - exist. - - Interpretation of index entries in split index mode is completely - different. See below for details. - -== Extensions - -=== Cached tree - - Cached tree extension contains pre-computed hashes for trees that can - be derived from the index. It helps speed up tree object generation - from index for a new commit. - - When a path is updated in index, the path must be invalidated and - removed from tree cache. - - The signature for this extension is { 'T', 'R', 'E', 'E' }. - - A series of entries fill the entire extension; each of which - consists of: - - - NUL-terminated path component (relative to its parent directory); - - - ASCII decimal number of entries in the index that is covered by the - tree this entry represents (entry_count); - - - A space (ASCII 32); - - - ASCII decimal number that represents the number of subtrees this - tree has; - - - A newline (ASCII 10); and - - - Object name for the object that would result from writing this span - of index as a tree. - - An entry can be in an invalidated state and is represented by having - a negative number in the entry_count field. In this case, there is no - object name and the next entry starts immediately after the newline. 
- When writing an invalid entry, -1 should always be used as entry_count. - - The entries are written out in the top-down, depth-first order. The - first entry represents the root level of the repository, followed by the - first subtree--let's call this A--of the root level (with its name - relative to the root level), followed by the first subtree of A (with - its name relative to A), ... - -=== Resolve undo - - A conflict is represented in the index as a set of higher stage entries. - When a conflict is resolved (e.g. with "git add path"), these higher - stage entries will be removed and a stage-0 entry with proper resolution - is added. - - When these higher stage entries are removed, they are saved in the - resolve undo extension, so that conflicts can be recreated (e.g. with - "git checkout -m"), in case users want to redo a conflict resolution - from scratch. - - The signature for this extension is { 'R', 'E', 'U', 'C' }. - - A series of entries fill the entire extension; each of which - consists of: - - - NUL-terminated pathname the entry describes (relative to the root of - the repository, i.e. full pathname); - - - Three NUL-terminated ASCII octal numbers, entry mode of entries in - stage 1 to 3 (a missing stage is represented by "0" in this field); - and - - - At most three object names of the entry in stages from 1 to 3 - (nothing is written for a missing stage). - -=== Split index - - In split index mode, the majority of index entries could be stored - in a separate file. This extension records the changes to be made on - top of that to produce the final index. - - The signature for this extension is { 'l', 'i', 'n', 'k' }. - - The extension consists of: - - - Hash of the shared index file. The shared index file path - is $GIT_DIR/sharedindex.<hash>. If all bits are zero, the - index does not require a shared index file. - - - An ewah-encoded delete bitmap, each bit represents an entry in the - shared index. 
If a bit is set, its corresponding entry in the - shared index will be removed from the final index. Note, because - a delete operation changes index entry positions, but we do need - original positions in replace phase, it's best to just mark - entries for removal, then do a mass deletion after replacement. - - - An ewah-encoded replace bitmap, each bit represents an entry in - the shared index. If a bit is set, its corresponding entry in the - shared index will be replaced with an entry in this index - file. All replaced entries are stored in sorted order in this - index. The first "1" bit in the replace bitmap corresponds to the - first index entry, the second "1" bit to the second entry and so - on. Replaced entries may have empty path names to save space. - - The remaining index entries after replaced ones will be added to the - final index. These added entries are also sorted by entry name then - stage. - -== Untracked cache - - Untracked cache saves the untracked file list and necessary data to - verify the cache. The signature for this extension is { 'U', 'N', - 'T', 'R' }. - - The extension starts with - - - A sequence of NUL-terminated strings, preceded by the size of the - sequence in variable width encoding. Each string describes the - environment where the cache can be used. - - - Stat data of $GIT_DIR/info/exclude. See "Index entry" section from - ctime field until "file size". - - - Stat data of core.excludesfile - - - 32-bit dir_flags (see struct dir_struct) - - - Hash of $GIT_DIR/info/exclude. A null hash means the file - does not exist. - - - Hash of core.excludesfile. A null hash means the file does - not exist. - - - NUL-terminated string of per-dir exclude file name. This usually - is ".gitignore". - - - The number of following directory blocks, variable width - encoding. If this number is zero, the extension ends here with a - following NUL. 
- - - A number of directory blocks in depth-first-search order, each - consists of - - - The number of untracked entries, variable width encoding. - - - The number of sub-directory blocks, variable width encoding. - - - The directory name terminated by NUL. - - - A number of untracked file/dir names terminated by NUL. - -The remaining data of each directory block is grouped by type: - - - An ewah bitmap, the n-th bit marks whether the n-th directory has - valid untracked cache entries. - - - An ewah bitmap, the n-th bit records "check-only" bit of - read_directory_recursive() for the n-th directory. - - - An ewah bitmap, the n-th bit indicates whether hash and stat data - is valid for the n-th directory and exists in the next data. - - - An array of stat data. The n-th data corresponds with the n-th - "one" bit in the previous ewah bitmap. - - - An array of hashes. The n-th hash corresponds with the n-th "one" bit - in the previous ewah bitmap. - - - One NUL. - -== File System Monitor cache - - The file system monitor cache tracks files for which the core.fsmonitor - hook has told us about changes. The signature for this extension is - { 'F', 'S', 'M', 'N' }. - - The extension starts with - - - 32-bit version number: the current supported version is 1. - - - 64-bit time: the extension data reflects all changes through the given - time which is stored as the nanoseconds elapsed since midnight, - January 1, 1970. - - - 32-bit bitmap size: the size of the CE_FSMONITOR_VALID bitmap. - - - An ewah bitmap, the n-th bit indicates whether the n-th index entry - is not CE_FSMONITOR_VALID. - -== End of Index Entry - - The End of Index Entry (EOIE) is used to locate the end of the variable - length index entries and the beginning of the extensions. Code can take - advantage of this to quickly locate the index extensions without having - to parse through all of the index entries. 
- - Because it must be able to be loaded before the variable length cache - entries and other index extensions, this extension must be written last. - The signature for this extension is { 'E', 'O', 'I', 'E' }. - - The extension consists of: - - - 32-bit offset to the end of the index entries - - - Hash over the extension types and their sizes (but not - their contents). E.g. if we have "TREE" extension that is N-bytes - long, "REUC" extension that is M-bytes long, followed by "EOIE", - then the hash would be: - - Hash("TREE" + <binary representation of N> + - "REUC" + <binary representation of M>) - -== Index Entry Offset Table - - The Index Entry Offset Table (IEOT) is used to help address the CPU - cost of loading the index by enabling multi-threading the process of - converting cache entries from the on-disk format to the in-memory format. - The signature for this extension is { 'I', 'E', 'O', 'T' }. - - The extension consists of: - - - 32-bit version (currently 1) - - - A number of index offset entries each consisting of: - - - 32-bit offset from the beginning of the file to the first cache entry - in this block of entries. - - - 32-bit count of cache entries in this block diff --git a/third_party/git/Documentation/technical/long-running-process-protocol.txt b/third_party/git/Documentation/technical/long-running-process-protocol.txt deleted file mode 100644 index aa0aa9af1c2e..000000000000 --- a/third_party/git/Documentation/technical/long-running-process-protocol.txt +++ /dev/null @@ -1,50 +0,0 @@ -Long-running process protocol -============================= - -This protocol is used when Git needs to communicate with an external -process throughout the entire life of a single Git command. All -communication is in pkt-line format (see technical/protocol-common.txt) -over standard input and standard output. 
- -Handshake ---------- - -Git starts by sending a welcome message (for example, -"git-filter-client"), a list of supported protocol version numbers, and -a flush packet. Git expects to read the welcome message with "server" -instead of "client" (for example, "git-filter-server"), exactly one -protocol version number from the previously sent list, and a flush -packet. All further communication will be based on the selected version. -The remaining protocol description below documents "version=2". Please -note that "version=42" in the example below does not exist and is only -there to illustrate how the protocol would look like with more than one -version. - -After the version negotiation Git sends a list of all capabilities that -it supports and a flush packet. Git expects to read a list of desired -capabilities, which must be a subset of the supported capabilities list, -and a flush packet as response: ------------------------- -packet: git> git-filter-client -packet: git> version=2 -packet: git> version=42 -packet: git> 0000 -packet: git< git-filter-server -packet: git< version=2 -packet: git< 0000 -packet: git> capability=clean -packet: git> capability=smudge -packet: git> capability=not-yet-invented -packet: git> 0000 -packet: git< capability=clean -packet: git< capability=smudge -packet: git< 0000 ------------------------- - -Shutdown --------- - -Git will close -the command pipe on exit. The filter is expected to detect EOF -and exit gracefully on its own. Git will wait until the filter -process has stopped. 
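The handshake above, seen from the external process, might look like the following sketch. It is not a complete filter: the function names are hypothetical, each text packet is assumed to end with LF, and the pkt-line framing is the one from protocol-common.txt. In-memory streams stand in for the standard input and output pipes.

```python
import io


def write_pkt(out, text: str):
    """Write one LF-terminated text packet in pkt-line framing."""
    payload = (text + "\n").encode()
    out.write(b"%04x" % (len(payload) + 4) + payload)


def write_flush(out):
    out.write(b"0000")


def read_until_flush(inp):
    """Read text packets up to the next flush packet."""
    lines = []
    while True:
        length = int(inp.read(4), 16)
        if length == 0:
            return lines
        lines.append(inp.read(length - 4).decode().rstrip("\n"))


def filter_handshake(inp, out, supported=("clean", "smudge")):
    """Answer Git's welcome and capability lists; return the chosen capabilities."""
    hello = read_until_flush(inp)
    if not hello or hello[0] != "git-filter-client" or "version=2" not in hello[1:]:
        raise ValueError("unexpected welcome message")
    write_pkt(out, "git-filter-server")
    write_pkt(out, "version=2")
    write_flush(out)
    offered = read_until_flush(inp)
    # Reply with the subset of Git's capabilities that this process supports.
    chosen = [c for c in offered
              if c.startswith("capability=") and c.split("=", 1)[1] in supported]
    for capability in chosen:
        write_pkt(out, capability)
    write_flush(out)
    return chosen


# Simulate what Git sends in the example exchange above.
git_side = io.BytesIO()
for line in ("git-filter-client", "version=2", "version=42"):
    write_pkt(git_side, line)
write_flush(git_side)
for line in ("capability=clean", "capability=smudge", "capability=not-yet-invented"):
    write_pkt(git_side, line)
write_flush(git_side)
git_side.seek(0)

reply = io.BytesIO()
chosen = filter_handshake(git_side, reply)
```

A real process would read from `sys.stdin.buffer`, write to `sys.stdout.buffer`, flush after each reply, and exit once it sees EOF on its input.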
diff --git a/third_party/git/Documentation/technical/multi-pack-index.txt b/third_party/git/Documentation/technical/multi-pack-index.txt deleted file mode 100644 index 4e7631437a58..000000000000 --- a/third_party/git/Documentation/technical/multi-pack-index.txt +++ /dev/null @@ -1,109 +0,0 @@ -Multi-Pack-Index (MIDX) Design Notes -==================================== - -The Git object directory contains a 'pack' directory containing -packfiles (with suffix ".pack") and pack-indexes (with suffix -".idx"). The pack-indexes provide a way to lookup objects and -navigate to their offset within the pack, but these must come -in pairs with the packfiles. This pairing depends on the file -names, as the pack-index differs only in suffix with its pack- -file. While the pack-indexes provide fast lookup per packfile, -this performance degrades as the number of packfiles increases, -because abbreviations need to inspect every packfile and we are -more likely to have a miss on our most-recently-used packfile. -For some large repositories, repacking into a single packfile -is not feasible due to storage space or excessive repack times. - -The multi-pack-index (MIDX for short) stores a list of objects -and their offsets into multiple packfiles. It contains: - -- A list of packfile names. -- A sorted list of object IDs. -- A list of metadata for the ith object ID including: - - A value j referring to the jth packfile. - - An offset within the jth packfile for the object. -- If large offsets are required, we use another list of large - offsets similar to version 2 pack-indexes. - -Thus, we can provide O(log N) lookup time for any number -of packfiles. - -Design Details --------------- - -- The MIDX is stored in a file named 'multi-pack-index' in the - .git/objects/pack directory. This could be stored in the pack - directory of an alternate. It refers only to packfiles in that - same directory. - -- The core.multiPackIndex config setting must be on to consume MIDX files. 
- -- The file format includes parameters for the object ID hash - function, so a future change of hash algorithm does not require - a change in format. - -- The MIDX keeps only one record per object ID. If an object appears - in multiple packfiles, then the MIDX selects the copy in the most- - recently modified packfile. - -- If there exist packfiles in the pack directory not registered in - the MIDX, then those packfiles are loaded into the `packed_git` - list and `packed_git_mru` cache. - -- The pack-indexes (.idx files) remain in the pack directory so we - can delete the MIDX file, set core.midx to false, or downgrade - without any loss of information. - -- The MIDX file format uses a chunk-based approach (similar to the - commit-graph file) that allows optional data to be added. - -Future Work ------------ - -- Add a 'verify' subcommand to the 'git midx' builtin to verify the - contents of the multi-pack-index file match the offsets listed in - the corresponding pack-indexes. - -- The multi-pack-index allows many packfiles, especially in a context - where repacking is expensive (such as a very large repo), or - unexpected maintenance time is unacceptable (such as a high-demand - build machine). However, the multi-pack-index needs to be rewritten - in full every time. We can extend the format to be incremental, so - writes are fast. By storing a small "tip" multi-pack-index that - points to large "base" MIDX files, we can keep writes fast while - still reducing the number of binary searches required for object - lookups. - -- The reachability bitmap is currently paired directly with a single - packfile, using the pack-order as the object order to hopefully - compress the bitmaps well using run-length encoding. This could be - extended to pair a reachability bitmap with a multi-pack-index. 
If - the multi-pack-index is extended to store a "stable object order" - (a function Order(hash) = integer that is constant for a given hash, - even as the multi-pack-index is updated) then a reachability bitmap - could point to a multi-pack-index and be updated independently. - -- Packfiles can be marked as "special" using empty files that share - the initial name but replace ".pack" with ".keep" or ".promisor". - We can add an optional chunk of data to the multi-pack-index that - records flags of information about the packfiles. This allows new - states, such as 'repacked' or 'redeltified', that can help with - pack maintenance in a multi-pack environment. It may also be - helpful to organize packfiles by object type (commit, tree, blob, - etc.) and use this metadata to help that maintenance. - -- The partial clone feature records special "promisor" packs that - may point to objects that are not stored locally, but available - on request to a server. The multi-pack-index does not currently - track these promisor packs. - -Related Links -------------- -[0] https://bugs.chromium.org/p/git/issues/detail?id=6 - Chromium work item for: Multi-Pack Index (MIDX) - -[1] https://lore.kernel.org/git/20180107181459.222909-1-dstolee@microsoft.com/ - An earlier RFC for the multi-pack-index feature - -[2] https://lore.kernel.org/git/alpine.DEB.2.20.1803091557510.23109@alexmv-linux/ - Git Merge 2018 Contributor's summit notes (includes discussion of MIDX) diff --git a/third_party/git/Documentation/technical/pack-format.txt b/third_party/git/Documentation/technical/pack-format.txt deleted file mode 100644 index f96b2e605f34..000000000000 --- a/third_party/git/Documentation/technical/pack-format.txt +++ /dev/null @@ -1,343 +0,0 @@ -Git pack format -=============== - -== Checksums and object IDs - -In a repository using the traditional SHA-1, pack checksums, index checksums, -and object IDs (object names) mentioned below are all computed using SHA-1. 
-Similarly, in SHA-256 repositories, these values are computed using SHA-256.
-
-== pack-*.pack files have the following format:
-
-   - A header appears at the beginning and consists of the following:
-
-     4-byte signature:
-         The signature is: {'P', 'A', 'C', 'K'}
-
-     4-byte version number (network byte order):
-         Git currently accepts version number 2 or 3 but
-         generates version 2 only.
-
-     4-byte number of objects contained in the pack (network byte order)
-
-     Observation: we cannot have more than 4G versions ;-) and
-     more than 4G objects in a pack.
-
-   - The header is followed by a number of object entries, each of
-     which looks like this:
-
-     (undeltified representation)
-     n-byte type and length (3-bit type, (n-1)*7+4-bit length)
-     compressed data
-
-     (deltified representation)
-     n-byte type and length (3-bit type, (n-1)*7+4-bit length)
-     base object name if OBJ_REF_DELTA or a negative relative
-         offset from the delta object's position in the pack if this
-         is an OBJ_OFS_DELTA object
-     compressed delta data
-
-     Observation: length of each object is encoded in a variable
-     length format and is not constrained to 32-bit or anything.
-
-  - The trailer records a pack checksum of all of the above.
-
-=== Object types
-
-Valid object types are:
-
-- OBJ_COMMIT (1)
-- OBJ_TREE (2)
-- OBJ_BLOB (3)
-- OBJ_TAG (4)
-- OBJ_OFS_DELTA (6)
-- OBJ_REF_DELTA (7)
-
-Type 5 is reserved for future expansion. Type 0 is invalid.
-
-=== Deltified representation
-
-Conceptually there are only four object types: commit, tree, tag and
-blob. However, to save space, an object could be stored as a "delta" of
-another "base" object. These representations are assigned new types
-ofs-delta and ref-delta, which are only valid in a pack file.
-
-Both ofs-delta and ref-delta store the "delta" to be applied to
-another object (called 'base object') to reconstruct the object. The
-difference between them is, ref-delta directly encodes base object
-name. If the base object is in the same pack, ofs-delta encodes
-the offset of the base object in the pack instead.
-
-The base object could also be deltified if it's in the same pack.
-Ref-delta can also refer to an object outside the pack (i.e. the
-so-called "thin pack"). When stored on disk, however, the pack should
-be self-contained to avoid cyclic dependency.
-
-The delta data is a sequence of instructions to reconstruct an object
-from the base object. If the base object is deltified, it must be
-converted to canonical form first. Each instruction appends more and
-more data to the target object until it's complete. There are two
-supported instructions so far: one for copying a byte range from the
-source object and one for inserting new data embedded in the
-instruction itself.
-
-Each instruction has variable length. Instruction type is determined
-by the seventh bit of the first octet. The following diagrams follow
-the convention in RFC 1951 (Deflate compressed data format).
-
-==== Instruction to copy from base object
-
-  +----------+---------+---------+---------+---------+-------+-------+-------+
-  | 1xxxxxxx | offset1 | offset2 | offset3 | offset4 | size1 | size2 | size3 |
-  +----------+---------+---------+---------+---------+-------+-------+-------+
-
-This is the instruction format to copy a byte range from the source
-object. It encodes the offset to copy from and the number of bytes to
-copy. Offset and size are in little-endian order.
-
-All offset and size bytes are optional. This is to reduce the
-instruction size when encoding small offsets or sizes. The first seven
-bits in the first octet determine which of the next seven octets are
-present. If bit zero is set, offset1 is present. If bit one is set
-offset2 is present and so on.
-
-Note that a more compact instruction does not change offset and size
-encoding. For example, if only offset2 is omitted like below, offset3
-still contains bits 16-23.
It does not become offset2 and contains -bits 8-15 even if it's right next to offset1. - - +----------+---------+---------+ - | 10000101 | offset1 | offset3 | - +----------+---------+---------+ - -In its most compact form, this instruction only takes up one byte -(0x80) with both offset and size omitted, which will have default -values zero. There is another exception: size zero is automatically -converted to 0x10000. - -==== Instruction to add new data - - +----------+============+ - | 0xxxxxxx | data | - +----------+============+ - -This is the instruction to construct target object without the base -object. The following data is appended to the target object. The first -seven bits of the first octet determines the size of data in -bytes. The size must be non-zero. - -==== Reserved instruction - - +----------+============ - | 00000000 | - +----------+============ - -This is the instruction reserved for future expansion. - -== Original (version 1) pack-*.idx files have the following format: - - - The header consists of 256 4-byte network byte order - integers. N-th entry of this table records the number of - objects in the corresponding pack, the first byte of whose - object name is less than or equal to N. This is called the - 'first-level fan-out' table. - - - The header is followed by sorted 24-byte entries, one entry - per object in the pack. Each entry is: - - 4-byte network byte order integer, recording where the - object is stored in the packfile as the offset from the - beginning. - - one object name of the appropriate size. - - - The file is concluded with a trailer: - - A copy of the pack checksum at the end of the corresponding - packfile. - - Index checksum of all of the above. - -Pack Idx file: - - -- +--------------------------------+ -fanout | fanout[0] = 2 (for example) |-. 
-table +--------------------------------+ | - | fanout[1] | | - +--------------------------------+ | - | fanout[2] | | - ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | - | fanout[255] = total objects |---. - -- +--------------------------------+ | | -main | offset | | | -index | object name 00XXXXXXXXXXXXXXXX | | | -table +--------------------------------+ | | - | offset | | | - | object name 00XXXXXXXXXXXXXXXX | | | - +--------------------------------+<+ | - .-| offset | | - | | object name 01XXXXXXXXXXXXXXXX | | - | +--------------------------------+ | - | | offset | | - | | object name 01XXXXXXXXXXXXXXXX | | - | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | - | | offset | | - | | object name FFXXXXXXXXXXXXXXXX | | - --| +--------------------------------+<--+ -trailer | | packfile checksum | - | +--------------------------------+ - | | idxfile checksum | - | +--------------------------------+ - .-------. - | -Pack file entry: <+ - - packed object header: - 1-byte size extension bit (MSB) - type (next 3 bit) - size0 (lower 4-bit) - n-byte sizeN (as long as MSB is set, each 7-bit) - size0..sizeN form 4+7+7+..+7 bit integer, size0 - is the least significant part, and sizeN is the - most significant part. - packed object data: - If it is not DELTA, then deflated bytes (the size above - is the size before compression). - If it is REF_DELTA, then - base object name (the size above is the - size of the delta data that follows). - delta data, deflated. - If it is OFS_DELTA, then - n-byte offset (see below) interpreted as a negative - offset from the type-byte of the header of the - ofs-delta entry (the size above is the size of - the delta data that follows). - delta data, deflated. - - offset encoding: - n bytes with MSB set in all but the last one. - The offset is then the number constructed by - concatenating the lower 7 bit of each byte, and - for n >= 2 adding 2^7 + 2^14 + ... + 2^(7*(n-1)) - to the result. 
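The packed object header and the OFS_DELTA offset encoding described above can be sketched in Python. This is a minimal illustration under the rules stated in the text, not Git's own code; the function names `decode_object_header` and `decode_ofs_delta_offset` are hypothetical.

```python
def decode_object_header(buf, pos=0):
    """Decode the type and size of a packed object starting at buf[pos].

    Returns (type, size, position of the first byte after the header).
    """
    byte = buf[pos]
    pos += 1
    obj_type = (byte >> 4) & 0x07   # 3-bit type follows the MSB
    size = byte & 0x0F              # size0: the lowest 4 bits of the size
    shift = 4
    while byte & 0x80:              # MSB set: another size byte follows
        byte = buf[pos]
        pos += 1
        size |= (byte & 0x7F) << shift  # each later byte adds 7 higher bits
        shift += 7
    return obj_type, size, pos


def decode_ofs_delta_offset(buf, pos=0):
    """Decode the n-byte OFS_DELTA offset starting at buf[pos].

    Returns (offset, position of the first byte after the offset).
    """
    byte = buf[pos]
    pos += 1
    offset = byte & 0x7F
    while byte & 0x80:              # MSB set in all but the last byte
        byte = buf[pos]
        pos += 1
        # Adding 1 before each shift contributes the 2^7 + 2^14 + ...
        # term that the text adds to the concatenated 7-bit groups.
        offset = ((offset + 1) << 7) | (byte & 0x7F)
    return offset, pos


if __name__ == "__main__":
    print(decode_object_header(bytes([0x95, 0x0A])))
    print(decode_ofs_delta_offset(bytes([0x80, 0x00])))
```

The added constant means every value has exactly one encoding: the two-byte form starts at 128, just past the largest one-byte value of 127. The decoded offset is then subtracted from the position of the ofs-delta entry's own header to find the base object.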
- - - -== Version 2 pack-*.idx files support packs larger than 4 GiB, and - have some other reorganizations. They have the format: - - - A 4-byte magic number '\377tOc' which is an unreasonable - fanout[0] value. - - - A 4-byte version number (= 2) - - - A 256-entry fan-out table just like v1. - - - A table of sorted object names. These are packed together - without offset values to reduce the cache footprint of the - binary search for a specific object name. - - - A table of 4-byte CRC32 values of the packed object data. - This is new in v2 so compressed data can be copied directly - from pack to pack during repacking without undetected - data corruption. - - - A table of 4-byte offset values (in network byte order). - These are usually 31-bit pack file offsets, but large - offsets are encoded as an index into the next table with - the msbit set. - - - A table of 8-byte offset entries (empty for pack files less - than 2 GiB). Pack files are organized with heavily used - objects toward the front, so most object references should - not need to refer to this table. - - - The same trailer as a v1 pack file: - - A copy of the pack checksum at the end of - corresponding packfile. - - Index checksum of all of the above. - -== multi-pack-index (MIDX) files have the following format: - -The multi-pack-index files refer to multiple pack-files and loose objects. - -In order to allow extensions that add extra data to the MIDX, we organize -the body into "chunks" and provide a lookup table at the beginning of the -body. The header includes certain length values, such as the number of packs, -the number of base MIDX files, hash lengths and types. - -All 4-byte numbers are in network order. - -HEADER: - - 4-byte signature: - The signature is: {'M', 'I', 'D', 'X'} - - 1-byte version number: - Git only writes or recognizes version 1. 
- - 1-byte Object Id Version - We infer the length of object IDs (OIDs) from this value: - 1 => SHA-1 - 2 => SHA-256 - If the hash type does not match the repository's hash algorithm, - the multi-pack-index file should be ignored with a warning - presented to the user. - - 1-byte number of "chunks" - - 1-byte number of base multi-pack-index files: - This value is currently always zero. - - 4-byte number of pack files - -CHUNK LOOKUP: - - (C + 1) * 12 bytes providing the chunk offsets: - First 4 bytes describe chunk id. Value 0 is a terminating label. - Other 8 bytes provide offset in current file for chunk to start. - (Chunks are provided in file-order, so you can infer the length - using the next chunk position if necessary.) - - The remaining data in the body is described one chunk at a time, and - these chunks may be given in any order. Chunks are required unless - otherwise specified. - -CHUNK DATA: - - Packfile Names (ID: {'P', 'N', 'A', 'M'}) - Stores the packfile names as concatenated, null-terminated strings. - Packfiles must be listed in lexicographic order for fast lookups by - name. This is the only chunk not guaranteed to be a multiple of four - bytes in length, so should be the last chunk for alignment reasons. - - OID Fanout (ID: {'O', 'I', 'D', 'F'}) - The ith entry, F[i], stores the number of OIDs with first - byte at most i. Thus F[255] stores the total - number of objects. - - OID Lookup (ID: {'O', 'I', 'D', 'L'}) - The OIDs for all objects in the MIDX are stored in lexicographic - order in this chunk. - - Object Offsets (ID: {'O', 'O', 'F', 'F'}) - Stores two 4-byte values for every object. - 1: The pack-int-id for the pack storing this object. - 2: The offset within the pack. - If all offsets are less than 2^32, then the large offset chunk - will not exist and offsets are stored as in IDX v1. 
- If there is at least one offset value larger than 2^32-1, then - the large offset chunk must exist, and offsets larger than - 2^31-1 must be stored in it instead. If the large offset chunk - exists and the 31st bit is on, then removing that bit reveals - the row in the large offsets containing the 8-byte offset of - this object. - - [Optional] Object Large Offsets (ID: {'L', 'O', 'F', 'F'}) - 8-byte offsets into large packfiles. - -TRAILER: - - Index checksum of the above contents. diff --git a/third_party/git/Documentation/technical/pack-heuristics.txt b/third_party/git/Documentation/technical/pack-heuristics.txt deleted file mode 100644 index 95a07db6e82b..000000000000 --- a/third_party/git/Documentation/technical/pack-heuristics.txt +++ /dev/null @@ -1,460 +0,0 @@ -Concerning Git's Packing Heuristics -=================================== - - Oh, here's a really stupid question: - - Where do I go - to learn the details - of Git's packing heuristics? - -Be careful what you ask! - -Followers of the Git, please open the Git IRC Log and turn to -February 10, 2006. - -It's a rare occasion, and we are joined by the King Git Himself, -Linus Torvalds (linus). Nathaniel Smith, (njs`), has the floor -and seeks enlightenment. Others are present, but silent. - -Let's listen in! - - <njs`> Oh, here's a really stupid question -- where do I go to - learn the details of Git's packing heuristics? google avails - me not, reading the source didn't help a lot, and wading - through the whole mailing list seems less efficient than any - of that. - -It is a bold start! A plea for help combined with a simultaneous -tri-part attack on some of the tried and true mainstays in the quest -for enlightenment. Brash accusations of google being useless. Hubris! -Maligning the source. Heresy! Disdain for the mailing list archives. -Woe. - - <pasky> yes, the packing-related delta stuff is somewhat - mysterious even for me ;) - -Ah! Modesty after all. - - <linus> njs, I don't think the docs exist. 
That's something where - I don't think anybody else than me even really got involved. - Most of the rest of Git others have been busy with (especially - Junio), but packing nobody touched after I did it. - -It's cryptic, yet vague. Linus in style for sure. Wise men -interpret this as an apology. A few argue it is merely a -statement of fact. - - <njs`> I guess the next step is "read the source again", but I - have to build up a certain level of gumption first :-) - -Indeed! On both points. - - <linus> The packing heuristic is actually really really simple. - -Bait... - - <linus> But strange. - -And switch. That ought to do it! - - <linus> Remember: Git really doesn't follow files. So what it does is - - generate a list of all objects - - sort the list according to magic heuristics - - walk the list, using a sliding window, seeing if an object - can be diffed against another object in the window - - write out the list in recency order - -The traditional understatement: - - <njs`> I suspect that what I'm missing is the precise definition of - the word "magic" - -The traditional insight: - - <pasky> yes - -And Babel-like confusion flowed. - - <njs`> oh, hmm, and I'm not sure what this sliding window means either - - <pasky> iirc, it appeared to me to be just the sha1 of the object - when reading the code casually ... - - ... which simply doesn't sound as a very good heuristics, though ;) - - <njs`> .....and recency order. okay, I think it's clear I didn't - even realize how much I wasn't realizing :-) - -Ah, grasshopper! And thus the enlightenment begins anew. - - <linus> The "magic" is actually in theory totally arbitrary. - ANY order will give you a working pack, but no, it's not - ordered by SHA-1. - - Before talking about the ordering for the sliding delta - window, let's talk about the recency order. That's more - important in one way. 
- - <njs`> Right, but if all you want is a working way to pack things - together, you could just use cat and save yourself some - trouble... - -Waaait for it.... - - <linus> The recency ordering (which is basically: put objects - _physically_ into the pack in the order that they are - "reachable" from the head) is important. - - <njs`> okay - - <linus> It's important because that's the thing that gives packs - good locality. It keeps the objects close to the head (whether - they are old or new, but they are _reachable_ from the head) - at the head of the pack. So packs actually have absolutely - _wonderful_ IO patterns. - -Read that again, because it is important. - - <linus> But recency ordering is totally useless for deciding how - to actually generate the deltas, so the delta ordering is - something else. - - The delta ordering is (wait for it): - - first sort by the "basename" of the object, as defined by - the name the object was _first_ reached through when - generating the object list - - within the same basename, sort by size of the object - - but always sort different types separately (commits first). - - That's not exactly it, but it's very close. - - <njs`> The "_first_ reached" thing is not too important, just you - need some way to break ties since the same objects may be - reachable many ways, yes? - -And as if to clarify: - - <linus> The point is that it's all really just any random - heuristic, and the ordering is totally unimportant for - correctness, but it helps a lot if the heuristic gives - "clumping" for things that are likely to delta well against - each other. - -It is an important point, so secretly, I did my own research and have -included my results below. To be fair, it has changed some over time. -And through the magic of Revisionistic History, I draw upon this entry -from The Git IRC Logs on my father's birthday, March 1: - - <gitster> The quote from the above linus should be rewritten a - bit (wait for it): - - first sort by type. 
Different objects never delta with - each other. - - then sort by filename/dirname. hash of the basename - occupies the top BITS_PER_INT-DIR_BITS bits, and bottom - DIR_BITS are for the hash of leading path elements. - - then if we are doing "thin" pack, the objects we are _not_ - going to pack but we know about are sorted earlier than - other objects. - - and finally sort by size, larger to smaller. - -In one swell-foop, clarification and obscurification! Nonetheless, -authoritative. Cryptic, yet concise. It even solicits notions of -quotes from The Source Code. Clearly, more study is needed. - - <gitster> That's the sort order. What this means is: - - we do not delta different object types. - - we prefer to delta the objects with the same full path, but - allow files with the same name from different directories. - - we always prefer to delta against objects we are not going - to send, if there are some. - - we prefer to delta against larger objects, so that we have - lots of removals. - - The penultimate rule is for "thin" packs. It is used when - the other side is known to have such objects. - -There it is again. "Thin" packs. I'm thinking to myself, "What -is a 'thin' pack?" So I ask: - - <jdl> What is a "thin" pack? - - <gitster> Use of --objects-edge to rev-list as the upstream of - pack-objects. The pack transfer protocol negotiates that. - -Woo hoo! Cleared that _right_ up! - - <gitster> There are two directions - push and fetch. - -There! Did you see it? It is not '"push" and "pull"'! How often the -confusion has started here. So casually mentioned, too! - - <gitster> For push, git-send-pack invokes git-receive-pack on the - other end. The receive-pack says "I have up to these commits". - send-pack looks at them, and computes what are missing from - the other end. So "thin" could be the default there. - - In the other direction, fetch, git-fetch-pack and - git-clone-pack invokes git-upload-pack on the other end - (via ssh or by talking to the daemon). 
- - There are two cases: fetch-pack with -k and clone-pack is one, - fetch-pack without -k is the other. clone-pack and fetch-pack - with -k will keep the downloaded packfile without expanded, so - we do not use thin pack transfer. Otherwise, the generated - pack will have delta without base object in the same pack. - - But fetch-pack without -k will explode the received pack into - individual objects, so we automatically ask upload-pack to - give us a thin pack if upload-pack supports it. - -OK then. - -Uh. - -Let's return to the previous conversation still in progress. - - <njs`> and "basename" means something like "the tail of end of - path of file objects and dir objects, as per basename(3), and - we just declare all commit and tag objects to have the same - basename" or something? - -Luckily, that too is a point that gitster clarified for us! - -If I might add, the trick is to make files that _might_ be similar be -located close to each other in the hash buckets based on their file -names. It used to be that "foo/Makefile", "bar/baz/quux/Makefile" and -"Makefile" all landed in the same bucket due to their common basename, -"Makefile". However, now they land in "close" buckets. - -The algorithm allows not just for the _same_ bucket, but for _close_ -buckets to be considered delta candidates. The rationale is -essentially that files, like Makefiles, often have very similar -content no matter what directory they live in. - - <linus> I played around with different delta algorithms, and with - making the "delta window" bigger, but having too big of a - sliding window makes it very expensive to generate the pack: - you need to compare every object with a _ton_ of other objects. 
- - There are a number of other trivial heuristics too, which - basically boil down to "don't bother even trying to delta this - pair" if we can tell before-hand that the delta isn't worth it - (due to size differences, where we can take a previous delta - result into account to decide that "ok, no point in trying - that one, it will be worse"). - - End result: packing is actually very size efficient. It's - somewhat CPU-wasteful, but on the other hand, since you're - really only supposed to do it maybe once a month (and you can - do it during the night), nobody really seems to care. - -Nice Engineering Touch, there. Find when it doesn't matter, and -proclaim it a non-issue. Good style too! - - <njs`> So, just to repeat to see if I'm following, we start by - getting a list of the objects we want to pack, we sort it by - this heuristic (basically lexicographically on the tuple - (type, basename, size)). - - Then we walk through this list, and calculate a delta of - each object against the last n (tunable parameter) objects, - and pick the smallest of these deltas. - -Vastly simplified, but the essence is there! - - <linus> Correct. - - <njs`> And then once we have picked a delta or fulltext to - represent each object, we re-sort by recency, and write them - out in that order. - - <linus> Yup. Some other small details: - -And of course there is the "Other Shoe" Factor too. - - <linus> - We limit the delta depth to another magic value (right - now both the window and delta depth magic values are just "10") - - <njs`> Hrm, my intuition is that you'd end up with really _bad_ IO - patterns, because the things you want are near by, but to - actually reconstruct them you may have to jump all over in - random ways. - - <linus> - When we write out a delta, and we haven't yet written - out the object it is a delta against, we write out the base - object first. 
And no, when we reconstruct them, we actually - get nice IO patterns, because: - - larger objects tend to be "more recent" (Linus' law: files grow) - - we actively try to generate deltas from a larger object to a - smaller one - - this means that the top-of-tree very seldom has deltas - (i.e. deltas in _practice_ are "backwards deltas") - -Again, we should reread that whole paragraph. Not just because -Linus has slipped Linus's Law in there on us, but because it is -important. Let's make sure we clarify some of the points here: - - <njs`> So the point is just that in practice, delta order and - recency order match each other quite well. - - <linus> Yes. There's another nice side to this (and yes, it was - designed that way ;): - - the reason we generate deltas against the larger object is - actually a big space saver too! - - <njs`> Hmm, but your last comment (if "we haven't yet written out - the object it is a delta against, we write out the base object - first"), seems like it would make these facts mostly - irrelevant because even if in practice you would not have to - wander around much, in fact you just brute-force say that in - the cases where you might have to wander, don't do that :-) - - <linus> Yes and no. Notice the rule: we only write out the base - object first if the delta against it was more recent. That - means that you can actually have deltas that refer to a base - object that is _not_ close to the delta object, but that only - happens when the delta is needed to generate an _old_ object. - - <linus> See? - -Yeah, no. I missed that on the first two or three readings myself. - - <linus> This keeps the front of the pack dense. The front of the - pack never contains data that isn't relevant to a "recent" - object. The size optimization comes from our use of xdelta - (but is true for many other delta algorithms): removing data - is cheaper (in size) than adding data. - - When you remove data, you only need to say "copy bytes n--m". 
- In contrast, in a delta that _adds_ data, you have to say "add - these bytes: 'actual data goes here'" - - *** njs` has quit: Read error: 104 (Connection reset by peer) - - <linus> Uhhuh. I hope I didn't blow njs` mind. - - *** njs` has joined channel #git - - <pasky> :) - -The silent observers are amused. Of course. - -And as if njs` was expected to be omniscient: - - <linus> njs - did you miss anything? - -OK, I'll spell it out. That's Geek Humor. If njs` was not actually -connected for a little bit there, how would he know if missed anything -while he was disconnected? He's a benevolent dictator with a sense of -humor! Well noted! - - <njs`> Stupid router. Or gremlins, or whatever. - -It's a cheap shot at Cisco. Take 'em when you can. - - <njs`> Yes and no. Notice the rule: we only write out the base - object first if the delta against it was more recent. - - I'm getting lost in all these orders, let me re-read :-) - So the write-out order is from most recent to least recent? - (Conceivably it could be the opposite way too, I'm not sure if - we've said) though my connection back at home is logging, so I - can just read what you said there :-) - -And for those of you paying attention, the Omniscient Trick has just -been detailed! - - <linus> Yes, we always write out most recent first - - <njs`> And, yeah, I got the part about deeper-in-history stuff - having worse IO characteristics, one sort of doesn't care. - - <linus> With the caveat that if the "most recent" needs an older - object to delta against (hey, shrinking sometimes does - happen), we write out the old object with the delta. - - <njs`> (if only it happened more...) - - <linus> Anyway, the pack-file could easily be denser still, but - because it's used both for streaming (the Git protocol) and - for on-disk, it has a few pessimizations. - -Actually, it is a made-up word. 
But it is a made-up word being -used as setup for a later optimization, which is a real word: - - <linus> In particular, while the pack-file is then compressed, - it's compressed just one object at a time, so the actual - compression factor is less than it could be in theory. But it - means that it's all nice random-access with a simple index to - do "object name->location in packfile" translation. - - <njs`> I'm assuming the real win for delta-ing large->small is - more homogeneous statistics for gzip to run over? - - (You have to put the bytes in one place or another, but - putting them in a larger blob wins on compression) - - Actually, what is the compression strategy -- each delta - individually gzipped, the whole file gzipped, somewhere in - between, no compression at all, ....? - - Right. - -Reality IRC sets in. For example: - - <pasky> I'll read the rest in the morning, I really have to go - sleep or there's no hope whatsoever for me at the today's - exam... g'nite all. - -Heh. - - <linus> pasky: g'nite - - <njs`> pasky: 'luck - - <linus> Right: large->small matters exactly because of compression - behaviour. If it was non-compressed, it probably wouldn't make - any difference. - - <njs`> yeah - - <linus> Anyway: I'm not even trying to claim that the pack-files - are perfect, but they do tend to have a nice balance of - density vs ease-of use. - -Gasp! OK, saved. That's a fair Engineering trade off. Close call! -In fact, Linus reflects on some Basic Engineering Fundamentals, -design options, etc. - - <linus> More importantly, they allow Git to still _conceptually_ - never deal with deltas at all, and be a "whole object" store. - - Which has some problems (we discussed bad huge-file - behaviour on the Git lists the other day), but it does mean - that the basic Git concepts are really really simple and - straightforward. - - It's all been quite stable. 
- - Which I think is very much a result of having very simple - basic ideas, so that there's never any confusion about what's - going on. - - Bugs happen, but they are "simple" bugs. And bugs that - actually get some object store detail wrong are almost always - so obvious that they never go anywhere. - - <njs`> Yeah. - -Nuff said. - - <linus> Anyway. I'm off for bed. It's not 6AM here, but I've got - three kids, and have to get up early in the morning to send - them off. I need my beauty sleep. - - <njs`> :-) - - <njs`> appreciate the infodump, I really was failing to find the - details on Git packs :-) - -And now you know the rest of the story. diff --git a/third_party/git/Documentation/technical/pack-protocol.txt b/third_party/git/Documentation/technical/pack-protocol.txt deleted file mode 100644 index e13a2c064d12..000000000000 --- a/third_party/git/Documentation/technical/pack-protocol.txt +++ /dev/null @@ -1,709 +0,0 @@ -Packfile transfer protocols -=========================== - -Git supports transferring data in packfiles over the ssh://, git://, http:// and -file:// transports. There exist two sets of protocols, one for pushing -data from a client to a server and another for fetching data from a -server to a client. The three transports (ssh, git, file) use the same -protocol to transfer data. http is documented in http-protocol.txt. - -The processes invoked in the canonical Git implementation are 'upload-pack' -on the server side and 'fetch-pack' on the client side for fetching data; -then 'receive-pack' on the server and 'send-pack' on the client for pushing -data. The protocol functions to have a server tell a client what is -currently on the server, then for the two to negotiate the smallest amount -of data to send in order to fully update one or the other. - -pkt-line Format ---------------- - -The descriptions below build on the pkt-line format described in -protocol-common.txt. 
When the grammar indicates `PKT-LINE(...)`, unless -otherwise noted the usual pkt-line LF rules apply: the sender SHOULD -include a LF, but the receiver MUST NOT complain if it is not present. - -An error packet is a special pkt-line that contains an error string. - ----- - error-line = PKT-LINE("ERR" SP explanation-text) ----- - -Throughout the protocol, where `PKT-LINE(...)` is expected, an error packet MAY -be sent. Once this packet is sent by a client or a server, the data transfer -process defined in this protocol is terminated. - -Transports ----------- -There are three transports over which the packfile protocol is -initiated. The Git transport is a simple, unauthenticated server that -takes the command (almost always 'upload-pack', though Git -servers can be configured to be globally writable, in which case 'receive- -pack' initiation is also allowed) with which the client wishes to -communicate, executes it, and connects it to the requesting -process. - -In the SSH transport, the client just runs the 'upload-pack' -or 'receive-pack' process on the server over the SSH protocol and then -communicates with that invoked process over the SSH connection. - -The file:// transport runs the 'upload-pack' or 'receive-pack' -process locally and communicates with it over a pipe. - -Extra Parameters ----------------- - -The protocol provides a mechanism by which clients can send additional -information in their first message to the server. These are called "Extra -Parameters", and are supported by the Git, SSH, and HTTP protocols. - -Each Extra Parameter takes the form of `<key>=<value>` or `<key>`. - -Servers that receive any such Extra Parameters MUST ignore all -unrecognized keys. Currently, the only Extra Parameter recognized is -"version" with a value of '1' or '2'. See protocol-v2.txt for more -information on protocol version 2.
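As a rough illustration of the Extra Parameter rules above, the following Python sketch parses `<key>=<value>` and bare `<key>` entries and drops unrecognized keys. The names `parse_extra_parameters` and `KNOWN_KEYS` are hypothetical, chosen for this example only; they are not part of Git.

```python
# Hypothetical helper, not part of Git: illustrates the Extra Parameter rules.
KNOWN_KEYS = {"version"}  # the only key the text above says is recognized


def parse_extra_parameters(params):
    """Parse '<key>=<value>' or '<key>' strings, ignoring unrecognized keys."""
    result = {}
    for param in params:
        key, sep, value = param.partition("=")
        if key not in KNOWN_KEYS:
            continue  # servers MUST ignore unrecognized keys
        # A bare '<key>' has no '=' separator, so it carries no value.
        result[key] = value if sep else None
    return result


if __name__ == "__main__":
    print(parse_extra_parameters(["version=2", "frobnicate", "color=blue"]))
```

Silently skipping unknown keys, rather than rejecting the request, is what lets a newer client talk to an older server.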
- -Git Transport -------------- - -The Git transport starts off by sending the command and repository -on the wire using the pkt-line format, followed by a NUL byte and a -hostname parameter, terminated by a NUL byte. - - 0033git-upload-pack /project.git\0host=myserver.com\0 - -The transport may send Extra Parameters by adding an additional NUL -byte, and then adding one or more NUL-terminated strings: - - 003egit-upload-pack /project.git\0host=myserver.com\0\0version=1\0 - --- - git-proto-request = request-command SP pathname NUL - [ host-parameter NUL ] [ NUL extra-parameters ] - request-command = "git-upload-pack" / "git-receive-pack" / - "git-upload-archive" ; case sensitive - pathname = *( %x01-ff ) ; exclude NUL - host-parameter = "host=" hostname [ ":" port ] - extra-parameters = 1*extra-parameter - extra-parameter = 1*( %x01-ff ) NUL --- - -host-parameter is used for the -git-daemon name based virtual hosting. See --interpolated-path -option to git daemon, with the %H/%CH format characters. - -Basically what the Git client is doing to connect to an 'upload-pack' -process on the server side over the Git protocol is this: - - $ echo -e -n \ - "003agit-upload-pack /schacon/gitbook.git\0host=example.com\0" | - nc -v example.com 9418 - - -SSH Transport -------------- - -Initiating the upload-pack or receive-pack processes over SSH is -executing the binary on the server via SSH remote execution. -It is basically equivalent to running this: - - $ ssh git.example.com "git-upload-pack '/project.git'" - -For a server to support Git pushing and pulling for a given user over -SSH, that user needs to be able to execute one or both of those -commands via the SSH shell that they are provided on login. On some -systems, that shell access is limited to only being able to run those -two commands, or even just one of them. 
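Returning to the Git transport request described above, a minimal Python sketch of how such a request could be framed is shown here. The function names `pkt_line` and `git_proto_request` are assumptions made for this example, not Git's own API; the 65520-byte limit is the maximum pkt-line size mentioned later in this document.

```python
def pkt_line(payload: bytes) -> bytes:
    """Frame a payload as a pkt-line: 4 hex digits of total length, then data."""
    length = len(payload) + 4  # the length field counts its own 4 bytes
    if length > 65520:
        raise ValueError("pkt-line too long")
    return f"{length:04x}".encode() + payload


def git_proto_request(command, pathname, host, extra=()):
    """Build a git-proto-request (hypothetical helper, following the grammar above)."""
    payload = command.encode() + b" " + pathname.encode() + b"\0"
    payload += b"host=" + host.encode() + b"\0"
    if extra:
        # An additional NUL introduces the NUL-terminated Extra Parameters.
        payload += b"\0" + b"".join(p.encode() + b"\0" for p in extra)
    return pkt_line(payload)


if __name__ == "__main__":
    print(git_proto_request("git-upload-pack", "/project.git", "myserver.com"))
    print(git_proto_request("git-upload-pack", "/project.git", "myserver.com",
                            extra=["version=1"]))
```

The two calls reproduce the `0033...` and `003e...` request lines shown in the Git Transport section.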
- -In an ssh:// format URI, it's absolute in the URI, so the '/' after -the host name (or port number) is sent as an argument, which is then -read by the remote git-upload-pack exactly as is, so it's effectively -an absolute path in the remote filesystem. - - git clone ssh://user@example.com/project.git - | - v - ssh user@example.com "git-upload-pack '/project.git'" - -In a "user@host:path" format URI, it's relative to the user's home -directory, because the Git client will run: - - git clone user@example.com:project.git - | - v - ssh user@example.com "git-upload-pack 'project.git'" - -The exception is if a '~' is used, in which case -we execute it without the leading '/'. - - ssh://user@example.com/~alice/project.git - | - v - ssh user@example.com "git-upload-pack '~alice/project.git'" - -Depending on the value of the `protocol.version` configuration variable, -Git may attempt to send Extra Parameters as a colon-separated string in -the GIT_PROTOCOL environment variable. This is done only if -the `ssh.variant` configuration variable indicates that the ssh command -supports passing environment variables as an argument. - -A few things to remember here: - -- The "command name" is spelled with a dash (e.g. git-upload-pack), but - this can be overridden by the client; - -- The repository path is always quoted with single quotes. - -Fetching Data From a Server ---------------------------- - -When one Git repository wants to get data that a second repository -has, the first can 'fetch' from the second. This operation determines -what data the server has that the client does not, then streams that -data down to the client in packfile format. - - -Reference Discovery -------------------- - -When the client initially connects, the server will immediately respond -with a version number (if "version=1" is sent as an Extra Parameter), -and a listing of each reference it has (all branches and tags) along -with the object name that each reference currently points to.
- - $ echo -e -n "0045git-upload-pack /schacon/gitbook.git\0host=example.com\0\0version=1\0" | - nc -v example.com 9418 - 000eversion 1 - 00887217a7c7e582c46cec22a130adf4b9d7d950fba0 HEAD\0multi_ack thin-pack - side-band side-band-64k ofs-delta shallow no-progress include-tag - 00441d3fcd5ced445d1abc402225c0b8a1299641f497 refs/heads/integration - 003f7217a7c7e582c46cec22a130adf4b9d7d950fba0 refs/heads/master - 003cb88d2441cac0977faf98efc80305012112238d9d refs/tags/v0.9 - 003c525128480b96c89e6418b1e40909bf6c5b2d580f refs/tags/v1.0 - 003fe92df48743b7bc7d26bcaabfddde0a1e20cae47c refs/tags/v1.0^{} - 0000 - -The returned response is a pkt-line stream describing each ref and -its current value. The stream MUST be sorted by name according to -the C locale ordering. - -If HEAD is a valid ref, HEAD MUST appear as the first advertised -ref. If HEAD is not a valid ref, HEAD MUST NOT appear in the -advertisement list at all, but other refs may still appear. - -The stream MUST include capability declarations behind a NUL on the -first ref. The peeled value of a ref (that is "ref^{}") MUST be -immediately after the ref itself, if presented. A conforming server -MUST peel the ref if it's an annotated tag. - ----- - advertised-refs = *1("version 1") - (no-refs / list-of-refs) - *shallow - flush-pkt - - no-refs = PKT-LINE(zero-id SP "capabilities^{}" - NUL capability-list) - - list-of-refs = first-ref *other-ref - first-ref = PKT-LINE(obj-id SP refname - NUL capability-list) - - other-ref = PKT-LINE(other-tip / other-peeled) - other-tip = obj-id SP refname - other-peeled = obj-id SP refname "^{}" - - shallow = PKT-LINE("shallow" SP obj-id) - - capability-list = capability *(SP capability) - capability = 1*(LC_ALPHA / DIGIT / "-" / "_") - LC_ALPHA = %x61-7A ----- - -Server and client MUST use lowercase for obj-id, both MUST treat obj-id -as case-insensitive. - -See protocol-capabilities.txt for a list of allowed server capabilities -and descriptions. 
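To make the advertisement grammar above more concrete, here is a hedged Python sketch of a reader for it. The helpers `pkt_line`, `read_pkt_lines` and `parse_advertisement` are hypothetical names invented for this example. The sketch does not enforce the ordering rules (HEAD first, C-locale sort), and it assumes the whole advertisement is available as a byte stream.

```python
import io


def pkt_line(payload: bytes) -> bytes:
    """Frame a payload as a pkt-line (used here only to build sample input)."""
    return f"{len(payload) + 4:04x}".encode() + payload


def read_pkt_lines(stream):
    """Yield pkt-line payloads until a flush-pkt ('0000') is read."""
    while True:
        header = stream.read(4)
        if len(header) < 4:
            raise EOFError("stream ended before flush-pkt")
        length = int(header, 16)
        if length == 0:
            return  # flush-pkt ends the advertisement
        if length < 4:
            raise ValueError("invalid pkt-line length")
        yield stream.read(length - 4)


def parse_advertisement(stream):
    """Return (refs, capabilities, shallow) from an advertised-refs stream."""
    refs, capabilities, shallow = {}, [], []
    for payload in read_pkt_lines(stream):
        line = payload.rstrip(b"\n")  # the trailing LF is optional
        if line.startswith(b"version "):
            continue
        if line.startswith(b"shallow "):
            shallow.append(line[len(b"shallow "):].decode())
            continue
        # Capabilities follow a NUL on the first ref line.
        line, nul, caps = line.partition(b"\0")
        if nul:
            capabilities = caps.decode().split(" ")
        obj_id, _, refname = line.partition(b" ")
        if refname != b"capabilities^{}":  # placeholder used when there are no refs
            refs[refname.decode()] = obj_id.decode()
    return refs, capabilities, shallow


if __name__ == "__main__":
    oid = "7217a7c7e582c46cec22a130adf4b9d7d950fba0"
    sample = (pkt_line(b"version 1\n")
              + pkt_line(oid.encode() + b" HEAD\0multi_ack thin-pack\n")
              + pkt_line(oid.encode() + b" refs/heads/master\n")
              + b"0000")
    print(parse_advertisement(io.BytesIO(sample)))
```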
- -Packfile Negotiation --------------------- -After reference and capabilities discovery, the client can decide to -terminate the connection by sending a flush-pkt, telling the server it can -now gracefully terminate, and disconnect, when it does not need any pack -data. This can happen with the ls-remote command, and also can happen when -the client already is up to date. - -Otherwise, it enters the negotiation phase, where the client and -server determine what the minimal packfile necessary for transport is, -by the client telling the server what objects it wants, its shallow objects -(if any), and the maximum commit depth it wants (if any). With the first -'want' line, the client will also send a list of the capabilities it wants -to be in effect, out of what the server said it could do. - ----- - upload-request = want-list - *shallow-line - *1depth-request - [filter-request] - flush-pkt - - want-list = first-want - *additional-want - - shallow-line = PKT-LINE("shallow" SP obj-id) - - depth-request = PKT-LINE("deepen" SP depth) / - PKT-LINE("deepen-since" SP timestamp) / - PKT-LINE("deepen-not" SP ref) - - first-want = PKT-LINE("want" SP obj-id SP capability-list) - additional-want = PKT-LINE("want" SP obj-id) - - depth = 1*DIGIT - - filter-request = PKT-LINE("filter" SP filter-spec) ----- - -Clients MUST send all the obj-ids they want from the reference -discovery phase as 'want' lines. Clients MUST send at least one -'want' command in the request body. Clients MUST NOT mention an -obj-id in a 'want' command which did not appear in the response -obtained through ref discovery. - -The client MUST write all obj-ids which it only has shallow copies -of (meaning that it does not have the parents of a commit) as -'shallow' lines so that the server is aware of the limitations of -the client's history.
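The `upload-request` grammar above can be sketched in Python as follows. This is an illustrative helper under stated assumptions, not Git's implementation: `upload_request` and its parameters are hypothetical names, only the plain `deepen` form of the depth request is covered, and no filter-request is emitted.

```python
def pkt_line(payload: bytes) -> bytes:
    """Frame a payload as a pkt-line: 4 hex digits of total length, then data."""
    return f"{len(payload) + 4:04x}".encode() + payload


def upload_request(wants, capabilities=(), shallows=(), depth=None):
    """Build want/shallow/deepen lines followed by a flush-pkt (hypothetical helper)."""
    if not wants:
        raise ValueError("clients MUST send at least one 'want'")
    first, *rest = wants
    caps = " ".join(capabilities)
    # Capabilities ride on the first 'want' line only.
    first_line = f"want {first} {caps}\n" if caps else f"want {first}\n"
    out = [pkt_line(first_line.encode())]
    out += [pkt_line(f"want {oid}\n".encode()) for oid in rest]
    out += [pkt_line(f"shallow {oid}\n".encode()) for oid in shallows]
    if depth:  # a depth of 0 is the same as making no depth request
        out.append(pkt_line(f"deepen {depth}\n".encode()))
    out.append(b"0000")  # flush-pkt: done sending the list
    return b"".join(out)


if __name__ == "__main__":
    request = upload_request(
        ["74730d410fcb6603ace96f1dc55ea6196122532d",
         "7d1665144a3a975c05f1f43902ddaf084e784dbe"],
        capabilities=["multi_ack", "side-band-64k", "ofs-delta"],
    )
    print(request.decode())
```

The length prefixes this produces (`0054` for the first want with capabilities, `0032` for the others) match the clone example given later in this document.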
- -The client now sends the maximum commit history depth it wants for -this transaction, which is the number of commits it wants from the -tip of the history, if any, as a 'deepen' line. A depth of 0 is the -same as not making a depth request. The client does not want to receive -any commits beyond this depth, nor does it want objects needed only to -complete those commits. Commits whose parents are not received as a -result are defined as shallow and marked as such in the server. This -information is sent back to the client in the next step. - -The client can optionally request that pack-objects omit various -objects from the packfile using one of several filtering techniques. -These are intended for use with partial clone and partial fetch -operations. An object that does not meet a filter-spec value is -omitted unless explicitly requested in a 'want' line. See `rev-list` -for possible filter-spec values. - -Once all the 'want's and 'shallow's (and optional 'deepen') are -transferred, clients MUST send a flush-pkt, to tell the server side -that it is done sending the list. - -If the client sent a positive depth request, the server -will determine which commits will and will not be shallow and -send this information to the client. If the client did not request -a positive depth, this step is skipped. - ----- - shallow-update = *shallow-line - *unshallow-line - flush-pkt - - shallow-line = PKT-LINE("shallow" SP obj-id) - - unshallow-line = PKT-LINE("unshallow" SP obj-id) ----- - -If the client has requested a positive depth, the server will compute -the set of commits which are no deeper than the desired depth. The set -of commits starts at the client's wants. - -The server writes a 'shallow' line for each -commit whose parents will not be sent as a result. The server writes -an 'unshallow' line for each commit which the client has indicated is -shallow, but is no longer shallow at the currently requested depth -(that is, its parents will now be sent).
The server MUST NOT mark -as unshallow anything which the client has not indicated was shallow. - -Now the client will send a list of the obj-ids it has using 'have' -lines, so the server can make a packfile that only contains the objects -that the client needs. In multi_ack mode, the canonical implementation -will send up to 32 of these at a time, then will send a flush-pkt. The -canonical implementation will skip ahead and send the next 32 immediately, -so that there is always a block of 32 "in-flight on the wire" at a time. - ----- - upload-haves = have-list - compute-end - - have-list = *have-line - have-line = PKT-LINE("have" SP obj-id) - compute-end = flush-pkt / PKT-LINE("done") ----- - -If the server reads 'have' lines, it then will respond by ACKing any -of the obj-ids the client said it had that the server also has. The -server will ACK obj-ids differently depending on which ack mode is -chosen by the client. - -In multi_ack mode: - - * the server will respond with 'ACK obj-id continue' for any common - commits. - - * once the server has found an acceptable common base commit and is - ready to make a packfile, it will blindly ACK all 'have' obj-ids - back to the client. - - * the server will then send a 'NAK' and then wait for another response - from the client - either a 'done' or another list of 'have' lines. - -In multi_ack_detailed mode: - - * the server will differentiate the ACKs where it is signaling - that it is ready to send data with 'ACK obj-id ready' lines, and - signals the identified common commits with 'ACK obj-id common' lines. - -Without either multi_ack or multi_ack_detailed: - - * upload-pack sends "ACK obj-id" on the first common object it finds. - After that it says nothing until the client gives it a "done". - - * upload-pack sends "NAK" on a flush-pkt if no common object - has been found yet. If one has been found, and thus an ACK - was already sent, it's silent on the flush-pkt. 
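The ACK and NAK responses described in the modes above could be classified with a small parser like the one below. This is only a sketch: `parse_ack` is a hypothetical name, and the three status words are the ones named in the multi_ack and multi_ack_detailed descriptions.

```python
ACK_STATUSES = {"continue", "common", "ready"}


def parse_ack(payload: bytes):
    """Return (kind, obj_id, status) for one server response payload.

    kind is 'ACK' or 'NAK'; status is None for a plain 'ACK obj-id'.
    """
    line = payload.rstrip(b"\n").decode()
    if line == "NAK":
        return ("NAK", None, None)
    parts = line.split(" ")
    if parts[0] != "ACK" or len(parts) not in (2, 3):
        raise ValueError(f"unexpected response: {line!r}")
    status = parts[2] if len(parts) == 3 else None
    if status is not None and status not in ACK_STATUSES:
        raise ValueError(f"unknown ACK status: {status!r}")
    return ("ACK", parts[1], status)


if __name__ == "__main__":
    print(parse_ack(b"ACK 7e47fe2bd8d01d481f44d7af0531bd93d3b21c01 continue\n"))
    print(parse_ack(b"NAK\n"))
```

A client loop would typically keep sending 'have' batches until it sees a 'ready' status (in multi_ack_detailed mode) or decides to give up, then send 'done'.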
- -After the client has gotten enough ACK responses that it can determine -that the server has enough information to send an efficient packfile -(in the canonical implementation, this is determined when it has received -enough ACKs that it can color everything left in the --date-order queue -as common with the server, or the --date-order queue is empty), or the -client determines that it wants to give up (in the canonical implementation, -this is determined when the client sends 256 'have' lines without getting -any of them ACKed by the server - meaning there is nothing in common and -the server should just send all of its objects), then the client will send -a 'done' command. The 'done' command signals to the server that the client -is ready to receive its packfile data. - -However, the 256 limit *only* turns on in the canonical client -implementation if we have received at least one "ACK %s continue" -during a prior round. This helps to ensure that at least one common -ancestor is found before we give up entirely. - -Once the 'done' line is read from the client, the server will either -send a final 'ACK obj-id' or it will send a 'NAK'. 'obj-id' is the object -name of the last commit determined to be common. The server only sends -ACK after 'done' if there is at least one common base and multi_ack or -multi_ack_detailed is enabled. The server always sends NAK after 'done' -if there is no common base found. - -Instead of 'ACK' or 'NAK', the server may send an error message (for -example, if it does not recognize an object in a 'want' line received -from the client). - -Then the server will start sending its packfile data. 
- ----- - server-response = *ack_multi ack / nak - ack_multi = PKT-LINE("ACK" SP obj-id ack_status) - ack_status = "continue" / "common" / "ready" - ack = PKT-LINE("ACK" SP obj-id) - nak = PKT-LINE("NAK") ----- - -A simple clone may look like this (with no 'have' lines): - ----- - C: 0054want 74730d410fcb6603ace96f1dc55ea6196122532d multi_ack \ - side-band-64k ofs-delta\n - C: 0032want 7d1665144a3a975c05f1f43902ddaf084e784dbe\n - C: 0032want 5a3f6be755bbb7deae50065988cbfa1ffa9ab68a\n - C: 0032want 7e47fe2bd8d01d481f44d7af0531bd93d3b21c01\n - C: 0032want 74730d410fcb6603ace96f1dc55ea6196122532d\n - C: 0000 - C: 0009done\n - - S: 0008NAK\n - S: [PACKFILE] ----- - -An incremental update (fetch) response might look like this: - ----- - C: 0054want 74730d410fcb6603ace96f1dc55ea6196122532d multi_ack \ - side-band-64k ofs-delta\n - C: 0032want 7d1665144a3a975c05f1f43902ddaf084e784dbe\n - C: 0032want 5a3f6be755bbb7deae50065988cbfa1ffa9ab68a\n - C: 0000 - C: 0032have 7e47fe2bd8d01d481f44d7af0531bd93d3b21c01\n - C: [30 more have lines] - C: 0032have 74730d410fcb6603ace96f1dc55ea6196122532d\n - C: 0000 - - S: 003aACK 7e47fe2bd8d01d481f44d7af0531bd93d3b21c01 continue\n - S: 003aACK 74730d410fcb6603ace96f1dc55ea6196122532d continue\n - S: 0008NAK\n - - C: 0009done\n - - S: 0031ACK 74730d410fcb6603ace96f1dc55ea6196122532d\n - S: [PACKFILE] ----- - - -Packfile Data -------------- - -Now that the client and server have finished negotiation about what -the minimal amount of data that needs to be sent to the client is, the server -will construct and send the required data in packfile format. - -See pack-format.txt for what the packfile itself actually looks like. - -If 'side-band' or 'side-band-64k' capabilities have been specified by -the client, the server will send the packfile data multiplexed. - -Each packet starting with the packet-line length of the amount of data -that follows, followed by a single byte specifying the sideband the -following data is coming in on. 
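A minimal Python sketch of demultiplexing such packets is shown below. It takes pkt-line payloads that have already been stripped of their length prefix. `demux_sideband` is a hypothetical name, and the sketch assumes the band indicator is the raw byte value 1, 2 or 3 (not an ASCII digit), with the band meanings (pack data, progress, error) as listed in the paragraphs that follow.

```python
def demux_sideband(payloads):
    """Split side-band payloads into (pack_bytes, progress_messages).

    Raises RuntimeError when the server reports an error on band 3.
    """
    pack = bytearray()
    progress = []
    for payload in payloads:
        if not payload:
            continue  # nothing to demultiplex
        band, data = payload[0], payload[1:]
        if band == 1:
            pack += data  # packfile data
        elif band == 2:
            progress.append(data)  # progress text, usually shown on stderr
        elif band == 3:
            raise RuntimeError(data.decode(errors="replace"))
        else:
            raise ValueError(f"unknown sideband {band}")
    return bytes(pack), progress


if __name__ == "__main__":
    pack, progress = demux_sideband(
        [b"\x01PACK", b"\x02Counting objects\n", b"\x01...rest of pack..."])
    print(len(pack), progress)
```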
- -In 'side-band' mode, it will send up to 999 data bytes plus 1 control -code, for a total of up to 1000 bytes in a pkt-line. In 'side-band-64k' -mode it will send up to 65519 data bytes plus 1 control code, for a -total of up to 65520 bytes in a pkt-line. - -The sideband byte will be a '1', '2' or a '3'. Sideband '1' will contain -packfile data, sideband '2' will be used for progress information that the -client will generally print to stderr and sideband '3' is used for error -information. - -If no 'side-band' capability was specified, the server will stream the -entire packfile without multiplexing. - - -Pushing Data To a Server ------------------------- - -Pushing data to a server will invoke the 'receive-pack' process on the -server, which will allow the client to tell it which references it should -update and then send all the data the server will need for those new -references to be complete. Once all the data is received and validated, -the server will then update its references to what the client specified. - -Authentication --------------- - -The protocol itself contains no authentication mechanisms. That is to be -handled by the transport, such as SSH, before the 'receive-pack' process is -invoked. If 'receive-pack' is configured over the Git transport, those -repositories will be writable by anyone who can access that port (9418) as -that transport is unauthenticated. - -Reference Discovery -------------------- - -The reference discovery phase is done nearly the same way as it is in the -fetching protocol. Each reference obj-id and name on the server is sent -in packet-line format to the client, followed by a flush-pkt. The only -real difference is that the capability listing is different - the only -possible values are 'report-status', 'report-status-v2', 'delete-refs', -'ofs-delta', 'atomic' and 'push-options'. 
- -Reference Update Request and Packfile Transfer ----------------------------------------------- - -Once the client knows what references the server is at, it can send a -list of reference update requests. For each reference on the server -that it wants to update, it sends a line listing the obj-id currently on -the server, the obj-id the client would like to update it to and the name -of the reference. - -This list is followed by a flush-pkt. - ----- - update-requests = *shallow ( command-list | push-cert ) - - shallow = PKT-LINE("shallow" SP obj-id) - - command-list = PKT-LINE(command NUL capability-list) - *PKT-LINE(command) - flush-pkt - - command = create / delete / update - create = zero-id SP new-id SP name - delete = old-id SP zero-id SP name - update = old-id SP new-id SP name - - old-id = obj-id - new-id = obj-id - - push-cert = PKT-LINE("push-cert" NUL capability-list LF) - PKT-LINE("certificate version 0.1" LF) - PKT-LINE("pusher" SP ident LF) - PKT-LINE("pushee" SP url LF) - PKT-LINE("nonce" SP nonce LF) - *PKT-LINE("push-option" SP push-option LF) - PKT-LINE(LF) - *PKT-LINE(command LF) - *PKT-LINE(gpg-signature-lines LF) - PKT-LINE("push-cert-end" LF) - - push-option = 1*( VCHAR | SP ) ----- - -If the server has advertised the 'push-options' capability and the client has -specified 'push-options' as part of the capability list above, the client then -sends its push options followed by a flush-pkt. - ----- - push-options = *PKT-LINE(push-option) flush-pkt ----- - -For backwards compatibility with older Git servers, if the client sends a push -cert and push options, it MUST send its push options both embedded within the -push cert and after the push cert. (Note that the push options within the cert -are prefixed, but the push options after the cert are not.) Both these lists -MUST be the same, modulo the prefix. - -After that the packfile that -should contain all the objects that the server will need to complete the new -references will be sent. 
- ---- - packfile = "PACK" 28*(OCTET) ----- - -If the receiving end does not support delete-refs, the sending end MUST -NOT ask for a delete command. - -If the receiving end does not support push-cert, the sending end -MUST NOT send a push-cert command. When a push-cert command is -sent, command-list MUST NOT be sent; the commands recorded in the -push certificate are used instead. - -The packfile MUST NOT be sent if the only command used is 'delete'. - -A packfile MUST be sent if either a create or an update command is used, -even if the server already has all the necessary objects. In this -case the client MUST send an empty packfile. The only time this -is likely to happen is if the client is creating -a new branch or a tag that points to an existing obj-id. - -The server will receive the packfile, unpack it, then validate that each -reference being updated hasn't changed while the request -was being processed (the obj-id is still the same as the old-id), and -it will run any update hooks to make sure that the update is acceptable. -If all of that is fine, the server will then update the references. - -Push Certificate ----------------- - -A push certificate begins with a set of header lines. After the -header and an empty line, the protocol commands follow, one per -line. Note that the trailing LF in push-cert PKT-LINEs is _not_ -optional; it must be present. - -Currently, the following header fields are defined: - -`pusher` ident:: - Identify the GPG key in "Human Readable Name <email@address>" - format. - -`pushee` url:: - The repository URL (anonymized, if the URL contains - authentication material) the user who ran `git push` - intended to push into. - -`nonce` nonce:: - The 'nonce' string the receiving repository asked the - pushing user to include in the certificate, to prevent - replay attacks. - -The GPG signature lines are a detached signature for the contents -recorded in the push certificate before the signature block begins.
-The detached signature is used to certify that the commands were -given by the pusher, who must be the signer. - -Report Status -------------- - -After receiving the pack data from the sender, the receiver sends a -report if the 'report-status' or 'report-status-v2' capability is in effect. -It is a short listing of what happened in that update. It will first -list the status of the packfile unpacking as either 'unpack ok' or -'unpack [error]'. Then it will list the status for each of the references -that it tried to update. Each line is either 'ok [refname]' if the -update was successful, or 'ng [refname] [error]' if the update was not. - ----- - report-status = unpack-status - 1*(command-status) - flush-pkt - - unpack-status = PKT-LINE("unpack" SP unpack-result) - unpack-result = "ok" / error-msg - - command-status = command-ok / command-fail - command-ok = PKT-LINE("ok" SP refname) - command-fail = PKT-LINE("ng" SP refname SP error-msg) - - error-msg = 1*(OCTET) ; where not "ok" ----- - -The 'report-status-v2' capability extends the protocol by adding new option -lines in order to support reporting of references rewritten by the -'proc-receive' hook. The 'proc-receive' hook may handle a command for a -pseudo-reference which may create or update one or more references, and each -reference may have a different name, a different new-oid, and a different old-oid.
- ---- - report-status-v2 = unpack-status - 1*(command-status-v2) - flush-pkt - - unpack-status = PKT-LINE("unpack" SP unpack-result) - unpack-result = "ok" / error-msg - - command-status-v2 = command-ok-v2 / command-fail - command-ok-v2 = command-ok - *option-line - - command-ok = PKT-LINE("ok" SP refname) - command-fail = PKT-LINE("ng" SP refname SP error-msg) - - error-msg = 1*(OCTET) ; where not "ok" - - option-line = *1(option-refname) - *1(option-old-oid) - *1(option-new-oid) - *1(option-forced-update) - - option-refname = PKT-LINE("option" SP "refname" SP refname) - option-old-oid = PKT-LINE("option" SP "old-oid" SP obj-id) - option-new-oid = PKT-LINE("option" SP "new-oid" SP obj-id) - option-forced-update = PKT-LINE("option" SP "forced-update") - ----- - -Updates can be unsuccessful for a number of reasons. The reference can have -changed since the reference advertisement was originally sent, meaning -someone pushed in the meantime. The reference being pushed could be a -non-fast-forward reference and the update hooks or configuration could be -set to not allow that, etc. Also, some references can be updated while others -can be rejected.
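As an illustration of the report-status grammar above, the following Python sketch turns a list of pkt-line payloads (length prefixes already removed, flush-pkt excluded) into a per-reference result. `parse_report_status` is a hypothetical name. The sketch skips 'report-status-v2' option lines rather than interpreting them.

```python
def parse_report_status(payloads):
    """Return (unpack_result, results) where results maps refname -> error or None."""
    lines = [p.rstrip(b"\n").decode() for p in payloads]
    if not lines or not lines[0].startswith("unpack "):
        raise ValueError("report must start with an unpack-status line")
    unpack_result = lines[0][len("unpack "):]  # "ok" or an error message
    results = {}
    for line in lines[1:]:
        kind, _, rest = line.partition(" ")
        if kind == "ok":
            results[rest] = None  # update succeeded
        elif kind == "ng":
            refname, _, error = rest.partition(" ")
            results[refname] = error  # update rejected, with the reason
        elif kind == "option":
            continue  # v2 option lines are ignored in this sketch
        else:
            raise ValueError(f"unexpected status line: {line!r}")
    return unpack_result, results


if __name__ == "__main__":
    print(parse_report_status([
        b"unpack ok\n",
        b"ok refs/heads/debug\n",
        b"ng refs/heads/master non-fast-forward\n",
    ]))
```

Because some references can be updated while others are rejected, a caller would normally inspect every entry instead of stopping at the first failure.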
- -An example client/server communication might look like this: - ----- - S: 006274730d410fcb6603ace96f1dc55ea6196122532d refs/heads/local\0report-status delete-refs ofs-delta\n - S: 003e7d1665144a3a975c05f1f43902ddaf084e784dbe refs/heads/debug\n - S: 003f74730d410fcb6603ace96f1dc55ea6196122532d refs/heads/master\n - S: 003d74730d410fcb6603ace96f1dc55ea6196122532d refs/heads/team\n - S: 0000 - - C: 00677d1665144a3a975c05f1f43902ddaf084e784dbe 74730d410fcb6603ace96f1dc55ea6196122532d refs/heads/debug\n - C: 006874730d410fcb6603ace96f1dc55ea6196122532d 5a3f6be755bbb7deae50065988cbfa1ffa9ab68a refs/heads/master\n - C: 0000 - C: [PACKDATA] - - S: 000eunpack ok\n - S: 0018ok refs/heads/debug\n - S: 002ang refs/heads/master non-fast-forward\n ----- diff --git a/third_party/git/Documentation/technical/packfile-uri.txt b/third_party/git/Documentation/technical/packfile-uri.txt deleted file mode 100644 index 318713abc371..000000000000 --- a/third_party/git/Documentation/technical/packfile-uri.txt +++ /dev/null @@ -1,78 +0,0 @@ -Packfile URIs -============= - -This feature allows servers to serve part of their packfile response as URIs. -This allows server designs that improve scalability in bandwidth and CPU usage -(for example, by serving some data through a CDN), and (in the future) provides -some measure of resumability to clients. - -This feature is available only in protocol version 2. - -Protocol --------- - -The server advertises the `packfile-uris` capability. - -If the client then communicates which protocols (HTTPS, etc.) it supports with -a `packfile-uris` argument, the server MAY send a `packfile-uris` section -directly before the `packfile` section (right after `wanted-refs` if it is -sent) containing URIs of any of the given protocols. The URIs point to -packfiles that use only features that the client has declared that it supports -(e.g. ofs-delta and thin-pack). See protocol-v2.txt for the documentation of -this section. 
- -Clients should then download and index all the given URIs (in addition to -downloading and indexing the packfile given in the `packfile` section of the -response) before performing the connectivity check. - -Server design -------------- - -The server can be trivially made compatible with the proposed protocol by -having it advertise `packfile-uris`, tolerating the client sending -`packfile-uris`, and never sending any `packfile-uris` section. But we should -include some sort of non-trivial implementation in the Minimum Viable Product, -at least so that we can test the client. - -This is the implementation: a feature, marked experimental, that allows the -server to be configured by one or more `uploadpack.blobPackfileUri=<sha1> -<uri>` entries. Whenever the list of objects to be sent is assembled, all such -blobs are excluded, replaced with URIs. The client will download those URIs, -expecting them to each point to packfiles containing single blobs. - -Client design -------------- - -The client has a config variable `fetch.uriprotocols` that determines which -protocols the end user is willing to use. By default, this is empty. - -When the client downloads the given URIs, it should store them with "keep" -files, just like it does with the packfile in the `packfile` section. These -additional "keep" files can only be removed after the refs have been updated - -just like the "keep" file for the packfile in the `packfile` section. - -The division of work (initial fetch + additional URIs) introduces convenient -points for resumption of an interrupted clone - such resumption can be done -after the Minimum Viable Product (see "Future work"). - -Future work ------------ - -The protocol design allows some evolution of the server and client without any -need for protocol changes, so only a small-scoped design is included here to -form the MVP. For example, the following can be done: - - * On the server, more sophisticated means of excluding objects (e.g. 
by - specifying a commit to represent that commit and all objects that it - references). - * On the client, resumption of clone. If a clone is interrupted, information - could be recorded in the repository's config and a "clone-resume" command - can resume the clone in progress. (Resumption of subsequent fetches is more - difficult because that must deal with the user wanting to use the repository - even after the fetch was interrupted.) - -There are some possible features that will require a change in protocol: - - * Additional HTTP headers (e.g. authentication) - * Byte range support - * Different file formats referenced by URIs (e.g. raw object) diff --git a/third_party/git/Documentation/technical/partial-clone.txt b/third_party/git/Documentation/technical/partial-clone.txt deleted file mode 100644 index 0780d30caca6..000000000000 --- a/third_party/git/Documentation/technical/partial-clone.txt +++ /dev/null @@ -1,368 +0,0 @@ -Partial Clone Design Notes -========================== - -The "Partial Clone" feature is a performance optimization for Git that -allows Git to function without having a complete copy of the repository. -The goal of this work is to allow Git better handle extremely large -repositories. - -During clone and fetch operations, Git downloads the complete contents -and history of the repository. This includes all commits, trees, and -blobs for the complete life of the repository. For extremely large -repositories, clones can take hours (or days) and consume 100+GiB of disk -space. - -Often in these repositories there are many blobs and trees that the user -does not need such as: - - 1. files outside of the user's work area in the tree. For example, in - a repository with 500K directories and 3.5M files in every commit, - we can avoid downloading many objects if the user only needs a - narrow "cone" of the source tree. - - 2. large binary assets. 
For example, in a repository where large build - artifacts are checked into the tree, we can avoid downloading all - previous versions of these non-mergeable binary assets and only - download versions that are actually referenced. - -Partial clone allows us to avoid downloading such unneeded objects *in -advance* during clone and fetch operations and thereby reduce download -times and disk usage. Missing objects can later be "demand fetched" -if/when needed. - -A remote that can later provide the missing objects is called a -promisor remote, as it promises to send the objects when -requested. Initially Git supported only one promisor remote, the origin -remote from which the user cloned and that was configured in the -"extensions.partialClone" config option. Later support for more than -one promisor remote has been implemented. - -Use of partial clone requires that the user be online and the origin -remote or other promisor remotes be available for on-demand fetching -of missing objects. This may or may not be problematic for the user. -For example, if the user can stay within the pre-selected subset of -the source tree, they may not encounter any missing objects. -Alternatively, the user could try to pre-fetch various objects if they -know that they are going offline. - - -Non-Goals ---------- - -Partial clone is a mechanism to limit the number of blobs and trees downloaded -*within* a given range of commits -- and is therefore independent of and not -intended to conflict with existing DAG-level mechanisms to limit the set of -requested commits (i.e. shallow clone, single branch, or fetch '<refspec>'). - - -Design Overview ---------------- - -Partial clone logically consists of the following parts: - -- A mechanism for the client to describe unneeded or unwanted objects to - the server. - -- A mechanism for the server to omit such unwanted objects from packfiles - sent to the client. 
- -- A mechanism for the client to gracefully handle missing objects (that - were previously omitted by the server). - -- A mechanism for the client to backfill missing objects as needed. - - -Design Details --------------- - -- A new pack-protocol capability "filter" is added to the fetch-pack and - upload-pack negotiation. -+ -This uses the existing capability discovery mechanism. -See "filter" in Documentation/technical/pack-protocol.txt. - -- Clients pass a "filter-spec" to clone and fetch which is passed to the - server to request filtering during packfile construction. -+ -There are various filters available to accommodate different situations. -See "--filter=<filter-spec>" in Documentation/rev-list-options.txt. - -- On the server pack-objects applies the requested filter-spec as it - creates "filtered" packfiles for the client. -+ -These filtered packfiles are *incomplete* in the traditional sense because -they may contain objects that reference objects not contained in the -packfile and that the client doesn't already have. For example, the -filtered packfile may contain trees or tags that reference missing blobs -or commits that reference missing trees. - -- On the client these incomplete packfiles are marked as "promisor packfiles" - and treated differently by various commands. - -- On the client a repository extension is added to the local config to - prevent older versions of git from failing mid-operation because of - missing objects that they cannot handle. - See "extensions.partialClone" in Documentation/technical/repository-version.txt" - - -Handling Missing Objects ------------------------- - -- An object may be missing due to a partial clone or fetch, or missing - due to repository corruption. To differentiate these cases, the - local repository specially indicates such filtered packfiles - obtained from promisor remotes as "promisor packfiles". 
-+ -These promisor packfiles consist of a "<name>.promisor" file with -arbitrary contents (like the "<name>.keep" files), in addition to -their "<name>.pack" and "<name>.idx" files. - -- The local repository considers a "promisor object" to be an object that - it knows (to the best of its ability) that promisor remotes have promised - that they have, either because the local repository has that object in one of - its promisor packfiles, or because another promisor object refers to it. -+ -When Git encounters a missing object, Git can see if it is a promisor object -and handle it appropriately. If not, Git can report a corruption. -+ -This means that there is no need for the client to explicitly maintain an -expensive-to-modify list of missing objects.[a] - -- Since almost all Git code currently expects any referenced object to be - present locally and because we do not want to force every command to do - a dry-run first, a fallback mechanism is added to allow Git to attempt - to dynamically fetch missing objects from promisor remotes. -+ -When the normal object lookup fails to find an object, Git invokes -promisor_remote_get_direct() to try to get the object from a promisor -remote and then retry the object lookup. This allows objects to be -"faulted in" without complicated prediction algorithms. -+ -For efficiency reasons, no check as to whether the missing object is -actually a promisor object is performed. -+ -Dynamic object fetching tends to be slow as objects are fetched one at -a time. - -- `checkout` (and any other command using `unpack-trees`) has been taught - to bulk pre-fetch all required missing blobs in a single batch. - -- `rev-list` has been taught to print missing objects. -+ -This can be used by other commands to bulk prefetch objects. -For example, a "git log -p A..B" may internally want to first do -something like "git rev-list --objects --quiet --missing=print A..B" -and prefetch those objects in bulk. 
- -- `fsck` has been updated to be fully aware of promisor objects. - -- `repack` in GC has been updated to not touch promisor packfiles at all, - and to only repack other objects. - -- The global variable "fetch_if_missing" is used to control whether an - object lookup will attempt to dynamically fetch a missing object or - report an error. -+ -We are not happy with this global variable and would like to remove it, -but that requires significant refactoring of the object code to pass an -additional flag. - - -Fetching Missing Objects ------------------------- - -- Fetching of objects is done by invoking a "git fetch" subprocess. - -- The local repository sends a request with the hashes of all requested - objects, and does not perform any packfile negotiation. - It then receives a packfile. - -- Because we are reusing the existing fetch mechanism, fetching - currently fetches all objects referred to by the requested objects, even - though they are not necessary. - - -Using many promisor remotes ---------------------------- - -Many promisor remotes can be configured and used. - -This allows for example a user to have multiple geographically-close -cache servers for fetching missing blobs while continuing to do -filtered `git-fetch` commands from the central server. - -When fetching objects, promisor remotes are tried one after the other -until all the objects have been fetched. - -Remotes that are considered "promisor" remotes are those specified by -the following configuration variables: - -- `extensions.partialClone = <name>` - -- `remote.<name>.promisor = true` - -- `remote.<name>.partialCloneFilter = ...` - -Only one promisor remote can be configured using the -`extensions.partialClone` config variable. This promisor remote will -be the last one tried when fetching objects. 
- -We decided to make it the last one we try, because it is likely that -someone using many promisor remotes is doing so because the other -promisor remotes are better for some reason (maybe they are closer or -faster for some kind of objects) than the origin, and the origin is -likely to be the remote specified by extensions.partialClone. - -This justification is not very strong, but one choice had to be made, -and anyway the long term plan should be to make the order somehow -fully configurable. - -For now though the other promisor remotes will be tried in the order -they appear in the config file. - -Current Limitations -------------------- - -- It is not possible to specify the order in which the promisor - remotes are tried in other ways than the order in which they appear - in the config file. -+ -It is also not possible to specify an order to be used when fetching -from one remote and a different order when fetching from another -remote. - -- It is not possible to push only specific objects to a promisor - remote. -+ -It is not possible to push at the same time to multiple promisor -remote in a specific order. - -- Dynamic object fetching will only ask promisor remotes for missing - objects. We assume that promisor remotes have a complete view of the - repository and can satisfy all such requests. - -- Repack essentially treats promisor and non-promisor packfiles as 2 - distinct partitions and does not mix them. Repack currently only works - on non-promisor packfiles and loose objects. - -- Dynamic object fetching invokes fetch-pack once *for each item* - because most algorithms stumble upon a missing object and need to have - it resolved before continuing their work. This may incur significant - overhead -- and multiple authentication requests -- if many objects are - needed. - -- Dynamic object fetching currently uses the existing pack protocol V0 - which means that each object is requested via fetch-pack. 
The server - will send a full set of info/refs when the connection is established. - If there are large number of refs, this may incur significant overhead. - - -Future Work ------------ - -- Improve the way to specify the order in which promisor remotes are - tried. -+ -For example this could allow to specify explicitly something like: -"When fetching from this remote, I want to use these promisor remotes -in this order, though, when pushing or fetching to that remote, I want -to use those promisor remotes in that order." - -- Allow pushing to promisor remotes. -+ -The user might want to work in a triangular work flow with multiple -promisor remotes that each have an incomplete view of the repository. - -- Allow repack to work on promisor packfiles (while keeping them distinct - from non-promisor packfiles). - -- Allow non-pathname-based filters to make use of packfile bitmaps (when - present). This was just an omission during the initial implementation. - -- Investigate use of a long-running process to dynamically fetch a series - of objects, such as proposed in [5,6] to reduce process startup and - overhead costs. -+ -It would be nice if pack protocol V2 could allow that long-running -process to make a series of requests over a single long-running -connection. - -- Investigate pack protocol V2 to avoid the info/refs broadcast on - each connection with the server to dynamically fetch missing objects. - -- Investigate the need to handle loose promisor objects. -+ -Objects in promisor packfiles are allowed to reference missing objects -that can be dynamically fetched from the server. An assumption was -made that loose objects are only created locally and therefore should -not reference a missing object. We may need to revisit that assumption -if, for example, we dynamically fetch a missing tree and store it as a -loose object rather than a single object packfile. 
-+ -This does not necessarily mean we need to mark loose objects as promisor; -it may be sufficient to relax the object lookup or is-promisor functions. - - -Non-Tasks ---------- - -- Every time the subject of "demand loading blobs" comes up it seems - that someone suggests that the server be allowed to "guess" and send - additional objects that may be related to the requested objects. -+ -No work has gone into actually doing that; we're just documenting that -it is a common suggestion. We're not sure how it would work and have -no plans to work on it. -+ -It is valid for the server to send more objects than requested (even -for a dynamic object fetch), but we are not building on that. - - -Footnotes ---------- - -[a] expensive-to-modify list of missing objects: Earlier in the design of - partial clone we discussed the need for a single list of missing objects. - This would essentially be a sorted linear list of OIDs that were - omitted by the server during a clone or subsequent fetches. - -This file would need to be loaded into memory on every object lookup. -It would need to be read, updated, and re-written (like the .git/index) -on every explicit "git fetch" command *and* on any dynamic object fetch. - -The cost to read, update, and write this file could add significant -overhead to every command if there are many missing objects. For example, -if there are 100M missing blobs, this file would be at least 2GiB on disk. - -With the "promisor" concept, we *infer* a missing object based upon the -type of packfile that references it. 
- - -Related Links -------------- -[0] https://crbug.com/git/2 - Bug#2: Partial Clone - -[1] https://lore.kernel.org/git/20170113155253.1644-1-benpeart@microsoft.com/ + - Subject: [RFC] Add support for downloading blobs on demand + - Date: Fri, 13 Jan 2017 10:52:53 -0500 - -[2] https://lore.kernel.org/git/cover.1506714999.git.jonathantanmy@google.com/ + - Subject: [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches) + - Date: Fri, 29 Sep 2017 13:11:36 -0700 - -[3] https://lore.kernel.org/git/20170426221346.25337-1-jonathantanmy@google.com/ + - Subject: Proposal for missing blob support in Git repos + - Date: Wed, 26 Apr 2017 15:13:46 -0700 - -[4] https://lore.kernel.org/git/1488999039-37631-1-git-send-email-git@jeffhostetler.com/ + - Subject: [PATCH 00/10] RFC Partial Clone and Fetch + - Date: Wed, 8 Mar 2017 18:50:29 +0000 - -[5] https://lore.kernel.org/git/20170505152802.6724-1-benpeart@microsoft.com/ + - Subject: [PATCH v7 00/10] refactor the filter process code into a reusable module + - Date: Fri, 5 May 2017 11:27:52 -0400 - -[6] https://lore.kernel.org/git/20170714132651.170708-1-benpeart@microsoft.com/ + - Subject: [RFC/PATCH v2 0/1] Add support for downloading blobs on demand + - Date: Fri, 14 Jul 2017 09:26:50 -0400 diff --git a/third_party/git/Documentation/technical/protocol-capabilities.txt b/third_party/git/Documentation/technical/protocol-capabilities.txt deleted file mode 100644 index ba869a7d366a..000000000000 --- a/third_party/git/Documentation/technical/protocol-capabilities.txt +++ /dev/null @@ -1,367 +0,0 @@ -Git Protocol Capabilities -========================= - -NOTE: this document describes capabilities for versions 0 and 1 of the pack -protocol. For version 2, please refer to the link:protocol-v2.html[protocol-v2] -doc. - -Servers SHOULD support all capabilities defined in this document. 
- -On the very first line of the initial server response of either -receive-pack and upload-pack the first reference is followed by -a NUL byte and then a list of space delimited server capabilities. -These allow the server to declare what it can and cannot support -to the client. - -Client will then send a space separated list of capabilities it wants -to be in effect. The client MUST NOT ask for capabilities the server -did not say it supports. - -Server MUST diagnose and abort if capabilities it does not understand -was sent. Server MUST NOT ignore capabilities that client requested -and server advertised. As a consequence of these rules, server MUST -NOT advertise capabilities it does not understand. - -The 'atomic', 'report-status', 'report-status-v2', 'delete-refs', 'quiet', -and 'push-cert' capabilities are sent and recognized by the receive-pack -(push to server) process. - -The 'ofs-delta' and 'side-band-64k' capabilities are sent and recognized -by both upload-pack and receive-pack protocols. The 'agent' capability -may optionally be sent in both protocols. - -All other capabilities are only recognized by the upload-pack (fetch -from server) process. - -multi_ack ---------- - -The 'multi_ack' capability allows the server to return "ACK obj-id -continue" as soon as it finds a commit that it can use as a common -base, between the client's wants and the client's have set. - -By sending this early, the server can potentially head off the client -from walking any further down that particular branch of the client's -repository history. The client may still need to walk down other -branches, sending have lines for those, until the server has a -complete cut across the DAG, or the client has said "done". - -Without multi_ack, a client sends have lines in --date-order until -the server has found a common base. 
That means the client will send -have lines that are already known by the server to be common, because -they overlap in time with another branch that the server hasn't found -a common base on yet. - -For example suppose the client has commits in caps that the server -doesn't and the server has commits in lower case that the client -doesn't, as in the following diagram: - - +---- u ---------------------- x - / +----- y - / / - a -- b -- c -- d -- E -- F - \ - +--- Q -- R -- S - -If the client wants x,y and starts out by saying have F,S, the server -doesn't know what F,S is. Eventually the client says "have d" and -the server sends "ACK d continue" to let the client know to stop -walking down that line (so don't send c-b-a), but it's not done yet, -it needs a base for x. The client keeps going with S-R-Q, until a -gets reached, at which point the server has a clear base and it all -ends. - -Without multi_ack the client would have sent that c-b-a chain anyway, -interleaved with S-R-Q. - -multi_ack_detailed ------------------- -This is an extension of multi_ack that permits client to better -understand the server's in-memory state. See pack-protocol.txt, -section "Packfile Negotiation" for more information. - -no-done -------- -This capability should only be used with the smart HTTP protocol. If -multi_ack_detailed and no-done are both present, then the sender is -free to immediately send a pack following its first "ACK obj-id ready" -message. - -Without no-done in the smart HTTP protocol, the server session would -end and the client has to make another trip to send "done" before -the server can send the pack. no-done removes the last round and -thus slightly reduces latency. - -thin-pack ---------- - -A thin pack is one with deltas which reference base objects not -contained within the pack (but are known to exist at the receiving -end). 
This can reduce the network traffic significantly, but it -requires the receiving end to know how to "thicken" these packs by -adding the missing bases to the pack. - -The upload-pack server advertises 'thin-pack' when it can generate -and send a thin pack. A client requests the 'thin-pack' capability -when it understands how to "thicken" it, notifying the server that -it can receive such a pack. A client MUST NOT request the -'thin-pack' capability if it cannot turn a thin pack into a -self-contained pack. - -Receive-pack, on the other hand, is assumed by default to be able to -handle thin packs, but can ask the client not to use the feature by -advertising the 'no-thin' capability. A client MUST NOT send a thin -pack if the server advertises the 'no-thin' capability. - -The reasons for this asymmetry are historical. The receive-pack -program did not exist until after the invention of thin packs, so -historically the reference implementation of receive-pack always -understood thin packs. Adding 'no-thin' later allowed receive-pack -to disable the feature in a backwards-compatible manner. - - -side-band, side-band-64k ------------------------- - -This capability means that server can send, and client understand multiplexed -progress reports and error info interleaved with the packfile itself. - -These two options are mutually exclusive. A modern client always -favors 'side-band-64k'. - -Either mode indicates that the packfile data will be streamed broken -up into packets of up to either 1000 bytes in the case of 'side_band', -or 65520 bytes in the case of 'side_band_64k'. Each packet is made up -of a leading 4-byte pkt-line length of how much data is in the packet, -followed by a 1-byte stream code, followed by the actual data. 
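The packet layout just described (pkt-line length, then a 1-byte stream code, then data) can be sketched as a small demultiplexer. This is only an illustration, not Git's implementation: the function name `demux_sideband` and the constant names are hypothetical, and the input is assumed to be a list of pkt-line payloads with the 4-byte length prefix already stripped. The stream code values 1, 2, and 3 are the ones this document assigns to pack data, progress messages, and fatal errors.

```python
# Hypothetical names; stream code values are those listed in this document.
PACK_DATA = 1      # pack data
PROGRESS = 2       # progress messages
FATAL_ERROR = 3    # fatal error message just before the stream aborts


def demux_sideband(payloads):
    """Split side-band payloads into (pack_bytes, progress_messages).

    Each payload is the data part of one pkt-line (length prefix removed):
    one stream-code byte followed by the actual data.
    """
    pack = bytearray()
    progress = []
    for payload in payloads:
        if not payload:
            raise ValueError("side-band packet without a stream code")
        code, data = payload[0], payload[1:]
        if code == PACK_DATA:
            pack += data
        elif code == PROGRESS:
            progress.append(data)
        elif code == FATAL_ERROR:
            # The sender aborts after this message, so stop here.
            raise RuntimeError(data.decode("utf-8", errors="replace"))
        else:
            raise ValueError(f"unknown side-band stream code {code}")
    return bytes(pack), progress
```

A caller would typically feed the returned pack bytes to an index-pack step and show the progress messages to the user; both of those steps are outside this sketch.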
- -The stream code can be one of: - - 1 - pack data - 2 - progress messages - 3 - fatal error message just before stream aborts - -The "side-band-64k" capability came about as a way for newer clients -that can handle much larger packets to request packets that are -actually crammed nearly full, while maintaining backward compatibility -for the older clients. - -Further, with side-band and its up to 1000-byte messages, it's actually -999 bytes of payload and 1 byte for the stream code. With side-band-64k, -same deal, you have up to 65519 bytes of data and 1 byte for the stream -code. - -The client MUST send only maximum of one of "side-band" and "side- -band-64k". Server MUST diagnose it as an error if client requests -both. - -ofs-delta ---------- - -Server can send, and client understand PACKv2 with delta referring to -its base by position in pack rather than by an obj-id. That is, they can -send/read OBJ_OFS_DELTA (aka type 6) in a packfile. - -agent ------ - -The server may optionally send a capability of the form `agent=X` to -notify the client that the server is running version `X`. The client may -optionally return its own agent string by responding with an `agent=Y` -capability (but it MUST NOT do so if the server did not mention the -agent capability). The `X` and `Y` strings may contain any printable -ASCII characters except space (i.e., the byte range 32 < x < 127), and -are typically of the form "package/version" (e.g., "git/1.8.3.1"). The -agent strings are purely informative for statistics and debugging -purposes, and MUST NOT be used to programmatically assume the presence -or absence of particular features. - -object-format -------------- - -This capability, which takes a hash algorithm as an argument, indicates -that the server supports the given hash algorithms. It may be sent -multiple times; if so, the first one given is the one used in the ref -advertisement. 
- -When provided by the client, this indicates that it intends to use the -given hash algorithm to communicate. The algorithm provided must be one -that the server supports. - -If this capability is not provided, it is assumed that the only -supported algorithm is SHA-1. - -symref ------- - -This parameterized capability is used to inform the receiver which symbolic ref -points to which ref; for example, "symref=HEAD:refs/heads/master" tells the -receiver that HEAD points to master. This capability can be repeated to -represent multiple symrefs. - -Servers SHOULD include this capability for the HEAD symref if it is one of the -refs being sent. - -Clients MAY use the parameters from this capability to select the proper initial -branch when cloning a repository. - -shallow -------- - -This capability adds "deepen", "shallow" and "unshallow" commands to -the fetch-pack/upload-pack protocol so clients can request shallow -clones. - -deepen-since ------------- - -This capability adds "deepen-since" command to fetch-pack/upload-pack -protocol so the client can request shallow clones that are cut at a -specific time, instead of depth. Internally it's equivalent of doing -"rev-list --max-age=<timestamp>" on the server side. "deepen-since" -cannot be used with "deepen". - -deepen-not ----------- - -This capability adds "deepen-not" command to fetch-pack/upload-pack -protocol so the client can request shallow clones that are cut at a -specific revision, instead of depth. Internally it's equivalent of -doing "rev-list --not <rev>" on the server side. "deepen-not" -cannot be used with "deepen", but can be used with "deepen-since". - -deepen-relative ---------------- - -If this capability is requested by the client, the semantics of -"deepen" command is changed. The "depth" argument is the depth from -the current shallow boundary, instead of the depth from remote refs. 
- -no-progress ------------ - -The client was started with "git clone -q" or something, and doesn't -want that side band 2. Basically the client just says "I do not -wish to receive stream 2 on sideband, so do not send it to me, and if -you did, I will drop it on the floor anyway". However, the sideband -channel 3 is still used for error responses. - -include-tag ------------ - -The 'include-tag' capability is about sending annotated tags if we are -sending objects they point to. If we pack an object to the client, and -a tag object points exactly at that object, we pack the tag object too. -In general this allows a client to get all new annotated tags when it -fetches a branch, in a single network connection. - -Clients MAY always send include-tag, hardcoding it into a request when -the server advertises this capability. The decision for a client to -request include-tag only has to do with the client's desires for tag -data, whether or not a server had advertised objects in the -refs/tags/* namespace. - -Servers MUST pack the tags if their referent is packed and the client -has requested include-tag. - -Clients MUST be prepared for the case where a server has ignored -include-tag and has not actually sent tags in the pack. In such -cases the client SHOULD issue a subsequent fetch to acquire the tags -that include-tag would have otherwise given the client. - -The server SHOULD send include-tag, if it supports it, regardless -of whether or not there are tags available. - -report-status -------------- - -The receive-pack process can receive a 'report-status' capability, -which tells it that the client wants a report of what happened after -a packfile upload and reference update. If the pushing client requests -this capability, after unpacking and updating references the server -will respond with whether the packfile unpacked successfully and if -each reference was updated successfully. If any of those were not -successful, it will send back an error message. 
See pack-protocol.txt -for example messages. - -report-status-v2 ----------------- - -Capability 'report-status-v2' extends capability 'report-status' by -adding new "option" directives in order to support reference rewritten by -the "proc-receive" hook. The "proc-receive" hook may handle a command -for a pseudo-reference which may create or update a reference with -different name, new-oid, and old-oid. While the capability -'report-status' cannot report for such case. See pack-protocol.txt -for details. - -delete-refs ------------ - -If the server sends back the 'delete-refs' capability, it means that -it is capable of accepting a zero-id value as the target -value of a reference update. It is not sent back by the client, it -simply informs the client that it can be sent zero-id values -to delete references. - -quiet ------ - -If the receive-pack server advertises the 'quiet' capability, it is -capable of silencing human-readable progress output which otherwise may -be shown when processing the received pack. A send-pack client should -respond with the 'quiet' capability to suppress server-side progress -reporting if the local progress reporting is also being suppressed -(e.g., via `push -q`, or if stderr does not go to a tty). - -atomic ------- - -If the server sends the 'atomic' capability it is capable of accepting -atomic pushes. If the pushing client requests this capability, the server -will update the refs in one atomic transaction. Either all refs are -updated or none. - -push-options ------------- - -If the server sends the 'push-options' capability it is able to accept -push options after the update commands have been sent, but before the -packfile is streamed. If the pushing client requests this capability, -the server will pass the options to the pre- and post- receive hooks -that process this push request. 
- -allow-tip-sha1-in-want ----------------------- - -If the upload-pack server advertises this capability, fetch-pack may -send "want" lines with object names that exist at the server but are not -advertised by upload-pack. For historical reasons, the name of this -capability contains "sha1". Object names are always given using the -object format negotiated through the 'object-format' capability. - -allow-reachable-sha1-in-want ----------------------------- - -If the upload-pack server advertises this capability, fetch-pack may -send "want" lines with object names that exist at the server but are not -advertised by upload-pack. For historical reasons, the name of this -capability contains "sha1". Object names are always given using the -object format negotiated through the 'object-format' capability. - -push-cert=<nonce> ------------------ - -The receive-pack server that advertises this capability is willing -to accept a signed push certificate, and asks the <nonce> to be -included in the push certificate. A send-pack client MUST NOT -send a push-cert packet unless the receive-pack server advertises -this capability. - -filter ------- - -If the upload-pack server advertises the 'filter' capability, -fetch-pack may send "filter" commands to request a partial clone -or partial fetch and request that the server omit various objects -from the packfile. 
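The capability exchange that protocol-capabilities.txt describes (the first reference followed by a NUL byte and a space-delimited capability list, and a client that MUST NOT ask for anything the server did not advertise) can be sketched as follows. This is a rough illustration under stated assumptions, not Git's parser: the function names `parse_first_advertisement` and `choose_capabilities` are hypothetical, the payload is assumed to be a single pkt-line payload with the length prefix already removed, and the object name in the usage note is a placeholder.

```python
def parse_first_advertisement(payload: bytes):
    """Parse '<obj-id> <refname>NUL<cap> <cap> ...' into its parts.

    Returns (obj_id, refname, capabilities), where capabilities maps each
    capability name to the list of values it was sent with. A capability
    without '=value' gets an empty-string value; repeated capabilities
    (for example several 'symref=' entries) keep every value in order.
    """
    line = payload.rstrip(b"\n")
    ref_part, _, cap_part = line.partition(b"\x00")
    obj_id, _, refname = ref_part.partition(b" ")
    capabilities = {}
    for token in cap_part.decode("ascii").split(" "):
        if not token:
            continue
        name, _, value = token.partition("=")
        capabilities.setdefault(name, []).append(value)
    return obj_id.decode("ascii"), refname.decode("ascii"), capabilities


def choose_capabilities(wanted, advertised):
    """Keep only the wanted capabilities whose name the server advertised."""
    return [cap for cap in wanted if cap.partition("=")[0] in advertised]
```

With a placeholder object name of forty `a` characters, a payload such as `aaaa... HEAD\0multi_ack thin-pack side-band-64k symref=HEAD:refs/heads/master agent=git/1.8.3.1\n` parses into the refname `HEAD` and a mapping in which `symref` carries `HEAD:refs/heads/master`. Filtering the client's wish list through `choose_capabilities` before sending it is one simple way to respect the MUST NOT rule.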
diff --git a/third_party/git/Documentation/technical/protocol-common.txt b/third_party/git/Documentation/technical/protocol-common.txt deleted file mode 100644 index ecedb34bba54..000000000000 --- a/third_party/git/Documentation/technical/protocol-common.txt +++ /dev/null @@ -1,99 +0,0 @@ -Documentation Common to Pack and Http Protocols -=============================================== - -ABNF Notation -------------- - -ABNF notation as described by RFC 5234 is used within the protocol documents, -except the following replacement core rules are used: ----- - HEXDIG = DIGIT / "a" / "b" / "c" / "d" / "e" / "f" ----- - -We also define the following common rules: ----- - NUL = %x00 - zero-id = 40*"0" - obj-id = 40*(HEXDIGIT) - - refname = "HEAD" - refname /= "refs/" <see discussion below> ----- - -A refname is a hierarchical octet string beginning with "refs/" and -not violating the 'git-check-ref-format' command's validation rules. -More specifically, they: - -. They can include slash `/` for hierarchical (directory) - grouping, but no slash-separated component can begin with a - dot `.`. - -. They must contain at least one `/`. This enforces the presence of a - category like `heads/`, `tags/` etc. but the actual names are not - restricted. - -. They cannot have two consecutive dots `..` anywhere. - -. They cannot have ASCII control characters (i.e. bytes whose - values are lower than \040, or \177 `DEL`), space, tilde `~`, - caret `^`, colon `:`, question-mark `?`, asterisk `*`, - or open bracket `[` anywhere. - -. They cannot end with a slash `/` or a dot `.`. - -. They cannot end with the sequence `.lock`. - -. They cannot contain a sequence `@{`. - -. They cannot contain a `\\`. - - -pkt-line Format ---------------- - -Much (but not all) of the payload is described around pkt-lines. - -A pkt-line is a variable length binary string. The first four bytes -of the line, the pkt-len, indicates the total length of the line, -in hexadecimal. 
The pkt-len includes the 4 bytes used to contain -the length's hexadecimal representation. - -A pkt-line MAY contain binary data, so implementors MUST ensure -pkt-line parsing/formatting routines are 8-bit clean. - -A non-binary line SHOULD BE terminated by an LF, which if present -MUST be included in the total length. Receivers MUST treat pkt-lines -with non-binary data the same whether or not they contain the trailing -LF (stripping the LF if present, and not complaining when it is -missing). - -The maximum length of a pkt-line's data component is 65516 bytes. -Implementations MUST NOT send pkt-line whose length exceeds 65520 -(65516 bytes of payload + 4 bytes of length data). - -Implementations SHOULD NOT send an empty pkt-line ("0004"). - -A pkt-line with a length field of 0 ("0000"), called a flush-pkt, -is a special case and MUST be handled differently than an empty -pkt-line ("0004"). - ----- - pkt-line = data-pkt / flush-pkt - - data-pkt = pkt-len pkt-payload - pkt-len = 4*(HEXDIG) - pkt-payload = (pkt-len - 4)*(OCTET) - - flush-pkt = "0000" ----- - -Examples (as C-style strings): - ----- - pkt-line actual value - --------------------------------- - "0006a\n" "a\n" - "0005a" "a" - "000bfoobar\n" "foobar\n" - "0004" "" ----- diff --git a/third_party/git/Documentation/technical/protocol-v2.txt b/third_party/git/Documentation/technical/protocol-v2.txt deleted file mode 100644 index e597b74da39d..000000000000 --- a/third_party/git/Documentation/technical/protocol-v2.txt +++ /dev/null @@ -1,494 +0,0 @@ -Git Wire Protocol, Version 2 -============================ - -This document presents a specification for a version 2 of Git's wire -protocol. 
Protocol v2 will improve upon v1 in the following ways: - - * Instead of multiple service names, multiple commands will be - supported by a single service - * Easily extendable as capabilities are moved into their own section - of the protocol, no longer being hidden behind a NUL byte and - limited by the size of a pkt-line - * Separate out other information hidden behind NUL bytes (e.g. agent - string as a capability and symrefs can be requested using 'ls-refs') - * Reference advertisement will be omitted unless explicitly requested - * ls-refs command to explicitly request some refs - * Designed with http and stateless-rpc in mind. With clear flush - semantics the http remote helper can simply act as a proxy - -In protocol v2 communication is command oriented. When first contacting a -server a list of capabilities will be advertised. Some of these capabilities -will be commands which a client can request be executed. Once a command -has completed, a client can reuse the connection and request that other -commands be executed. - -Packet-Line Framing -------------------- - -All communication is done using packet-line framing, just as in v1. See -`Documentation/technical/pack-protocol.txt` and -`Documentation/technical/protocol-common.txt` for more information. - -In protocol v2 these special packets will have the following semantics: - - * '0000' Flush Packet (flush-pkt) - indicates the end of a message - * '0001' Delimiter Packet (delim-pkt) - separates sections of a message - * '0002' Message Packet (response-end-pkt) - indicates the end of a response - for stateless connections - -Initial Client Request ----------------------- - -In general a client can request to speak protocol v2 by sending -`version=2` through the respective side-channel for the transport being -used which inevitably sets `GIT_PROTOCOL`. More information can be -found in `pack-protocol.txt` and `http-protocol.txt`. In all cases the -response from the server is the capability advertisement.
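The pkt-line framing from protocol-common.txt and the three special packets listed under "Packet-Line Framing" can be sketched as follows. This is a minimal illustration, not Git's implementation; the names `encode_pkt_line`, `decode_pkt_lines` and `SPECIAL_PACKETS` are hypothetical and chosen only for this example.

```python
# Hypothetical helper names; only the wire format comes from the text above.
MAX_PAYLOAD = 65516  # maximum length of a pkt-line's data component
SPECIAL_PACKETS = {0: "flush", 1: "delim", 2: "response-end"}


def encode_pkt_line(payload: bytes) -> bytes:
    """Frame payload as a data-pkt: 4 lowercase hex digits, then the data."""
    if len(payload) > MAX_PAYLOAD:
        raise ValueError("pkt-line payload too long")
    # pkt-len counts its own 4 bytes in addition to the payload.
    return b"%04x" % (len(payload) + 4) + payload


def decode_pkt_lines(data: bytes):
    """Yield (kind, payload) pairs; payload is None for special packets."""
    pos = 0
    while pos < len(data):
        header = data[pos:pos + 4]
        if len(header) < 4:
            raise ValueError("truncated pkt-len")
        length = int(header, 16)
        if length in SPECIAL_PACKETS:
            yield SPECIAL_PACKETS[length], None
            pos += 4
            continue
        if length < 4 or pos + length > len(data):
            raise ValueError("invalid or truncated pkt-line")
        yield "data", data[pos + 4:pos + length]
        pos += length
```

Because the payload is handled as raw bytes throughout, the sketch stays 8-bit clean, which the pkt-line description requires.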
- -Git Transport -~~~~~~~~~~~~~ - -When using the git:// transport, you can request to use protocol v2 by -sending "version=2" as an extra parameter: - - 003egit-upload-pack /project.git\0host=myserver.com\0\0version=2\0 - -SSH and File Transport -~~~~~~~~~~~~~~~~~~~~~~ - -When using either the ssh:// or file:// transport, the GIT_PROTOCOL -environment variable must be set explicitly to include "version=2". - -HTTP Transport -~~~~~~~~~~~~~~ - -When using the http:// or https:// transport a client makes a "smart" -info/refs request as described in `http-protocol.txt` and requests that -v2 be used by supplying "version=2" in the `Git-Protocol` header. - - C: GET $GIT_URL/info/refs?service=git-upload-pack HTTP/1.0 - C: Git-Protocol: version=2 - -A v2 server would reply: - - S: 200 OK - S: <Some headers> - S: ... - S: - S: 000eversion 2\n - S: <capability-advertisement> - -Subsequent requests are then made directly to the service -`$GIT_URL/git-upload-pack`. (This works the same for git-receive-pack). - -Capability Advertisement ------------------------- - -A server which decides to communicate (based on a request from a client) -using protocol version 2, notifies the client by sending a version string -in its initial response followed by an advertisement of its capabilities. -Each capability is a key with an optional value. Clients must ignore all -unknown keys. Semantics of unknown values are left to the definition of -each key. Some capabilities will describe commands which can be requested -to be executed by the client. 
- - capability-advertisement = protocol-version - capability-list - flush-pkt - - protocol-version = PKT-LINE("version 2" LF) - capability-list = *capability - capability = PKT-LINE(key[=value] LF) - - key = 1*(ALPHA | DIGIT | "-_") - value = 1*(ALPHA | DIGIT | " -_.,?\/{}[]()<>!@#$%^&*+=:;") - -Command Request ---------------- - -After receiving the capability advertisement, a client can then issue a -request to select the command it wants with any particular capabilities -or arguments. There is then an optional section where the client can -provide any command specific parameters or queries. Only a single -command can be requested at a time. - - request = empty-request | command-request - empty-request = flush-pkt - command-request = command - capability-list - [command-args] - flush-pkt - command = PKT-LINE("command=" key LF) - command-args = delim-pkt - *command-specific-arg - - command-specific-args are packet line framed arguments defined by - each individual command. - -The server will then check to ensure that the client's request is -comprised of a valid command as well as valid capabilities which were -advertised. If the request is valid the server will then execute the -command. A server MUST wait till it has received the client's entire -request before issuing a response. The format of the response is -determined by the command being executed, but in all cases a flush-pkt -indicates the end of the response. - -When a command has finished, and the client has received the entire -response from the server, a client can either request that another -command be executed or can terminate the connection. A client may -optionally send an empty request consisting of just a flush-pkt to -indicate that no more requests will be made. 
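The `command-request` grammar above can be sketched as a small builder. This is an illustration under assumptions: the helper names `pkt` and `build_command_request` are hypothetical, and the argument strings in the usage note are examples rather than values taken from any particular server.

```python
FLUSH_PKT = b"0000"
DELIM_PKT = b"0001"


def pkt(payload: bytes) -> bytes:
    """Frame payload as a pkt-line (hypothetical helper)."""
    return b"%04x" % (len(payload) + 4) + payload


def build_command_request(command, capabilities=(), args=()):
    """Serialize: command, capability-list, [delim-pkt + args], flush-pkt."""
    parts = [pkt(b"command=" + command.encode() + b"\n")]
    for cap in capabilities:
        parts.append(pkt(cap.encode() + b"\n"))
    if args:
        # command-args is optional and starts with a delim-pkt.
        parts.append(DELIM_PKT)
        for arg in args:
            parts.append(pkt(arg.encode() + b"\n"))
    parts.append(FLUSH_PKT)
    return b"".join(parts)
```

For example, `build_command_request("ls-refs", args=["peel"])` produces a request with a single command line, a delimiter packet, one argument and a trailing flush packet. A client should only pass capabilities that the server advertised.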
- -Capabilities ------------- - -There are two different types of capabilities: normal capabilities, -which can be used to convey information or alter the behavior of a -request, and commands, which are the core actions that a client wants to -perform (fetch, push, etc). - -Protocol version 2 is stateless by default. This means that all commands -must only last a single round and be stateless from the perspective of the -server side, unless the client has requested a capability indicating that -state should be maintained by the server. Clients MUST NOT require state -management on the server side in order to function correctly. This -permits simple round-robin load-balancing on the server side, without -needing to worry about state management. - -agent -~~~~~ - -The server can advertise the `agent` capability with a value `X` (in the -form `agent=X`) to notify the client that the server is running version -`X`. The client may optionally send its own agent string by including -the `agent` capability with a value `Y` (in the form `agent=Y`) in its -request to the server (but it MUST NOT do so if the server did not -advertise the agent capability). The `X` and `Y` strings may contain any -printable ASCII characters except space (i.e., the byte range 32 < x < -127), and are typically of the form "package/version" (e.g., -"git/1.8.3.1"). The agent strings are purely informative for statistics -and debugging purposes, and MUST NOT be used to programmatically assume -the presence or absence of particular features. - -ls-refs -~~~~~~~ - -`ls-refs` is the command used to request a reference advertisement in v2. -Unlike the current reference advertisement, ls-refs takes in arguments -which can be used to limit the refs sent from the server. 
- -Additional features not supported in the base command will be advertised -as the value of the command in the capability advertisement in the form -of a space separated list of features: "<command>=<feature 1> <feature 2>" - -ls-refs takes in the following arguments: - - symrefs - In addition to the object pointed by it, show the underlying ref - pointed by it when showing a symbolic ref. - peel - Show peeled tags. - ref-prefix <prefix> - When specified, only references having a prefix matching one of - the provided prefixes are displayed. - -The output of ls-refs is as follows: - - output = *ref - flush-pkt - ref = PKT-LINE(obj-id SP refname *(SP ref-attribute) LF) - ref-attribute = (symref | peeled) - symref = "symref-target:" symref-target - peeled = "peeled:" obj-id - -fetch -~~~~~ - -`fetch` is the command used to fetch a packfile in v2. It can be looked -at as a modified version of the v1 fetch where the ref-advertisement is -stripped out (since the `ls-refs` command fills that role) and the -message format is tweaked to eliminate redundancies and permit easy -addition of future extensions. - -Additional features not supported in the base command will be advertised -as the value of the command in the capability advertisement in the form -of a space separated list of features: "<command>=<feature 1> <feature 2>" - -A `fetch` request can take the following arguments: - - want <oid> - Indicates to the server an object which the client wants to - retrieve. Wants can be anything and are not limited to - advertised objects. - - have <oid> - Indicates to the server an object which the client has locally. - This allows the server to make a packfile which only contains - the objects that the client needs. Multiple 'have' lines can be - supplied. - - done - Indicates to the server that negotiation should terminate (or - not even begin if performing a clone) and that the server should - use the information supplied in the request to construct the - packfile. 
- - thin-pack - Request that a thin pack be sent, which is a pack with deltas - which reference base objects not contained within the pack (but - are known to exist at the receiving end). This can reduce the - network traffic significantly, but it requires the receiving end - to know how to "thicken" these packs by adding the missing bases - to the pack. - - no-progress - Request that progress information that would normally be sent on - side-band channel 2, during the packfile transfer, should not be - sent. However, the side-band channel 3 is still used for error - responses. - - include-tag - Request that annotated tags should be sent if the objects they - point to are being sent. - - ofs-delta - Indicate that the client understands PACKv2 with delta referring - to its base by position in pack rather than by an oid. That is, - they can read OBJ_OFS_DELTA (aka type 6) in a packfile. - -If the 'shallow' feature is advertised the following arguments can be -included in the clients request as well as the potential addition of the -'shallow-info' section in the server's response as explained below. - - shallow <oid> - A client must notify the server of all commits for which it only - has shallow copies (meaning that it doesn't have the parents of - a commit) by supplying a 'shallow <oid>' line for each such - object so that the server is aware of the limitations of the - client's history. This is so that the server is aware that the - client may not have all objects reachable from such commits. - - deepen <depth> - Requests that the fetch/clone should be shallow having a commit - depth of <depth> relative to the remote side. - - deepen-relative - Requests that the semantics of the "deepen" command be changed - to indicate that the depth requested is relative to the client's - current shallow boundary, instead of relative to the requested - commits. 
- - deepen-since <timestamp> - Requests that the shallow clone/fetch should be cut at a - specific time, instead of depth. Internally it's equivalent to - doing "git rev-list --max-age=<timestamp>". Cannot be used with - "deepen". - - deepen-not <rev> - Requests that the shallow clone/fetch should be cut at a - specific revision specified by '<rev>', instead of a depth. - Internally it's equivalent of doing "git rev-list --not <rev>". - Cannot be used with "deepen", but can be used with - "deepen-since". - -If the 'filter' feature is advertised, the following argument can be -included in the client's request: - - filter <filter-spec> - Request that various objects from the packfile be omitted - using one of several filtering techniques. These are intended - for use with partial clone and partial fetch operations. See - `rev-list` for possible "filter-spec" values. When communicating - with other processes, senders SHOULD translate scaled integers - (e.g. "1k") into a fully-expanded form (e.g. "1024") to aid - interoperability with older receivers that may not understand - newly-invented scaling suffixes. However, receivers SHOULD - accept the following suffixes: 'k', 'm', and 'g' for 1024, - 1048576, and 1073741824, respectively. - -If the 'ref-in-want' feature is advertised, the following argument can -be included in the client's request as well as the potential addition of -the 'wanted-refs' section in the server's response as explained below. - - want-ref <ref> - Indicates to the server that the client wants to retrieve a - particular ref, where <ref> is the full name of a ref on the - server. - -If the 'sideband-all' feature is advertised, the following argument can be -included in the client's request: - - sideband-all - Instruct the server to send the whole response multiplexed, not just - the packfile section. 
All non-flush and non-delim PKT-LINE in the - response (not only in the packfile section) will then start with a byte - indicating its sideband (1, 2, or 3), and the server may send "0005\2" - (a PKT-LINE of sideband 2 with no payload) as a keepalive packet. - -If the 'packfile-uris' feature is advertised, the following argument -can be included in the client's request as well as the potential -addition of the 'packfile-uris' section in the server's response as -explained below. - - packfile-uris <comma-separated list of protocols> - Indicates to the server that the client is willing to receive - URIs of any of the given protocols in place of objects in the - sent packfile. Before performing the connectivity check, the - client should download from all given URIs. Currently, the - protocols supported are "http" and "https". - -The response of `fetch` is broken into a number of sections separated by -delimiter packets (0001), with each section beginning with its section -header. Most sections are sent only when the packfile is sent. 
- - output = acknowledgements flush-pkt | - [acknowledgments delim-pkt] [shallow-info delim-pkt] - [wanted-refs delim-pkt] [packfile-uris delim-pkt] - packfile flush-pkt - - acknowledgments = PKT-LINE("acknowledgments" LF) - (nak | *ack) - (ready) - ready = PKT-LINE("ready" LF) - nak = PKT-LINE("NAK" LF) - ack = PKT-LINE("ACK" SP obj-id LF) - - shallow-info = PKT-LINE("shallow-info" LF) - *PKT-LINE((shallow | unshallow) LF) - shallow = "shallow" SP obj-id - unshallow = "unshallow" SP obj-id - - wanted-refs = PKT-LINE("wanted-refs" LF) - *PKT-LINE(wanted-ref LF) - wanted-ref = obj-id SP refname - - packfile-uris = PKT-LINE("packfile-uris" LF) *packfile-uri - packfile-uri = PKT-LINE(40*(HEXDIGIT) SP *%x20-ff LF) - - packfile = PKT-LINE("packfile" LF) - *PKT-LINE(%x01-03 *%x00-ff) - - acknowledgments section - * If the client determines that it is finished with negotiations by - sending a "done" line (thus requiring the server to send a packfile), - the acknowledgments sections MUST be omitted from the server's - response. - - * Always begins with the section header "acknowledgments" - - * The server will respond with "NAK" if none of the object ids sent - as have lines were common. - - * The server will respond with "ACK obj-id" for all of the - object ids sent as have lines which are common. - - * A response cannot have both "ACK" lines as well as a "NAK" - line. - - * The server will respond with a "ready" line indicating that - the server has found an acceptable common base and is ready to - make and send a packfile (which will be found in the packfile - section of the same response) - - * If the server has found a suitable cut point and has decided - to send a "ready" line, then the server can decide to (as an - optimization) omit any "ACK" lines it would have sent during - its response. This is because the server will have already - determined the objects it plans to send to the client and no - further negotiation is needed. 
- - shallow-info section - * If the client has requested a shallow fetch/clone, a shallow - client requests a fetch or the server is shallow then the - server's response may include a shallow-info section. The - shallow-info section will be included if (due to one of the - above conditions) the server needs to inform the client of any - shallow boundaries or adjustments to the clients already - existing shallow boundaries. - - * Always begins with the section header "shallow-info" - - * If a positive depth is requested, the server will compute the - set of commits which are no deeper than the desired depth. - - * The server sends a "shallow obj-id" line for each commit whose - parents will not be sent in the following packfile. - - * The server sends an "unshallow obj-id" line for each commit - which the client has indicated is shallow, but is no longer - shallow as a result of the fetch (due to its parents being - sent in the following packfile). - - * The server MUST NOT send any "unshallow" lines for anything - which the client has not indicated was shallow as a part of - its request. - - wanted-refs section - * This section is only included if the client has requested a - ref using a 'want-ref' line and if a packfile section is also - included in the response. - - * Always begins with the section header "wanted-refs". - - * The server will send a ref listing ("<oid> <refname>") for - each reference requested using 'want-ref' lines. - - * The server MUST NOT send any refs which were not requested - using 'want-ref' lines. - - packfile-uris section - * This section is only included if the client sent - 'packfile-uris' and the server has at least one such URI to - send. - - * Always begins with the section header "packfile-uris". - - * For each URI the server sends, it sends a hash of the pack's - contents (as output by git index-pack) followed by the URI. - - * The hashes are 40 hex characters long. 
When Git upgrades to a new - hash algorithm, this might need to be updated. (It should match - whatever index-pack outputs after "pack\t" or "keep\t".) - - packfile section - * This section is only included if the client has sent 'want' - lines in its request and either requested that no more - negotiation be done by sending 'done' or if the server has - decided it has found a sufficient cut point to produce a - packfile. - - * Always begins with the section header "packfile" - - * The transmission of the packfile begins immediately after the - section header - - * The data transfer of the packfile is always multiplexed, using - the same semantics of the 'side-band-64k' capability from - protocol version 1. This means that each packet, during the - packfile data stream, is made up of a leading 4-byte pkt-line - length (typical of the pkt-line format), followed by a 1-byte - stream code, followed by the actual data. - - The stream code can be one of: - 1 - pack data - 2 - progress messages - 3 - fatal error message just before stream aborts - -server-option -~~~~~~~~~~~~~ - -If advertised, indicates that any number of server specific options can be -included in a request. This is done by sending each option as a -"server-option=<option>" capability line in the capability-list section of -a request. - -The provided options must not contain a NUL or LF character. - -object-format -~~~~~~~~~~~~~~~ - -The server can advertise the `object-format` capability with a value `X` (in the -form `object-format=X`) to notify the client that the server is able to deal -with objects using hash algorithm X. If not specified, the server is assumed to -only handle SHA-1. If the client would like to use a hash algorithm other than -SHA-1, it should specify its object-format string.
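The stream codes listed for the packfile section of protocol-v2.txt can be illustrated with a small demultiplexer. This is a sketch under assumptions: `demux_sideband` is a hypothetical name, and the input is taken to be the payloads of the pkt-lines that follow the "packfile" section header, with the 4-byte length already stripped.

```python
def demux_sideband(payloads):
    """Split sideband-multiplexed payloads into pack data and progress.

    Each payload starts with a 1-byte stream code:
    1 = pack data, 2 = progress messages, 3 = fatal error.
    """
    pack = bytearray()
    progress = []
    for payload in payloads:
        if not payload:
            raise ValueError("missing stream code")
        code, data = payload[0], payload[1:]
        if code == 1:
            pack += data
        elif code == 2:
            progress.append(data)
        elif code == 3:
            # The stream aborts right after a fatal error message.
            raise RuntimeError(data.decode(errors="replace"))
        else:
            raise ValueError("unknown stream code %d" % code)
    return bytes(pack), progress
```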
diff --git a/third_party/git/Documentation/technical/racy-git.txt b/third_party/git/Documentation/technical/racy-git.txt deleted file mode 100644 index ceda4bbfda4d..000000000000 --- a/third_party/git/Documentation/technical/racy-git.txt +++ /dev/null @@ -1,201 +0,0 @@ -Use of index and Racy Git problem -================================= - -Background ----------- - -The index is one of the most important data structures in Git. -It represents a virtual working tree state by recording list of -paths and their object names and serves as a staging area to -write out the next tree object to be committed. The state is -"virtual" in the sense that it does not necessarily have to, and -often does not, match the files in the working tree. - -There are cases Git needs to examine the differences between the -virtual working tree state in the index and the files in the -working tree. The most obvious case is when the user asks `git -diff` (or its low level implementation, `git diff-files`) or -`git-ls-files --modified`. In addition, Git internally checks -if the files in the working tree are different from what are -recorded in the index to avoid stomping on local changes in them -during patch application, switching branches, and merging. - -In order to speed up this comparison between the files in the -working tree and the index entries, the index entries record the -information obtained from the filesystem via `lstat(2)` system -call when they were last updated. When checking if they differ, -Git first runs `lstat(2)` on the files and compares the result -with this information (this is what was originally done by the -`ce_match_stat()` function, but the current code does it in -`ce_match_stat_basic()` function). If some of these "cached -stat information" fields do not match, Git can tell that the -files are modified without even looking at their contents. - -Note: not all members in `struct stat` obtained via `lstat(2)` -are used for this comparison. 
For example, `st_atime` obviously -is not useful. Currently, Git compares the file type (regular -files vs symbolic links) and executable bits (only for regular -files) from `st_mode` member, `st_mtime` and `st_ctime` -timestamps, `st_uid`, `st_gid`, `st_ino`, and `st_size` members. -With a `USE_STDEV` compile-time option, `st_dev` is also -compared, but this is not enabled by default because this member -is not stable on network filesystems. With `USE_NSEC` -compile-time option, `st_mtim.tv_nsec` and `st_ctim.tv_nsec` -members are also compared. On Linux, this is not enabled by default -because in-core timestamps can have finer granularity than -on-disk timestamps, resulting in meaningless changes when an -inode is evicted from the inode cache. See commit 8ce13b0 -of git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git -([PATCH] Sync in core time granularity with filesystems, -2005-01-04). This patch is included in kernel 2.6.11 and newer, but -only fixes the issue for file systems with exactly 1 ns or 1 s -resolution. Other file systems are still broken in current Linux -kernels (e.g. CEPH, CIFS, NTFS, UDF), see -https://lore.kernel.org/lkml/5577240D.7020309@gmail.com/ - -Racy Git --------- - -There is one slight problem with the optimization based on the -cached stat information. Consider this sequence: - - : modify 'foo' - $ git update-index 'foo' - : modify 'foo' again, in-place, without changing its size - -The first `update-index` computes the object name of the -contents of file `foo` and updates the index entry for `foo` -along with the `struct stat` information. If the modification -that follows it happens very fast so that the file's `st_mtime` -timestamp does not change, after this sequence, the cached stat -information the index entry records still exactly match what you -would see in the filesystem, even though the file `foo` is now -different. 
-This way, Git can incorrectly think files in the working tree -are unmodified even though they actually are. This is called -the "racy Git" problem (discovered by Pasky), and the entries -that appear clean when they may not be because of this problem -are called "racily clean". - -To avoid this problem, Git does two things: - -. When the cached stat information says the file has not been - modified, and the `st_mtime` is the same as (or newer than) - the timestamp of the index file itself (which is the time `git - update-index foo` finished running in the above example), it - also compares the contents with the object registered in the - index entry to make sure they match. - -. When the index file is updated that contains racily clean - entries, cached `st_size` information is truncated to zero - before writing a new version of the index file. - -Because the index file itself is written after collecting all -the stat information from updated paths, `st_mtime` timestamp of -it is usually the same as or newer than any of the paths the -index contains. And no matter how quick the modification that -follows `git update-index foo` finishes, the resulting -`st_mtime` timestamp on `foo` cannot get a value earlier -than the index file. Therefore, index entries that can be -racily clean are limited to the ones that have the same -timestamp as the index file itself. - -The callers that want to check if an index entry matches the -corresponding file in the working tree continue to call -`ce_match_stat()`, but with this change, `ce_match_stat()` uses -`ce_modified_check_fs()` to see if racily clean ones are -actually clean after comparing the cached stat information using -`ce_match_stat_basic()`. 
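The first of the two measures above can be sketched as a three-way classification. This is a deliberately simplified model, not Git's code: `CachedStat` and `classify_entry` are hypothetical names, and only `st_mtime` and `st_size` are modelled out of the fields Git actually compares.

```python
from dataclasses import dataclass


@dataclass
class CachedStat:
    """Simplified stand-in for the stat fields cached in an index entry."""
    mtime: int
    size: int


def classify_entry(cached: CachedStat, on_disk: CachedStat, index_mtime: int) -> str:
    """Decide how far the cached stat information can be trusted."""
    if cached != on_disk:
        # The stat data differs: modified, no need to read file contents.
        return "modified"
    if cached.mtime >= index_mtime:
        # Same as or newer than the index file: the stat data cannot be
        # trusted, so the contents must be compared with the indexed object.
        return "racily-clean"
    return "clean"
```

Only entries classified as "racily-clean" need the expensive content comparison; everything else is settled from the stat data alone.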
- -The problem the latter solves is this sequence: - - $ git update-index 'foo' - : modify 'foo' in-place without changing its size - : wait for enough time - $ git update-index 'bar' - -Without the latter, the timestamp of the index file gets a newer -value, and falsely clean entry `foo` would not be caught by the -timestamp comparison check done with the former logic anymore. -The latter makes sure that the cached stat information for `foo` -would never match with the file in the working tree, so later -checks by `ce_match_stat_basic()` would report that the index entry -does not match the file and Git does not have to fall back on more -expensive `ce_modified_check_fs()`. - - -Runtime penalty ---------------- - -The runtime penalty of falling back to `ce_modified_check_fs()` -from `ce_match_stat()` can be very expensive when there are many -racily clean entries. An obvious way to artificially create -this situation is to give the same timestamp to all the files in -the working tree in a large project, run `git update-index` on -them, and give the same timestamp to the index file: - - $ date >.datestamp - $ git ls-files | xargs touch -r .datestamp - $ git ls-files | git update-index --stdin - $ touch -r .datestamp .git/index - -This will make all index entries racily clean. In the linux project, for -example, there are over 20,000 files in the working tree.
On my -Athlon 64 X2 3800+, after the above: - - $ /usr/bin/time git diff-files - 1.68user 0.54system 0:02.22elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k - 0inputs+0outputs (0major+67111minor)pagefaults 0swaps - $ git update-index MAINTAINERS - $ /usr/bin/time git diff-files - 0.02user 0.12system 0:00.14elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k - 0inputs+0outputs (0major+935minor)pagefaults 0swaps - -Running `git update-index` in the middle checked the racily -clean entries, and left the cached `st_mtime` for all the paths -intact because they were actually clean (so this step took about -the same amount of time as the first `git diff-files`). After -that, they are not racily clean anymore but are truly clean, so -the second invocation of `git diff-files` fully took advantage -of the cached stat information. - - -Avoiding runtime penalty ------------------------- - -In order to avoid the above runtime penalty, post 1.4.2 Git used -to have code that made sure the index file -got a timestamp newer than the youngest files in the index when -there are many young files with the same timestamp as the -resulting index file would otherwise have, by waiting -before finishing writing the index file out. - -I suspected that in practice the situation where many paths in the -index are all racily clean was quite rare. The only code paths -that can record a recent timestamp for a large number of paths are: - -. Initial `git add .` of a large project. - -. `git checkout` of a large project from an empty index into an - unpopulated working tree. - -Note: switching branches with `git checkout` keeps the cached -stat information of existing working tree files that are the -same between the current branch and the new branch, which are -all older than the resulting index file, and they will not -become racily clean. Only the files that are actually checked -out can become racily clean.
- -In a large project where raciness avoidance cost really matters, -however, the initial computation of all object names in the -index takes more than one second, and the index file is written -out after all that happens. Therefore the timestamp of the -index file will be more than one second later than the -youngest file in the working tree. This means that in these -cases there actually will not be any racily clean entry in -the resulting index. - -Based on this discussion, the current code does not use the -"workaround" to avoid the runtime penalty that does not exist in -practice anymore. This was done with commit 0fc82cff on Aug 15, -2006. diff --git a/third_party/git/Documentation/technical/reftable.txt b/third_party/git/Documentation/technical/reftable.txt deleted file mode 100644 index 2951840e9c9b..000000000000 --- a/third_party/git/Documentation/technical/reftable.txt +++ /dev/null @@ -1,1083 +0,0 @@ -reftable --------- - -Overview -~~~~~~~~ - -Problem statement -^^^^^^^^^^^^^^^^^ - -Some repositories contain a lot of references (e.g. android at 866k, -rails at 31k). The existing packed-refs format takes up a lot of space -(e.g. 62M), and does not scale with additional references. Lookup of a -single reference requires linearly scanning the file. - -Atomic pushes modifying multiple references require copying the entire -packed-refs file, which can be a considerable amount of data moved -(e.g. 62M in, 62M out) for even small transactions (2 refs modified). - -Repositories with many loose references occupy a large number of disk -blocks from the local file system, as each reference is its own file -storing 41 bytes (and another file for the corresponding reflog). This -negatively affects the number of inodes available when a large number of -repositories are stored on the same filesystem. Readers can be penalized -due to the larger number of syscalls required to traverse and read the -`$GIT_DIR/refs` directory.
- - -Objectives -^^^^^^^^^^ - -* Near constant time lookup for any single reference, even when the -repository is cold and not in process or kernel cache. -* Near constant time verification if an object name is referred to by at least -one reference (for allow-tip-sha1-in-want). -* Efficient enumeration of an entire namespace, such as `refs/tags/`. -* Support atomic push with `O(size_of_update)` operations. -* Combine reflog storage with ref storage for small transactions. -* Separate reflog storage for base refs and historical logs. - -Description -^^^^^^^^^^^ - -A reftable file is a portable binary file format customized for -reference storage. References are sorted, enabling linear scans, binary -search lookup, and range scans. - -Storage in the file is organized into variable sized blocks. Prefix -compression is used within a single block to reduce disk space. Block -size and alignment is tunable by the writer. - -Performance -^^^^^^^^^^^ - -Space used, packed-refs vs. reftable: - -[cols=",>,>,>,>,>",options="header",] -|=============================================================== -|repository |packed-refs |reftable |% original |avg ref |avg obj -|android |62.2 M |36.1 M |58.0% |33 bytes |5 bytes -|rails |1.8 M |1.1 M |57.7% |29 bytes |4 bytes -|git |78.7 K |48.1 K |61.0% |50 bytes |4 bytes -|git (heads) |332 b |269 b |81.0% |33 bytes |0 bytes -|=============================================================== - -Scan (read 866k refs), by reference name lookup (single ref from 866k -refs), and by SHA-1 lookup (refs with that SHA-1, from 866k refs): - -[cols=",>,>,>,>",options="header",] -|========================================================= -|format |cache |scan |by name |by SHA-1 -|packed-refs |cold |402 ms |409,660.1 usec |412,535.8 usec -|packed-refs |hot | |6,844.6 usec |20,110.1 usec -|reftable |cold |112 ms |33.9 usec |323.2 usec -|reftable |hot | |20.2 usec |320.8 usec -|========================================================= - -Space used for 
149,932 log entries for 43,061 refs, reflog vs. reftable: - -[cols=",>,>",options="header",] -|================================ -|format |size |avg entry -|$GIT_DIR/logs |173 M |1209 bytes -|reftable |5 M |37 bytes -|================================ - -Details -~~~~~~~ - -Peeling -^^^^^^^ - -References stored in a reftable are peeled, a record for an annotated -(or signed) tag records both the tag object, and the object it refers -to. This is analogous to storage in the packed-refs format. - -Reference name encoding -^^^^^^^^^^^^^^^^^^^^^^^ - -Reference names are an uninterpreted sequence of bytes that must pass -linkgit:git-check-ref-format[1] as a valid reference name. - -Key unicity -^^^^^^^^^^^ - -Each entry must have a unique key; repeated keys are disallowed. - -Network byte order -^^^^^^^^^^^^^^^^^^ - -All multi-byte, fixed width fields are in network byte order. - -Varint encoding -^^^^^^^^^^^^^^^ - -Varint encoding is identical to the ofs-delta encoding method used -within pack files. - -Decoder works such as: - -.... -val = buf[ptr] & 0x7f -while (buf[ptr] & 0x80) { - ptr++ - val = ((val + 1) << 7) | (buf[ptr] & 0x7f) -} -.... - -Ordering -^^^^^^^^ - -Blocks are lexicographically ordered by their first reference. - -Directory/file conflicts -^^^^^^^^^^^^^^^^^^^^^^^^ - -The reftable format accepts both `refs/heads/foo` and -`refs/heads/foo/bar` as distinct references. - -This property is useful for retaining log records in reftable, but may -confuse versions of Git using `$GIT_DIR/refs` directory tree to maintain -references. Users of reftable may choose to continue to reject `foo` and -`foo/bar` type conflicts to prevent problems for peers. - -File format -~~~~~~~~~~~ - -Structure -^^^^^^^^^ - -A reftable file has the following high-level structure: - -.... -first_block { - header - first_ref_block -} -ref_block* -ref_index* -obj_block* -obj_index* -log_block* -log_index* -footer -.... 
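As a runnable companion to the decoder pseudocode under "Varint encoding" above, the following is a minimal Python sketch. The function names `decode_varint` and `encode_varint` are illustrative and do not come from the specification; the encoder is not given in the text and is derived here simply by inverting the decoder, so treat it as an assumption.

```python
def decode_varint(buf, ptr=0):
    """Decode one varint starting at buf[ptr].

    Mirrors the pseudocode: the high bit of each byte marks continuation,
    and 1 is added before each shift (the ofs-delta style encoding).
    Returns (value, position of the first byte after the varint).
    """
    val = buf[ptr] & 0x7F
    while buf[ptr] & 0x80:
        ptr += 1
        val = ((val + 1) << 7) | (buf[ptr] & 0x7F)
    return val, ptr + 1


def encode_varint(val):
    """Inverse of decode_varint (derived, not quoted from the spec)."""
    out = [val & 0x7F]          # last byte has no continuation bit
    val >>= 7
    while val:
        val -= 1                # undo the "+ 1" applied by the decoder
        out.append(0x80 | (val & 0x7F))
        val >>= 7
    return bytes(reversed(out))
```

Because of the `+ 1` step, every value has exactly one encoding, which is why `128` takes the two bytes `0x80 0x00` rather than `0x81 0x00`.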
- -A log-only file omits the `ref_block`, `ref_index`, `obj_block` and -`obj_index` sections, containing only the file header and log block: - -.... -first_block { - header -} -log_block* -log_index* -footer -.... - -in a log-only file the first log block immediately follows the file -header, without padding to block alignment. - -Block size -^^^^^^^^^^ - -The file's block size is arbitrarily determined by the writer, and does -not have to be a power of 2. The block size must be larger than the -longest reference name or log entry used in the repository, as -references cannot span blocks. - -Powers of two that are friendly to the virtual memory system or -filesystem (such as 4k or 8k) are recommended. Larger sizes (64k) can -yield better compression, with a possible increased cost incurred by -readers during access. - -The largest block size is `16777215` bytes (15.99 MiB). - -Block alignment -^^^^^^^^^^^^^^^ - -Writers may choose to align blocks at multiples of the block size by -including `padding` filled with NUL bytes at the end of a block to round -out to the chosen alignment. When alignment is used, writers must -specify the alignment with the file header's `block_size` field. - -Block alignment is not required by the file format. Unaligned files must -set `block_size = 0` in the file header, and omit `padding`. Unaligned -files with more than one ref block must include the link:#Ref-index[ref -index] to support fast lookup. Readers must be able to read both aligned -and non-aligned files. - -Very small files (e.g. a single ref block) may omit `padding` and the ref -index to reduce total file size. - -Header (version 1) -^^^^^^^^^^^^^^^^^^ - -A 24-byte header appears at the beginning of the file: - -.... -'REFT' -uint8( version_number = 1 ) -uint24( block_size ) -uint64( min_update_index ) -uint64( max_update_index ) -.... - -Aligned files must specify `block_size` to configure readers with the -expected block alignment. 
Unaligned files must set `block_size = 0`. - -The `min_update_index` and `max_update_index` describe bounds for the -`update_index` field of all log records in this file. When reftables are -used in a stack for link:#Update-transactions[transactions], these -fields can order the files such that the prior file's -`max_update_index + 1` is the next file's `min_update_index`. - -Header (version 2) -^^^^^^^^^^^^^^^^^^ - -A 28-byte header appears at the beginning of the file: - -.... -'REFT' -uint8( version_number = 2 ) -uint24( block_size ) -uint64( min_update_index ) -uint64( max_update_index ) -uint32( hash_id ) -.... - -The header is identical to `version_number=1`, with the 4-byte hash ID -("sha1" for SHA-1 and "s256" for SHA-256) appended to the header. - -For maximum backward compatibility, it is recommended to use version 1 when -writing SHA-1 reftables. - -First ref block -^^^^^^^^^^^^^^^ - -The first ref block shares the same block as the file header, and is 24 -bytes smaller than all other blocks in the file. The first block -immediately begins after the file header, at position 24. - -If the first block is a log block (a log-only file), its block header -begins immediately at position 24. - -Ref block format -^^^^^^^^^^^^^^^^ - -A ref block is written as: - -.... -'r' -uint24( block_len ) -ref_record+ -uint24( restart_offset )+ -uint16( restart_count ) - -padding? -.... - -Blocks begin with `block_type = 'r'` and a 3-byte `block_len` which -encodes the number of bytes in the block up to, but not including, the -optional `padding`. This is always less than or equal to the file's -block size. In the first ref block, `block_len` includes 24 bytes for -the file header. - -The 2-byte `restart_count` stores the number of entries in the -`restart_offset` list, which must not be empty. Readers can use -`restart_count` to binary search between restarts before starting a -linear scan. - -Exactly `restart_count` 3-byte `restart_offset` values precede the -`restart_count`. 
Offsets are relative to the start of the block and -refer to the first byte of any `ref_record` whose name has not been -prefix compressed. Entries in the `restart_offset` list must be sorted, -ascending. Readers can start linear scans from any of these records. - -A variable number of `ref_record` fill the middle of the block, -describing reference names and values. The format is described below. - -As the first ref block shares the first file block with the file header, -all `restart_offset` in the first block are relative to the start of the -file (position 0), and include the file header. This forces the first -`restart_offset` to be `28`. - -ref record -++++++++++ - -A `ref_record` describes a single reference, storing both the name and -its value(s). Records are formatted as: - -.... -varint( prefix_length ) -varint( (suffix_length << 3) | value_type ) -suffix -varint( update_index_delta ) -value? -.... - -The `prefix_length` field specifies how many leading bytes of the prior -reference record's name should be copied to obtain this reference's -name. This must be 0 for the first reference in any block, and also must -be 0 for any `ref_record` whose offset is listed in the `restart_offset` -table at the end of the block. - -Recovering a reference name from any `ref_record` is a simple concat: - -.... -this_name = prior_name[0..prefix_length] + suffix -.... - -The `suffix_length` value provides the number of bytes available in -`suffix` to copy from `suffix` to complete the reference name. - -The `update_index` that last modified the reference can be obtained by -adding `update_index_delta` to the `min_update_index` from the file -header: `min_update_index + update_index_delta`. - -The `value` follows. 
Its format is determined by `value_type`, one of -the following: - -* `0x0`: deletion; no value data (see transactions, below) -* `0x1`: one object name; value of the ref -* `0x2`: two object names; value of the ref, peeled target -* `0x3`: symbolic reference: `varint( target_len ) target` - -Symbolic references use `0x3`, followed by the complete name of the -reference target. No compression is applied to the target name. - -Types `0x4..0x7` are reserved for future use. - -Ref index -^^^^^^^^^ - -The ref index stores the name of the last reference from every ref block -in the file, enabling reduced disk seeks for lookups. Any reference can -be found by searching the index, identifying the containing block, and -searching within that block. - -The index may be organized into a multi-level index, where the 1st level -index block points to additional ref index blocks (2nd level), which may -in turn point to either additional index blocks (e.g. 3rd level) or ref -blocks (leaf level). Disk reads required to access a ref go up with -higher index levels. Multi-level indexes may be required to ensure no -single index block exceeds the file format's max block size of -`16777215` bytes (15.99 MiB). To achieve constant O(1) disk seeks for -lookups the index must be a single level, which is permitted to exceed -the file's configured block size, but not the format's max block size of -15.99 MiB. - -If present, the ref index block(s) appears after the last ref block. - -If there are at least 4 ref blocks, a ref index block should be written -to improve lookup times. Cold reads using the index require 2 disk reads -(read index, read block), and binary searching < 4 blocks also requires -<= 2 reads. Omitting the index block from smaller files saves space. - -If the file is unaligned and contains more than one ref block, the ref -index must be written. - -Index block format: - -.... 
-'i' -uint24( block_len ) -index_record+ -uint24( restart_offset )+ -uint16( restart_count ) - -padding? -.... - -The index blocks begin with `block_type = 'i'` and a 3-byte `block_len` -which encodes the number of bytes in the block, up to but not including -the optional `padding`. - -The `restart_offset` and `restart_count` fields are identical in format, -meaning and usage as in ref blocks. - -To reduce the number of reads required for random access in very large -files the index block may be larger than other blocks. However, readers -must hold the entire index in memory to benefit from this, so it's a -time-space tradeoff in both file size and reader memory. - -Increasing the file's block size decreases the index size. Alternatively -a multi-level index may be used, keeping index blocks within the file's -block size, but increasing the number of blocks that need to be -accessed. - -index record -++++++++++++ - -An index record describes the last entry in another block. Index records -are written as: - -.... -varint( prefix_length ) -varint( (suffix_length << 3) | 0 ) -suffix -varint( block_position ) -.... - -Index records use prefix compression exactly like `ref_record`. - -Index records store `block_position` after the suffix, specifying the -absolute position in bytes (from the start of the file) of the block -that ends with this reference. Readers can seek to `block_position` to -begin reading the block header. - -Readers must examine the block header at `block_position` to determine -if the next block is another level index block, or the leaf-level ref -block. - -Reading the index -+++++++++++++++++ - -Readers loading the ref index must first read the footer (below) to -obtain `ref_index_position`. If not present, the position will be 0. The -`ref_index_position` is for the 1st level root of the ref index. - -Obj block format -^^^^^^^^^^^^^^^^ - -Object blocks are optional. 
Writers may choose to omit object blocks, -especially if readers will not use the object name to ref mapping. - -Object blocks use unique, abbreviated 2-32 byte object name keys, mapping to -ref blocks containing references pointing to that object directly, or as -the peeled value of an annotated tag. Like ref blocks, object blocks use -the file's standard block size. The abbreviation length is available in -the footer as `obj_id_len`. - -To save space in small files, object blocks may be omitted if the ref -index is not present, as brute force search will only need to read a few -ref blocks. When missing, readers should brute force a linear search of -all references to look up by object name. - -An object block is written as: - -.... -'o' -uint24( block_len ) -obj_record+ -uint24( restart_offset )+ -uint16( restart_count ) - -padding? -.... - -Fields are identical to ref block. Binary search using the restart table -works the same as in reference blocks. - -Because object names are abbreviated by writers to the shortest unique -abbreviation within the reftable, obj keys have a variable length. Their -length must be at least 2 bytes. Readers must compare only for common prefix -match within an obj block or obj index. - -obj record -++++++++++ - -An `obj_record` describes a single object abbreviation, and the blocks -containing references using that unique abbreviation: - -.... -varint( prefix_length ) -varint( (suffix_length << 3) | cnt_3 ) -suffix -varint( cnt_large )? -varint( position_delta )* -.... - -Like in reference blocks, abbreviations are prefix compressed within an -obj block. On large reftables with many unique objects, higher block -sizes (64k), and higher restart interval (128), a `prefix_length` of 2 -or 3 and `suffix_length` of 3 may be common in obj records (unique -abbreviation of 5-6 raw bytes, 10-12 hex digits). - -Each record contains `position_count` number of positions for matching -ref blocks. 
For 1-7 positions the count is stored in `cnt_3`. When -`cnt_3 = 0` the actual count follows in a varint, `cnt_large`. - -The use of `cnt_3` bets most objects are pointed to by only a single -reference, some may be pointed to by a couple of references, and very -few (if any) are pointed to by more than 7 references. - -A special case exists when `cnt_3 = 0` and `cnt_large = 0`: there are no -`position_delta`, but at least one reference starts with this -abbreviation. A reader that needs exact reference names must scan all -references to find which specific references have the desired object. -Writers should use this format when the `position_delta` list would have -overflowed the file's block size due to a high number of references -pointing to the same object. - -The first `position_delta` is the position from the start of the file. -Additional `position_delta` entries are sorted ascending and relative to -the prior entry, e.g. a reader would perform: - -.... -pos = position_delta[0] -prior = pos -for (j = 1; j < position_count; j++) { - pos = prior + position_delta[j] - prior = pos -} -.... - -With a position in hand, a reader must linearly scan the ref block, -starting from the first `ref_record`, testing each reference's object names -(for `value_type = 0x1` or `0x2`) for full equality. Faster searching by -object name within a single ref block is not supported by the reftable format. -Smaller block sizes reduce the number of candidates this step must -consider. - -Obj index -^^^^^^^^^ - -The obj index stores the abbreviation from the last entry for every obj -block in the file, enabling reduced disk seeks for all lookups. It is -formatted exactly the same as the ref index, but refers to obj blocks. - -The obj index should be present if obj blocks are present, as obj blocks -should only be written in larger files. - -Readers loading the obj index must first read the footer (below) to -obtain `obj_index_position`. If not present, the position will be 0. 
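The `cnt_3` / `cnt_large` rule and the `position_delta` loop from the obj record section above can be sketched in Python as follows. The helper names (`position_count`, `decode_positions`, `encode_positions`) are illustrative assumptions, and the inputs are already-decoded integers rather than raw varint bytes.

```python
from itertools import accumulate


def position_count(cnt_3, cnt_large=0):
    """Number of positions in an obj record.

    cnt_3 is the low 3 bits of the second varint: 1-7 is the count itself,
    0 means the real count follows in the cnt_large varint (which may
    itself be 0, the "too many references" special case).
    """
    if not 0 <= cnt_3 <= 7:
        raise ValueError("cnt_3 is a 3-bit field")
    return cnt_3 if cnt_3 != 0 else cnt_large


def decode_positions(position_deltas):
    """First delta is absolute; each later delta is relative to the prior
    position, so the absolute positions are a running sum."""
    return list(accumulate(position_deltas))


def encode_positions(positions):
    """Writer-side inverse: sort ascending, then store differences."""
    deltas, prior = [], 0
    for pos in sorted(positions):
        deltas.append(pos - prior)
        prior = pos
    return deltas
```

For example, the deltas `[4096, 4096, 8192]` decode to the ref block positions `[4096, 8192, 16384]`.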
- -Log block format -^^^^^^^^^^^^^^^^ - -Unlike ref and obj blocks, log blocks are always unaligned. - -Log blocks are variable in size, and do not match the `block_size` -specified in the file header or footer. Writers should choose an -appropriate buffer size to prepare a log block for deflation, such as -`2 * block_size`. - -A log block is written as: - -.... -'g' -uint24( block_len ) -zlib_deflate { - log_record+ - uint24( restart_offset )+ - uint16( restart_count ) -} -.... - -Log blocks look similar to ref blocks, except `block_type = 'g'`. - -The 4-byte block header is followed by the deflated block contents using -zlib deflate. The `block_len` in the header is the inflated size -(including 4-byte block header), and should be used by readers to -preallocate the inflation output buffer. A log block's `block_len` may -exceed the file's block size. - -Offsets within the log block (e.g. `restart_offset`) still include the -4-byte header. Readers may prefer prefixing the inflation output buffer -with the 4-byte header. - -Within the deflate container, a variable number of `log_record` describe -reference changes. The log record format is described below. See ref -block format (above) for a description of `restart_offset` and -`restart_count`. - -Because log blocks have no alignment or padding between blocks, readers -must keep track of the bytes consumed by the inflater to know where the -next log block begins. - -log record -++++++++++ - -Log record keys are structured as: - -.... -ref_name '\0' reverse_int64( update_index ) -.... - -where `update_index` is the unique transaction identifier. The -`update_index` field must be unique within the scope of a `ref_name`. -See the update transactions section below for further details. - -The `reverse_int64` function inverses the value so lexicographical -ordering the network byte order encoding sorts the more recent records -with higher `update_index` values first: - -.... 
-reverse_int64(int64 t) { - return 0xffffffffffffffff - t; -} -.... - -Log records have a similar starting structure to ref and index records, -utilizing the same prefix compression scheme applied to the log record -key described above. - -.... - varint( prefix_length ) - varint( (suffix_length << 3) | log_type ) - suffix - log_data { - old_id - new_id - varint( name_length ) name - varint( email_length ) email - varint( time_seconds ) - sint16( tz_offset ) - varint( message_length ) message - }? -.... - -Log record entries use `log_type` to indicate what follows: - -* `0x0`: deletion; no log data. -* `0x1`: standard git reflog data using `log_data` above. - -The `log_type = 0x0` is mostly useful for `git stash drop`, removing an -entry from the reflog of `refs/stash` in a transaction file (below), -without needing to rewrite larger files. Readers reading a stack of -reflogs must treat this as a deletion. - -For `log_type = 0x1`, the `log_data` section follows -linkgit:git-update-ref[1] logging and includes: - -* two object names (old id, new id) -* varint string of committer's name -* varint string of committer's email -* varint time in seconds since epoch (Jan 1, 1970) -* 2-byte timezone offset in minutes (signed) -* varint string of message - -`tz_offset` is the absolute number of minutes from GMT the committer was -at the time of the update. For example `GMT-0800` is encoded in reftable -as `sint16(-480)` and `GMT+0230` is `sint16(150)`. - -The committer email does not contain `<` or `>`, it's the value normally -found between the `<>` in a git commit object header. - -The `message_length` may be 0, in which case there was no message -supplied for the update. - -Contrary to traditional reflog (which is a file), renames are encoded as -a combination of ref deletion and ref creation. A deletion is a log -record with a zero new_id, and a creation is a log record with a zero old_id. 
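To make the log record key and the `tz_offset` examples above concrete, here is a small Python sketch. The function names are illustrative, and `tz_offset_minutes` assumes its input is a `+HHMM` / `-HHMM` string as used in git commit headers; neither helper is defined by the specification.

```python
import struct


def reverse_int64(t):
    """Invert the value so larger update_index values sort first."""
    return 0xFFFFFFFFFFFFFFFF - t


def log_key(ref_name, update_index):
    """ref_name '\\0' reverse_int64(update_index), in network byte order."""
    return ref_name + b"\0" + struct.pack(">Q", reverse_int64(update_index))


def tz_offset_minutes(tz):
    """Convert '+HHMM' / '-HHMM' to the signed minute count stored as sint16."""
    sign = -1 if tz[0] == "-" else 1
    return sign * (int(tz[1:3]) * 60 + int(tz[3:5]))
```

Sorting keys produced by `log_key` as plain byte strings groups records by reference and, within one reference, places the record with the highest `update_index` first, which is the ordering the log block relies on.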
- -Reading the log -+++++++++++++++ - -Readers accessing the log must first read the footer (below) to -determine the `log_position`. The first block of the log begins at -`log_position` bytes since the start of the file. The `log_position` is -not block aligned. - -Importing logs -++++++++++++++ - -When importing from `$GIT_DIR/logs` writers should globally order all -log records roughly by timestamp while preserving file order, and assign -unique, increasing `update_index` values for each log line. Newer log -records get higher `update_index` values. - -Although an import may write only a single reftable file, the reftable -file must span many unique `update_index`, as each log line requires its -own `update_index` to preserve semantics. - -Log index -^^^^^^^^^ - -The log index stores the log key -(`refname \0 reverse_int64(update_index)`) for the last log record of -every log block in the file, supporting bounded-time lookup. - -A log index block must be written if 2 or more log blocks are written to -the file. If present, the log index appears after the last log block. -There is no padding used to align the log index to block alignment. - -Log index format is identical to ref index, except the keys are 9 bytes -longer to include `'\0'` and the 8-byte `reverse_int64(update_index)`. -Records use `block_position` to refer to the start of a log block. - -Reading the index -+++++++++++++++++ - -Readers loading the log index must first read the footer (below) to -obtain `log_index_position`. If not present, the position will be 0. - -Footer -^^^^^^ - -After the last block of the file, a file footer is written. It begins -like the file header, but is extended with additional data. - -.... - HEADER - - uint64( ref_index_position ) - uint64( (obj_position << 5) | obj_id_len ) - uint64( obj_index_position ) - - uint64( log_position ) - uint64( log_index_position ) - - uint32( CRC-32 of above ) -.... - -If a section is missing (e.g. 
ref index) the corresponding position -field (e.g. `ref_index_position`) will be 0. - -* `obj_position`: byte position for the first obj block. -* `obj_id_len`: number of bytes used to abbreviate object names in -obj blocks. -* `log_position`: byte position for the first log block. -* `ref_index_position`: byte position for the start of the ref index. -* `obj_index_position`: byte position for the start of the obj index. -* `log_index_position`: byte position for the start of the log index. - -The size of the footer is 68 bytes for version 1, and 72 bytes for -version 2. - -Reading the footer -++++++++++++++++++ - -Readers must first read the file start to determine the version -number. Then they seek to `file_length - FOOTER_LENGTH` to access the -footer. A trusted external source (such as `stat(2)`) is necessary to -obtain `file_length`. When reading the footer, readers must verify: - -* 4-byte magic is correct -* 1-byte version number is recognized -* 4-byte CRC-32 matches the other 64 bytes (including magic, and -version) - -Once verified, the other fields of the footer can be accessed. - -Empty tables -++++++++++++ - -A reftable may be empty. In this case, the file starts with a header -and is immediately followed by a footer. - -Binary search -^^^^^^^^^^^^^ - -Binary search within a block is supported by the `restart_offset` fields -at the end of the block. Readers can binary search through the restart -table to locate between which two restart points the sought reference or -key should appear. - -Each record identified by a `restart_offset` stores the complete key in -the `suffix` field of the record, making the compare operation during -binary search straightforward. - -Once a restart point lexicographically before the sought reference has -been identified, readers can linearly scan through the following record -entries to locate the sought record, terminating if the current record -sorts after (and therefore the sought key is not present). 
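The binary search described above can be sketched in Python on a simplified in-memory model of a block. This is an assumption-laden illustration: records are already-parsed `(prefix_length, suffix)` pairs and the restart table holds record indexes instead of byte offsets, and the names `build_block` and `seek` are hypothetical.

```python
def common_prefix_len(a, b):
    """Number of leading bytes shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def build_block(sorted_names, restart_interval=16):
    """Prefix-compress sorted names; every restart record stores its full key."""
    records, restarts, prior = [], [], b""
    for i, name in enumerate(sorted_names):
        prefix = 0
        if i % restart_interval == 0:
            restarts.append(i)
        else:
            prefix = common_prefix_len(prior, name)
        records.append((prefix, name[prefix:]))
        prior = name
    return records, restarts


def seek(records, restarts, want):
    """Return the record index of `want`, or None if it is absent."""
    lo, hi = 0, len(restarts)
    while lo < hi:  # find the last restart whose full key is <= want
        mid = (lo + hi) // 2
        if records[restarts[mid]][1] <= want:
            lo = mid + 1
        else:
            hi = mid
    if lo == 0:
        return None  # want sorts before the first key in the block
    name = b""
    for i in range(restarts[lo - 1], len(records)):
        prefix, suffix = records[i]
        name = name[:prefix] + suffix  # this_name = prior_name[0..prefix] + suffix
        if name == want:
            return i
        if name > want:
            return None  # sorted past the sought key, so it is not present
    return None
```

The comparison in the binary search step only touches `suffix`, which works because restart records always have `prefix_length = 0`.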
- -Restart point selection -+++++++++++++++++++++++ - -Writers determine the restart points at file creation. The process is -arbitrary, but every 16 or 64 records is recommended. Every 16 may be -more suitable for smaller block sizes (4k or 8k), every 64 for larger -block sizes (64k). - -More frequent restart points reduce prefix compression and increase -space consumed by the restart table, both of which increase file size. - -Less frequent restart points make prefix compression more effective, -decreasing overall file size, with increased penalties for readers -walking through more records after the binary search step. - -A maximum of `65535` restart points per block is supported. - -Considerations -~~~~~~~~~~~~~~ - -Lightweight refs dominate -^^^^^^^^^^^^^^^^^^^^^^^^^ - -The reftable format assumes the vast majority of references are single -object names valued with common prefixes, such as Gerrit Code Review's -`refs/changes/` namespace, GitHub's `refs/pulls/` namespace, or many -lightweight tags in the `refs/tags/` namespace. - -Annotated tags storing the peeled object cost an additional object name per -reference. - -Low overhead -^^^^^^^^^^^^ - -A reftable with very few references (e.g. git.git with 5 heads) is 269 -bytes for reftable, vs. 332 bytes for packed-refs. This supports -reftable scaling down for transaction logs (below). - -Block size -^^^^^^^^^^ - -For a Gerrit Code Review type repository with many change refs, larger -block sizes (64 KiB) and less frequent restart points (every 64) yield -better compression due to more references within the block compressing -against the prior reference. - -Larger block sizes reduce the index size, as the reftable will require -fewer blocks to store the same number of references. - -Minimal disk seeks -^^^^^^^^^^^^^^^^^^ - -Assuming the index block has been loaded into memory, binary searching -for any single reference requires exactly 1 disk seek to load the -containing block. 
- -Scans and lookups dominate -^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Scanning all references and lookup by name (or namespace such as -`refs/heads/`) are the most common activities performed on repositories. -Object names are stored directly with references to optimize this use case. - -Logs are infrequently read -^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Logs are infrequently accessed, but can be large. Deflating log blocks -saves disk space, with some increased penalty at read time. - -Logs are stored in an isolated section from refs, reducing the burden on -reference readers that want to ignore logs. Further, historical logs can -be isolated into log-only files. - -Logs are read backwards -^^^^^^^^^^^^^^^^^^^^^^^ - -Logs are frequently accessed backwards (most recent N records for master -to answer `master@{4}`), so log records are grouped by reference, and -sorted descending by update index. - -Repository format -~~~~~~~~~~~~~~~~~ - -Version 1 -^^^^^^^^^ - -A repository must set its `$GIT_DIR/config` to configure reftable: - -.... -[core] - repositoryformatversion = 1 -[extensions] - refStorage = reftable -.... - -Layout -^^^^^^ - -A collection of reftable files is stored in the `$GIT_DIR/reftable/` -directory: - -.... -00000001-00000001.log -00000002-00000002.ref -00000003-00000003.ref -.... - -where each reftable file is given a unique name, such as one produced by the -pattern `${min_update_index}-${max_update_index}.ref`. - -Log-only files use the `.log` extension, while ref-only and mixed ref -and log files use the `.ref` extension. - -The stack ordering file is `$GIT_DIR/reftable/tables.list` and lists the -current files, one per line, in order, from oldest (base) to newest -(most recent): - -.... -$ cat .git/reftable/tables.list -00000001-00000001.log -00000002-00000002.ref -00000003-00000003.ref -.... 
- -Readers must read `$GIT_DIR/reftable/tables.list` to determine which -files are relevant right now, and search through the stack in reverse -order (last reftable is examined first). - -Reftable files not listed in `tables.list` may be new (and about to be -added to the stack by the active writer), or ancient and ready to be -pruned. - -Backward compatibility -^^^^^^^^^^^^^^^^^^^^^^ - -Older clients should continue to recognize the directory as a git -repository so they don't look for an enclosing repository in parent -directories. To this end, a reftable-enabled repository must contain the -following dummy files - -* `.git/HEAD`, a regular file containing `ref: refs/heads/.invalid`. -* `.git/refs/`, a directory -* `.git/refs/heads`, a regular file - -Readers -^^^^^^^ - -Readers can obtain a consistent snapshot of the reference space by -following: - -1. Open and read the `tables.list` file. -2. Open each of the reftable files that it mentions. -3. If any of the files is missing, goto 1. -4. Read from the now-open files as long as necessary. - -Update transactions -^^^^^^^^^^^^^^^^^^^ - -Although reftables are immutable, mutations are supported by writing a -new reftable and atomically appending it to the stack: - -1. Acquire `tables.list.lock`. -2. Read `tables.list` to determine current reftables. -3. Select `update_index` to be most recent file's -`max_update_index + 1`. -4. Prepare temp reftable `tmp_XXXXXX`, including log entries. -5. Rename `tmp_XXXXXX` to `${update_index}-${update_index}.ref`. -6. Copy `tables.list` to `tables.list.lock`, appending file from (5). -7. Rename `tables.list.lock` to `tables.list`. - -During step 4 the new file's `min_update_index` and `max_update_index` -are both set to the `update_index` selected by step 3. All log records -for the transaction use the same `update_index` in their keys. This -enables later correlation of which references were updated by the same -transaction. 
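The seven update transaction steps above can be sketched in Python roughly as follows. This is only an illustration under several assumptions: the file written in step 4 contains nothing but a version 1 header (so it is not a valid reftable), the zero-padded decimal `%08d` file name merely mirrors the example listing, and `commit_transaction`, `header_v1` and `read_max_update_index` are hypothetical names. No retry or backoff around the lock is shown.

```python
import os
import tempfile


def header_v1(min_idx, max_idx, block_size=0):
    """24-byte version 1 header: 'REFT', uint8, uint24, uint64, uint64."""
    return (b"REFT" + bytes([1]) + block_size.to_bytes(3, "big")
            + min_idx.to_bytes(8, "big") + max_idx.to_bytes(8, "big"))


def read_max_update_index(path):
    with open(path, "rb") as f:
        hdr = f.read(24)
    if hdr[:4] != b"REFT":
        raise ValueError("not a reftable file: " + path)
    return int.from_bytes(hdr[16:24], "big")


def commit_transaction(reftable_dir):
    list_path = os.path.join(reftable_dir, "tables.list")
    lock_path = list_path + ".lock"
    # 1. Acquire the lock; O_EXCL makes this fail if another writer holds it.
    fd = os.open(lock_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    try:
        # 2. Read tables.list to determine the current stack.
        names = []
        if os.path.exists(list_path):
            with open(list_path) as f:
                names = f.read().split()
        # 3. update_index = most recent file's max_update_index + 1.
        update_index = 1
        if names:
            newest = os.path.join(reftable_dir, names[-1])
            update_index = read_max_update_index(newest) + 1
        # 4. Prepare the temp reftable (header-only placeholder here).
        tmp_fd, tmp_path = tempfile.mkstemp(prefix="tmp_", dir=reftable_dir)
        with os.fdopen(tmp_fd, "wb") as tmp:
            tmp.write(header_v1(update_index, update_index))
        # 5. Rename the temp file to its final name.
        name = "%08d-%08d.ref" % (update_index, update_index)
        os.rename(tmp_path, os.path.join(reftable_dir, name))
        # 6. Write the old list plus the new file into the lock file.
        with os.fdopen(fd, "w") as lock:
            fd = None  # the file object now owns the descriptor
            lock.write("".join(n + "\n" for n in names + [name]))
        # 7. Atomically replace tables.list with the lock file.
        os.replace(lock_path, list_path)
        return name
    except BaseException:
        if fd is not None:
            os.close(fd)
        if os.path.exists(lock_path):
            os.unlink(lock_path)
        raise
```

Readers never see a partially written list because `tables.list` only changes through the final rename; a concurrent writer calling this function while the lock exists gets `FileExistsError` and would have to wait and retry.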
- -Because a single `tables.list.lock` file is used to manage locking, the -repository is single-threaded for writers. Writers may have to busy-spin -(with backoff) around creating `tables.list.lock`, for up to an -acceptable wait period, aborting if the repository is too busy to -mutate. Application servers wrapped around repositories (e.g. Gerrit -Code Review) can layer their own lock/wait queue to improve fairness to -writers. - -Reference deletions -^^^^^^^^^^^^^^^^^^^ - -Deletion of any reference can be explicitly stored by setting the `type` -to `0x0` and omitting the `value` field of the `ref_record`. This serves -as a tombstone, overriding any assertions about the existence of the -reference from earlier files in the stack. - -Compaction -^^^^^^^^^^ - -A partial stack of reftables can be compacted by merging references -using a straightforward merge join across reftables, selecting the most -recent value for output, and omitting deleted references that do not -appear in remaining, lower reftables. - -A compacted reftable should set its `min_update_index` to the smallest -of the input files' `min_update_index`, and its `max_update_index` -likewise to the largest input `max_update_index`. - -For sake of illustration, assume the stack currently consists of -reftable files (from oldest to newest): A, B, C, and D. The compactor is -going to compact B and C, leaving A and D alone. - -1. Obtain lock `tables.list.lock` and read the `tables.list` file. -2. Obtain locks `B.lock` and `C.lock`. Ownership of these locks -prevents other processes from trying to compact these files. -3. Release `tables.list.lock`. -4. Compact `B` and `C` into a temp file -`${min_update_index}-${max_update_index}_XXXXXX`. -5. Reacquire lock `tables.list.lock`. -6. Verify that `B` and `C` are still in the stack, in that order. This -should always be the case, assuming that other processes are adhering to -the locking protocol. -7. 
Rename `${min_update_index}-${max_update_index}_XXXXXX` to -`${min_update_index}-${max_update_index}.ref`. -8. Write the new stack to `tables.list.lock`, replacing `B` and `C` -with the file from (4). -9. Rename `tables.list.lock` to `tables.list`. -10. Delete `B` and `C`, perhaps after a short sleep to avoid forcing -readers to backtrack. - -This strategy permits compactions to proceed independently of updates. - -Each reftable (compacted or not) is uniquely identified by its name, so -open reftables can be cached by their name. - -Alternatives considered -~~~~~~~~~~~~~~~~~~~~~~~ - -bzip packed-refs -^^^^^^^^^^^^^^^^ - -`bzip2` can significantly shrink a large packed-refs file (e.g. 62 MiB -compresses to 23 MiB, 37%). However the bzip format does not support -random access to a single reference. Readers must inflate and discard -while performing a linear scan. - -Breaking packed-refs into chunks (individually compressing each chunk) -would reduce the amount of data a reader must inflate, but still leaves -the problem of indexing chunks to support readers efficiently locating -the correct chunk. - -Given the compression achieved by reftable's encoding, it does not seem -necessary to add the complexity of bzip/gzip/zlib. - -Michael Haggerty's alternate format -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Michael Haggerty proposed -link:https://lore.kernel.org/git/CAMy9T_HCnyc1g8XWOOWhe7nN0aEFyyBskV2aOMb_fe%2BwGvEJ7A%40mail.gmail.com/[an -alternate] format to reftable on the Git mailing list. This format uses -smaller chunks, without the restart table, and avoids block alignment -with padding. Reflog entries immediately follow each ref, and are thus -interleaved between refs. - -Performance testing indicates reftable is faster for lookups (51% -faster, 11.2 usec vs. 
5.4 usec), although reftable produces a slightly -larger file (+ ~3.2%, 28.3M vs 29.2M): - -[cols=">,>,>,>",options="header",] -|===================================== -|format |size |seek cold |seek hot -|mh-alt |28.3 M |23.4 usec |11.2 usec -|reftable |29.2 M |19.9 usec |5.4 usec -|===================================== - -JGit Ketch RefTree -^^^^^^^^^^^^^^^^^^ - -https://dev.eclipse.org/mhonarc/lists/jgit-dev/msg03073.html[JGit Ketch] -proposed -link:https://lore.kernel.org/git/CAJo%3DhJvnAPNAdDcAAwAvU9C4RVeQdoS3Ev9WTguHx4fD0V_nOg%40mail.gmail.com/[RefTree], -an encoding of references inside Git tree objects stored as part of the -repository's object database. - -The RefTree format adds additional load on the object database storage -layer (more loose objects, more objects in packs), and relies heavily on -the packer's delta compression to save space. Namespaces which are flat -(e.g. thousands of tags in refs/tags) initially create very large loose -objects, and so RefTree does not address the problem of copying many -references to modify a handful. - -Flat namespaces are not efficiently searchable in RefTree, as tree -objects in canonical formatting cannot be binary searched. This fails -the need to handle a large number of references in a single namespace, -such as GitHub's `refs/pulls`, or a project with many tags. - -LMDB -^^^^ - -David Turner proposed -https://lore.kernel.org/git/1455772670-21142-26-git-send-email-dturner@twopensource.com/[using -LMDB], as LMDB is lightweight (64k of runtime code) and GPL-compatible -license. - -A downside of LMDB is its reliance on a single C implementation. This -makes embedding inside JGit (a popular reimplementation of Git) -difficult, and hoisting onto virtual storage (for JGit DFS) virtually -impossible. - -A common format that can be supported by all major Git implementations -(git-core, JGit, libgit2) is strongly preferred. 
diff --git a/third_party/git/Documentation/technical/repository-version.txt b/third_party/git/Documentation/technical/repository-version.txt deleted file mode 100644 index 7844ef30ffde..000000000000 --- a/third_party/git/Documentation/technical/repository-version.txt +++ /dev/null @@ -1,102 +0,0 @@ -== Git Repository Format Versions - -Every git repository is marked with a numeric version in the -`core.repositoryformatversion` key of its `config` file. This version -specifies the rules for operating on the on-disk repository data. An -implementation of git which does not understand a particular version -advertised by an on-disk repository MUST NOT operate on that repository; -doing so risks not only producing wrong results, but actually losing -data. - -Because of this rule, version bumps should be kept to an absolute -minimum. Instead, we generally prefer these strategies: - - - bumping format version numbers of individual data files (e.g., - index, packfiles, etc). This restricts the incompatibilities only to - those files. - - - introducing new data that gracefully degrades when used by older - clients (e.g., pack bitmap files are ignored by older clients, which - simply do not take advantage of the optimization they provide). - -A whole-repository format version bump should only be part of a change -that cannot be independently versioned. For instance, if one were to -change the reachability rules for objects, or the rules for locking -refs, that would require a bump of the repository format version. - -Note that this applies only to accessing the repository's disk contents -directly. An older client which understands only format `0` may still -connect via `git://` to a repository using format `1`, as long as the -server process understands format `1`. 
- -The preferred strategy for rolling out a version bump (whether whole -repository or for a single file) is to teach git to read the new format, -and allow writing the new format with a config switch or command line -option (for experimentation or for those who do not care about backwards -compatibility with older gits). Then after a long period to allow the -reading capability to become common, we may switch to writing the new -format by default. - -The currently defined format versions are: - -=== Version `0` - -This is the format defined by the initial version of git, including but -not limited to the format of the repository directory, the repository -configuration file, and the object and ref storage. Specifying the -complete behavior of git is beyond the scope of this document. - -=== Version `1` - -This format is identical to version `0`, with the following exceptions: - - 1. When reading the `core.repositoryformatversion` variable, a git - implementation which supports version 1 MUST also read any - configuration keys found in the `extensions` section of the - configuration file. - - 2. If a version-1 repository specifies any `extensions.*` keys that - the running git has not implemented, the operation MUST NOT - proceed. Similarly, if the value of any known key is not understood - by the implementation, the operation MUST NOT proceed. - -Note that if no extensions are specified in the config file, then -`core.repositoryformatversion` SHOULD be set to `0` (setting it to `1` -provides no benefit, and makes the repository incompatible with older -implementations of git). - -This document will serve as the master list for extensions. Any -implementation wishing to define a new extension should make a note of -it here, in order to claim the name. - -The defined extensions are: - -==== `noop` - -This extension does not change git's behavior at all. It is useful only -for testing format-1 compatibility. 
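The version `1` rules quoted above can be sketched as a small gate function. This is only an illustration, not Git's implementation. The flat `config` dictionary, the `supported_extensions` argument, and the name `may_operate` are hypothetical. It also assumes that a missing `core.repositoryformatversion` is treated as `0`, and that key names compare case-insensitively. The rule about values that are not understood is not modeled.

```python
def may_operate(config, supported_extensions):
    """Decide whether an implementation may operate on a repository.

    `config` is a hypothetical flat dict of config keys to string values,
    e.g. {"core.repositoryformatversion": "1", "extensions.noop": "true"}.
    """
    # Assumption: a missing version key is treated as version 0.
    version = int(config.get("core.repositoryformatversion", "0"))
    if version == 0:
        # Per the text, only version 1 requires reading `extensions.*`.
        return True
    if version != 1:
        # A version the implementation does not understand: MUST NOT operate.
        return False

    supported = {name.lower() for name in supported_extensions}
    prefix = "extensions."
    for key in config:
        lowered = key.lower()
        if lowered.startswith(prefix) and lowered[len(prefix):] not in supported:
            # An unimplemented extension: the operation MUST NOT proceed.
            return False
    return True
```

For example, an implementation that supports only `noop` would refuse a version-1 repository that sets `extensions.preciousObjects`.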
- -==== `preciousObjects` - -When the config key `extensions.preciousObjects` is set to `true`, -objects in the repository MUST NOT be deleted (e.g., by `git-prune` or -`git repack -d`). - -==== `partialclone` - -When the config key `extensions.partialclone` is set, it indicates -that the repo was created with a partial clone (or later performed -a partial fetch) and that the remote may have omitted sending -certain unwanted objects. Such a remote is called a "promisor remote" -and it promises that all such omitted objects can be fetched from it -in the future. - -The value of this key is the name of the promisor remote. - -==== `worktreeConfig` - -If set, by default "git config" reads from both "config" and -"config.worktree" file from GIT_DIR in that order. In -multiple working directory mode, "config" file is shared while -"config.worktree" is per-working directory (i.e., it's in -GIT_COMMON_DIR/worktrees/<id>/config.worktree) diff --git a/third_party/git/Documentation/technical/rerere.txt b/third_party/git/Documentation/technical/rerere.txt deleted file mode 100644 index af5f9fc24f93..000000000000 --- a/third_party/git/Documentation/technical/rerere.txt +++ /dev/null @@ -1,186 +0,0 @@ -Rerere -====== - -This document describes the rerere logic. - -Conflict normalization ----------------------- - -To ensure recorded conflict resolutions can be looked up in the rerere -database, even when branches are merged in a different order, -different branches are merged that result in the same conflict, or -when different conflict style settings are used, rerere normalizes the -conflicts before writing them to the rerere database. - -Different conflict styles and branch names are normalized by stripping -the labels from the conflict markers, and removing the common ancestor -version from the `diff3` conflict style. Branches that are merged -in different order are normalized by sorting the conflict hunks. More -on each of those steps in the following sections. 
- -Once these two normalization operations are applied, a conflict ID is -calculated based on the normalized conflict, which is later used by -rerere to look up the conflict in the rerere database. - -Removing the common ancestor version -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Say we have three branches AB, AC and AC2. The common ancestor of -these branches has a file with a line containing the string "A" (for -brevity this is called "line A" in the rest of the document). In -branch AB this line is changed to "B", in AC, this line is changed to -"C", and branch AC2 is forked off of AC, after the line was changed to -"C". - -Forking a branch ABAC off of branch AB and then merging AC into it, we -get a conflict like the following: - - <<<<<<< HEAD - B - ======= - C - >>>>>>> AC - -Doing the analogous with AC2 (forking a branch ABAC2 off of branch AB -and then merging branch AC2 into it), using the diff3 conflict style, -we get a conflict like the following: - - <<<<<<< HEAD - B - ||||||| merged common ancestors - A - ======= - C - >>>>>>> AC2 - -By resolving this conflict, to leave line D, the user declares: - - After examining what branches AB and AC did, I believe that making - line A into line D is the best thing to do that is compatible with - what AB and AC wanted to do. - -As branch AC2 refers to the same commit as AC, the above implies that -this is also compatible what AB and AC2 wanted to do. - -By extension, this means that rerere should recognize that the above -conflicts are the same. To do this, the labels on the conflict -markers are stripped, and the common ancestor version is removed. The above -examples would both result in the following normalized conflict: - - <<<<<<< - B - ======= - C - >>>>>>> - -Sorting hunks -~~~~~~~~~~~~~ - -As before, lets imagine that a common ancestor had a file with line A -its early part, and line X in its late part. 
And then four branches -are forked that do these things: - - - AB: changes A to B - - AC: changes A to C - - XY: changes X to Y - - XZ: changes X to Z - -Now, forking a branch ABAC off of branch AB and then merging AC into -it, and forking a branch ACAB off of branch AC and then merging AB -into it, would yield the conflict in a different order. The former -would say "A became B or C, what now?" while the latter would say "A -became C or B, what now?" - -As a reminder, the act of merging AC into ABAC and resolving the -conflict to leave line D means that the user declares: - - After examining what branches AB and AC did, I believe that - making line A into line D is the best thing to do that is - compatible with what AB and AC wanted to do. - -So the conflict we would see when merging AB into ACAB should be -resolved the same way---it is the resolution that is in line with that -declaration. - -Imagine that similarly previously a branch XYXZ was forked from XY, -and XZ was merged into it, and resolved "X became Y or Z" into "X -became W". - -Now, if a branch ABXY was forked from AB and then merged XY, then ABXY -would have line B in its early part and line Y in its later part. -Such a merge would be quite clean. We can construct 4 combinations -using these four branches ((AB, AC) x (XY, XZ)). - -Merging ABXY and ACXZ would make "an early A became B or C, a late X -became Y or Z" conflict, while merging ACXY and ABXZ would make "an -early A became C or B, a late X became Y or Z". We can see there are -4 combinations of ("B or C", "C or B") x ("X or Y", "Y or X"). - -By sorting, the conflict is given its canonical name, namely, "an -early part became B or C, a late part became X or Y", and whenever -any of these four patterns appear, and we can get to the same conflict -and resolution that we saw earlier. - -Without the sorting, we'd have to somehow find a previous resolution -from combinatorial explosion. 
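The two normalization steps described above (stripping labels and the common ancestor version, then sorting) can be sketched for a single, non-nested conflict as follows. This is a minimal illustration, not rerere's actual code. The function name `normalize_conflict` and the list-of-lines input are hypothetical, and "sorting" is interpreted here as ordering the two sides of one conflict.

```python
def normalize_conflict(lines):
    """Return the two sides of one non-nested conflict in canonical order.

    `lines` holds the conflict region without trailing newlines, starting
    at the `<<<<<<<` marker and ending at the `>>>>>>>` marker.
    """
    side_a, side_b = [], []
    state = None
    for line in lines:
        if line.startswith("<<<<<<<"):
            state = "a"        # label after the marker is ignored
        elif line.startswith("|||||||"):
            state = "base"     # diff3 common ancestor part is dropped
        elif line.startswith("======="):
            state = "b"
        elif line.startswith(">>>>>>>"):
            state = None       # label after the marker is ignored
        elif state == "a":
            side_a.append(line)
        elif state == "b":
            side_b.append(line)
    a, b = "\n".join(side_a), "\n".join(side_b)
    # Sorting the two sides makes "B or C" and "C or B" identical.
    return (a, b) if a <= b else (b, a)
```

With this sketch, the plain ABAC conflict, the diff3-style ABAC2 conflict, and the reversed ACAB conflict all normalize to the same pair.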
- -Conflict ID calculation -~~~~~~~~~~~~~~~~~~~~~~~ - -Once the conflict normalization is done, the conflict ID is calculated -as the sha1 hash of the conflict hunks appended to each other, -separated by <NUL> characters. The conflict markers are stripped out -before the sha1 is calculated. So in the example above, where we -merge branch AC which changes line A to line C, into branch AB, which -changes line A to line C, the conflict ID would be -SHA1('B<NUL>C<NUL>'). - -If there are multiple conflicts in one file, the sha1 is calculated -the same way with all hunks appended to each other, in the order in -which they appear in the file, separated by a <NUL> character. - -Nested conflicts -~~~~~~~~~~~~~~~~ - -Nested conflicts are handled very similarly to "simple" conflicts. -Similar to simple conflicts, the conflict is first normalized by -stripping the labels from conflict markers, stripping the common ancestor -version, and the sorting the conflict hunks, both for the outer and the -inner conflict. This is done recursively, so any number of nested -conflicts can be handled. - -Note that this only works for conflict markers that "cleanly nest". If -there are any unmatched conflict markers, rerere will fail to handle -the conflict and record a conflict resolution. - -The only difference is in how the conflict ID is calculated. For the -inner conflict, the conflict markers themselves are not stripped out -before calculating the sha1. 
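For the simple (non-nested) case, the conflict ID rule from the "Conflict ID calculation" section can be sketched as below. The function name `conflict_id` and the list-of-pairs input are hypothetical. The sketch follows the document's `SHA1('B<NUL>C<NUL>')` notation literally, whereas the real implementation hashes the hunk text as it is stored, which may include line terminators. Nested conflicts are not handled.

```python
import hashlib


def conflict_id(hunks):
    """Compute a conflict ID from already-normalized conflicts.

    `hunks` is a list of (first_side, second_side) string pairs, in the
    order the conflicts appear in the file, with markers already stripped.
    """
    digest = hashlib.sha1()
    for first, second in hunks:
        # Each side is followed by a NUL byte, as in 'B<NUL>C<NUL>'.
        digest.update(first.encode("utf-8") + b"\0")
        digest.update(second.encode("utf-8") + b"\0")
    return digest.hexdigest()
```

A file with several conflicts simply passes several pairs, so every hunk feeds the same hash in file order.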
- -Say we have the following conflict for example: - - <<<<<<< HEAD - 1 - ======= - <<<<<<< HEAD - 3 - ======= - 2 - >>>>>>> branch-2 - >>>>>>> branch-3~ - -After stripping out the labels of the conflict markers, and sorting -the hunks, the conflict would look as follows: - - <<<<<<< - 1 - ======= - <<<<<<< - 2 - ======= - 3 - >>>>>>> - >>>>>>> - -and finally the conflict ID would be calculated as: -`sha1('1<NUL><<<<<<<\n3\n=======\n2\n>>>>>>><NUL>')` diff --git a/third_party/git/Documentation/technical/send-pack-pipeline.txt b/third_party/git/Documentation/technical/send-pack-pipeline.txt deleted file mode 100644 index 9b5a0bc18667..000000000000 --- a/third_party/git/Documentation/technical/send-pack-pipeline.txt +++ /dev/null @@ -1,63 +0,0 @@ -Git-send-pack internals -======================= - -Overall operation ------------------ - -. Connects to the remote side and invokes git-receive-pack. - -. Learns what refs the remote has and what commit they point at. - Matches them to the refspecs we are pushing. - -. Checks if there are non-fast-forwards. Unlike fetch-pack, - the repository send-pack runs in is supposed to be a superset - of the recipient in fast-forward cases, so there is no need - for want/have exchanges, and fast-forward check can be done - locally. Tell the result to the other end. - -. Calls pack_objects() which generates a packfile and sends it - over to the other end. - -. If the remote side is new enough (v1.1.0 or later), wait for - the unpack and hook status from the other end. - -. Exit with appropriate error codes. - - -Pack_objects pipeline ---------------------- - -This function gets one file descriptor (`fd`) which is either a -socket (over the network) or a pipe (local). What's written to -this fd goes to git-receive-pack to be unpacked. - - send-pack ---> fd ---> receive-pack - -The function pack_objects creates a pipe and then forks. 
The -forked child execs pack-objects with --revs to receive revision -parameters from its standard input. This process will write the -packfile to the other end. - - send-pack - | - pack_objects() ---> fd ---> receive-pack - | ^ (pipe) - v | - (child) - -The child dup2's to arrange its standard output to go back to -the other end, and read its standard input to come from the -pipe. After that it exec's pack-objects. On the other hand, -the parent process, before starting to feed the child pipeline, -closes the reading side of the pipe and fd to receive-pack. - - send-pack - | - pack_objects(parent) - | - v [0] - pack-objects [0] ---> receive-pack - - -[jc: the pipeline was much more complex and needed documentation before - I understood an earlier bug, but now it is trivial and straightforward.] diff --git a/third_party/git/Documentation/technical/shallow.txt b/third_party/git/Documentation/technical/shallow.txt deleted file mode 100644 index f3738baa0f05..000000000000 --- a/third_party/git/Documentation/technical/shallow.txt +++ /dev/null @@ -1,60 +0,0 @@ -Shallow commits -=============== - -.Definition -********************************************************* -Shallow commits do have parents, but not in the shallow -repo, and therefore grafts are introduced pretending that -these commits have no parents. -********************************************************* - -$GIT_DIR/shallow lists commit object names and tells Git to -pretend as if they are root commits (e.g. "git log" traversal -stops after showing them; "git fsck" does not complain saying -the commits listed on their "parent" lines do not exist). - -Each line contains exactly one object name. When read, a commit_graft -will be constructed, which has nr_parent < 0 to make it easier -to discern from user provided grafts. - -Note that the shallow feature could not be changed easily to -use replace refs: a commit containing a `mergetag` is not allowed -to be replaced, not even by a root commit. 
Such a commit can be -made shallow, though. Also, having a `shallow` file explicitly -listing all the commits made shallow makes it a *lot* easier to -do shallow-specific things such as to deepen the history. - -Since fsck-objects relies on the library to read the objects, -it honours shallow commits automatically. - -There are some unfinished ends of the whole shallow business: - -- maybe we have to force non-thin packs when fetching into a - shallow repo (ATM they are forced non-thin). - -- A special handling of a shallow upstream is needed. At some - stage, upload-pack has to check if it sends a shallow commit, - and it should send that information early (or fail, if the - client does not support shallow repositories). There is no - support at all for this in this patch series. - -- Instead of locking $GIT_DIR/shallow at the start, just - the timestamp of it is noted, and when it comes to writing it, - a check is performed if the mtime is still the same, dying if - it is not. - -- It is unclear how "push into/from a shallow repo" should behave. - -- If you deepen a history, you'd want to get the tags of the - newly stored (but older!) commits. This does not work right now. - -To make a shallow clone, you can call "git-clone --depth 20 repo". -The result contains only commit chains with a length of at most 20. -It also writes an appropriate $GIT_DIR/shallow. - -You can deepen a shallow repository with "git-fetch --depth 20 -repo branch", which will fetch branch from repo, but stop at depth -20, updating $GIT_DIR/shallow. - -The special depth 2147483647 (or 0x7fffffff, the largest positive -number a signed 32-bit integer can contain) means infinite depth. 
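As a small illustration of two facts from shallow.txt above, the sketch below reads object names from the contents of a `$GIT_DIR/shallow` file (one per line) and names the special "infinite" depth. The helper names are hypothetical and this is not Git's own parser. Skipping blank lines is an assumption made for robustness, not something the text specifies.

```python
# The special depth meaning "infinite": the largest signed 32-bit integer.
INFINITE_DEPTH = 0x7FFFFFFF


def parse_shallow(text):
    """Return the object names listed in the contents of a shallow file."""
    names = []
    for line in text.splitlines():
        name = line.strip()
        if name:  # assumption: ignore empty lines
            names.append(name)
    return names


def is_infinite_depth(depth):
    """Tell whether a requested depth is the special infinite value."""
    return depth == INFINITE_DEPTH
```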
diff --git a/third_party/git/Documentation/technical/signature-format.txt b/third_party/git/Documentation/technical/signature-format.txt deleted file mode 100644 index 2c9406a56a88..000000000000 --- a/third_party/git/Documentation/technical/signature-format.txt +++ /dev/null @@ -1,186 +0,0 @@ -Git signature format -==================== - -== Overview - -Git uses cryptographic signatures in various places, currently objects (tags, -commits, mergetags) and transactions (pushes). In every case, the command which -is about to create an object or transaction determines a payload from that, -calls gpg to obtain a detached signature for the payload (`gpg -bsa`) and -embeds the signature into the object or transaction. - -Signatures always begin with `-----BEGIN PGP SIGNATURE-----` -and end with `-----END PGP SIGNATURE-----`, unless gpg is told to -produce RFC1991 signatures which use `MESSAGE` instead of `SIGNATURE`. - -The signed payload and the way the signature is embedded depends -on the type of the object resp. transaction. 
- -== Tag signatures - -- created by: `git tag -s` -- payload: annotated tag object -- embedding: append the signature to the unsigned tag object -- example: tag `signedtag` with subject `signed tag` - ----- -object 04b871796dc0420f8e7561a895b52484b701d51a -type commit -tag signedtag -tagger C O Mitter <committer@example.com> 1465981006 +0000 - -signed tag - -signed tag message body ------BEGIN PGP SIGNATURE----- -Version: GnuPG v1 - -iQEcBAABAgAGBQJXYRhOAAoJEGEJLoW3InGJklkIAIcnhL7RwEb/+QeX9enkXhxn -rxfdqrvWd1K80sl2TOt8Bg/NYwrUBw/RWJ+sg/hhHp4WtvE1HDGHlkEz3y11Lkuh -8tSxS3qKTxXUGozyPGuE90sJfExhZlW4knIQ1wt/yWqM+33E9pN4hzPqLwyrdods -q8FWEqPPUbSJXoMbRPw04S5jrLtZSsUWbRYjmJCHzlhSfFWW4eFd37uquIaLUBS0 -rkC3Jrx7420jkIpgFcTI2s60uhSQLzgcCwdA2ukSYIRnjg/zDkj8+3h/GaROJ72x -lZyI6HWixKJkWw8lE9aAOD9TmTW9sFJwcVAzmAuFX2kUreDUKMZduGcoRYGpD7E= -=jpXa ------END PGP SIGNATURE----- ----- - -- verify with: `git verify-tag [-v]` or `git tag -v` - ----- -gpg: Signature made Wed Jun 15 10:56:46 2016 CEST using RSA key ID B7227189 -gpg: Good signature from "Eris Discordia <discord@example.net>" -gpg: WARNING: This key is not certified with a trusted signature! -gpg: There is no indication that the signature belongs to the owner. 
-Primary key fingerprint: D4BE 2231 1AD3 131E 5EDA 29A4 6109 2E85 B722 7189 -object 04b871796dc0420f8e7561a895b52484b701d51a -type commit -tag signedtag -tagger C O Mitter <committer@example.com> 1465981006 +0000 - -signed tag - -signed tag message body ----- - -== Commit signatures - -- created by: `git commit -S` -- payload: commit object -- embedding: header entry `gpgsig` - (content is preceded by a space) -- example: commit with subject `signed commit` - ----- -tree eebfed94e75e7760540d1485c740902590a00332 -parent 04b871796dc0420f8e7561a895b52484b701d51a -author A U Thor <author@example.com> 1465981137 +0000 -committer C O Mitter <committer@example.com> 1465981137 +0000 -gpgsig -----BEGIN PGP SIGNATURE----- - Version: GnuPG v1 - - iQEcBAABAgAGBQJXYRjRAAoJEGEJLoW3InGJ3IwIAIY4SA6GxY3BjL60YyvsJPh/ - HRCJwH+w7wt3Yc/9/bW2F+gF72kdHOOs2jfv+OZhq0q4OAN6fvVSczISY/82LpS7 - DVdMQj2/YcHDT4xrDNBnXnviDO9G7am/9OE77kEbXrp7QPxvhjkicHNwy2rEflAA - zn075rtEERDHr8nRYiDh8eVrefSO7D+bdQ7gv+7GsYMsd2auJWi1dHOSfTr9HIF4 - HJhWXT9d2f8W+diRYXGh4X0wYiGg6na/soXc+vdtDYBzIxanRqjg8jCAeo1eOTk1 - EdTwhcTZlI0x5pvJ3H0+4hA2jtldVtmPM4OTB0cTrEWBad7XV6YgiyuII73Ve3I= - =jKHM - -----END PGP SIGNATURE----- - -signed commit - -signed commit message body ----- - -- verify with: `git verify-commit [-v]` (or `git show --show-signature`) - ----- -gpg: Signature made Wed Jun 15 10:58:57 2016 CEST using RSA key ID B7227189 -gpg: Good signature from "Eris Discordia <discord@example.net>" -gpg: WARNING: This key is not certified with a trusted signature! -gpg: There is no indication that the signature belongs to the owner. 
-Primary key fingerprint: D4BE 2231 1AD3 131E 5EDA 29A4 6109 2E85 B722 7189 -tree eebfed94e75e7760540d1485c740902590a00332 -parent 04b871796dc0420f8e7561a895b52484b701d51a -author A U Thor <author@example.com> 1465981137 +0000 -committer C O Mitter <committer@example.com> 1465981137 +0000 - -signed commit - -signed commit message body ----- - -== Mergetag signatures - -- created by: `git merge` on signed tag -- payload/embedding: the whole signed tag object is embedded into - the (merge) commit object as header entry `mergetag` -- example: merge of the signed tag `signedtag` as above - ----- -tree c7b1cff039a93f3600a1d18b82d26688668c7dea -parent c33429be94b5f2d3ee9b0adad223f877f174b05d -parent 04b871796dc0420f8e7561a895b52484b701d51a -author A U Thor <author@example.com> 1465982009 +0000 -committer C O Mitter <committer@example.com> 1465982009 +0000 -mergetag object 04b871796dc0420f8e7561a895b52484b701d51a - type commit - tag signedtag - tagger C O Mitter <committer@example.com> 1465981006 +0000 - - signed tag - - signed tag message body - -----BEGIN PGP SIGNATURE----- - Version: GnuPG v1 - - iQEcBAABAgAGBQJXYRhOAAoJEGEJLoW3InGJklkIAIcnhL7RwEb/+QeX9enkXhxn - rxfdqrvWd1K80sl2TOt8Bg/NYwrUBw/RWJ+sg/hhHp4WtvE1HDGHlkEz3y11Lkuh - 8tSxS3qKTxXUGozyPGuE90sJfExhZlW4knIQ1wt/yWqM+33E9pN4hzPqLwyrdods - q8FWEqPPUbSJXoMbRPw04S5jrLtZSsUWbRYjmJCHzlhSfFWW4eFd37uquIaLUBS0 - rkC3Jrx7420jkIpgFcTI2s60uhSQLzgcCwdA2ukSYIRnjg/zDkj8+3h/GaROJ72x - lZyI6HWixKJkWw8lE9aAOD9TmTW9sFJwcVAzmAuFX2kUreDUKMZduGcoRYGpD7E= - =jpXa - -----END PGP SIGNATURE----- - -Merge tag 'signedtag' into downstream - -signed tag - -signed tag message body - -# gpg: Signature made Wed Jun 15 08:56:46 2016 UTC using RSA key ID B7227189 -# gpg: Good signature from "Eris Discordia <discord@example.net>" -# gpg: WARNING: This key is not certified with a trusted signature! -# gpg: There is no indication that the signature belongs to the owner. 
-# Primary key fingerprint: D4BE 2231 1AD3 131E 5EDA 29A4 6109 2E85 B722 7189 ----- - -- verify with: verification is embedded in merge commit message by default, - alternatively with `git show --show-signature`: - ----- -commit 9863f0c76ff78712b6800e199a46aa56afbcbd49 -merged tag 'signedtag' -gpg: Signature made Wed Jun 15 10:56:46 2016 CEST using RSA key ID B7227189 -gpg: Good signature from "Eris Discordia <discord@example.net>" -gpg: WARNING: This key is not certified with a trusted signature! -gpg: There is no indication that the signature belongs to the owner. -Primary key fingerprint: D4BE 2231 1AD3 131E 5EDA 29A4 6109 2E85 B722 7189 -Merge: c33429b 04b8717 -Author: A U Thor <author@example.com> -Date: Wed Jun 15 09:13:29 2016 +0000 - - Merge tag 'signedtag' into downstream - - signed tag - - signed tag message body - - # gpg: Signature made Wed Jun 15 08:56:46 2016 UTC using RSA key ID B7227189 - # gpg: Good signature from "Eris Discordia <discord@example.net>" - # gpg: WARNING: This key is not certified with a trusted signature! - # gpg: There is no indication that the signature belongs to the owner. - # Primary key fingerprint: D4BE 2231 1AD3 131E 5EDA 29A4 6109 2E85 B722 7189 ----- diff --git a/third_party/git/Documentation/technical/trivial-merge.txt b/third_party/git/Documentation/technical/trivial-merge.txt deleted file mode 100644 index 1f1c33d0da30..000000000000 --- a/third_party/git/Documentation/technical/trivial-merge.txt +++ /dev/null @@ -1,121 +0,0 @@ -Trivial merge rules -=================== - -This document describes the outcomes of the trivial merge logic in read-tree. - -One-way merge -------------- - -This replaces the index with a different tree, keeping the stat info -for entries that don't change, and allowing -u to make the minimum -required changes to the working tree to have it match. - -Entries marked '+' have stat information. Spaces marked '*' don't -affect the result. 
- - index tree result - ----------------------- - * (empty) (empty) - (empty) tree tree - index+ tree tree - index+ index index+ - -Two-way merge -------------- - -It is permitted for the index to lack an entry; this does not prevent -any case from applying. - -If the index exists, it is an error for it not to match either the old -or the result. - -If multiple cases apply, the one used is listed first. - -A result which changes the index is an error if the index is not empty -and not up to date. - -Entries marked '+' have stat information. Spaces marked '*' don't -affect the result. - - case index old new result - ------------------------------------- - 0/2 (empty) * (empty) (empty) - 1/3 (empty) * new new - 4/5 index+ (empty) (empty) index+ - 6/7 index+ (empty) index index+ - 10 index+ index (empty) (empty) - 14/15 index+ old old index+ - 18/19 index+ old index index+ - 20 index+ index new new - -Three-way merge ---------------- - -It is permitted for the index to lack an entry; this does not prevent -any case from applying. - -If the index exists, it is an error for it not to match either the -head or (if the merge is trivial) the result. - -If multiple cases apply, the one used is listed first. - -A result of "no merge" means that index is left in stage 0, ancest in -stage 1, head in stage 2, and remote in stage 3 (if any of these are -empty, no entry is left for that stage). Otherwise, the given entry is -left in stage 0, and there are no other entries. - -A result of "no merge" is an error if the index is not empty and not -up to date. - -*empty* means that the tree must not have a directory-file conflict - with the entry. - -For multiple ancestors, a '+' means that this case applies even if -only one ancestor or remote fits; a '^' means all of the ancestors -must be the same. 
- - case ancest head remote result - ---------------------------------------- - 1 (empty)+ (empty) (empty) (empty) - 2ALT (empty)+ *empty* remote remote - 2 (empty)^ (empty) remote no merge - 3ALT (empty)+ head *empty* head - 3 (empty)^ head (empty) no merge - 4 (empty)^ head remote no merge - 5ALT * head head head - 6 ancest+ (empty) (empty) no merge - 8 ancest^ (empty) ancest no merge - 7 ancest+ (empty) remote no merge - 10 ancest^ ancest (empty) no merge - 9 ancest+ head (empty) no merge - 16 anc1/anc2 anc1 anc2 no merge - 13 ancest+ head ancest head - 14 ancest+ ancest remote remote - 11 ancest+ head remote no merge - -Only #2ALT and #3ALT use *empty*, because these are the only cases -where there can be conflicts that didn't exist before. Note that we -allow directory-file conflicts between things in different stages -after the trivial merge. - -A possible alternative for #6 is (empty), which would make it like -#1. This is not used, due to the likelihood that it arises due to -moving the file to multiple different locations or moving and deleting -it in different branches. - -Case #1 is included for completeness, and also in case we decide to -put on '+' markings; any path that is never mentioned at all isn't -handled. - -Note that #16 is when both #13 and #14 apply; in this case, we refuse -the trivial merge, because we can't tell from this data which is -right. This is a case of a reverted patch (in some direction, maybe -multiple times), and the right answer depends on looking at crossings -of history or common ancestors of the ancestors. - -Note that, between #6, #7, #9, and #11, all cases not otherwise -covered are handled in this table. - -For #8 and #10, there is alternative behavior, not currently -implemented, where the result is (empty). As currently implemented, -the automatic merge will generally give this effect. |
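The three-way table above can be sketched as a decision function for the single-ancestor case. This is a simplified reading of the table, not the read-tree implementation. The name `trivial_merge`, the use of `None` for (empty), and the `NO_MERGE` sentinel are all hypothetical. It also assumes no directory-file conflicts (so #2ALT and #3ALT always apply) and ignores multiple ancestors, index state, and the "up to date" checks.

```python
NO_MERGE = object()  # sentinel: entries are left in stages 1-3


def trivial_merge(ancest, head, remote):
    """Return the stage-0 result for one path, or NO_MERGE.

    Each argument is an opaque entry (for example a blob ID string) or
    None when the path is empty in that tree. A None result means the
    path ends up empty.
    """
    if ancest is None:
        if head is None and remote is None:
            return None          # case 1
        if head is None:
            return remote        # case 2ALT
        if remote is None:
            return head          # case 3ALT
        if head == remote:
            return head          # case 5ALT
        return NO_MERGE          # case 4
    if head is None or remote is None:
        return NO_MERGE          # cases 6, 7, 8, 9, 10
    if head == remote:
        return head              # case 5ALT
    if remote == ancest:
        return head              # case 13
    if head == ancest:
        return remote            # case 14
    return NO_MERGE              # case 11
```

Cases #8 and #10 return `NO_MERGE` here, matching the table rather than the unimplemented alternative that would produce (empty).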