diff options
Diffstat (limited to 'Documentation/technical')
52 files changed, 10542 insertions, 0 deletions
diff --git a/Documentation/technical/.gitignore b/Documentation/technical/.gitignore new file mode 100644 index 000000000000..8aa891daee05 --- /dev/null +++ b/Documentation/technical/.gitignore @@ -0,0 +1 @@ +api-index.txt diff --git a/Documentation/technical/api-allocation-growing.txt b/Documentation/technical/api-allocation-growing.txt new file mode 100644 index 000000000000..5a59b548448f --- /dev/null +++ b/Documentation/technical/api-allocation-growing.txt @@ -0,0 +1,39 @@ +allocation growing API +====================== + +Dynamically growing an array using realloc() is error prone and boring. + +Define your array with: + +* a pointer (`item`) that points at the array, initialized to `NULL` + (although please name the variable based on its contents, not on its + type); + +* an integer variable (`alloc`) that keeps track of how big the current + allocation is, initialized to `0`; + +* another integer variable (`nr`) to keep track of how many elements the + array currently has, initialized to `0`. + +Then before adding `n`th element to the item, call `ALLOC_GROW(item, n, +alloc)`. This ensures that the array can hold at least `n` elements by +calling `realloc(3)` and adjusting `alloc` variable. + +------------ +sometype *item; +size_t nr; +size_t alloc + +for (i = 0; i < nr; i++) + if (we like item[i] already) + return; + +/* we did not like any existing one, so add one */ +ALLOC_GROW(item, nr + 1, alloc); +item[nr++] = value you like; +------------ + +You are responsible for updating the `nr` variable. + +If you need to specify the number of elements to allocate explicitly +then use the macro `REALLOC_ARRAY(item, alloc)` instead of `ALLOC_GROW`. diff --git a/Documentation/technical/api-argv-array.txt b/Documentation/technical/api-argv-array.txt new file mode 100644 index 000000000000..870c8edbfb1d --- /dev/null +++ b/Documentation/technical/api-argv-array.txt @@ -0,0 +1,65 @@ +argv-array API +============== + +The argv-array API allows one to dynamically build and store +NULL-terminated lists. An argv-array maintains the invariant that the +`argv` member always points to a non-NULL array, and that the array is +always NULL-terminated at the element pointed to by `argv[argc]`. This +makes the result suitable for passing to functions expecting to receive +argv from main(), or the link:api-run-command.html[run-command API]. + +The string-list API (documented in string-list.h) is similar, but cannot be +used for these purposes; instead of storing a straight string pointer, +it contains an item structure with a `util` field that is not compatible +with the traditional argv interface. + +Each `argv_array` manages its own memory. Any strings pushed into the +array are duplicated, and all memory is freed by argv_array_clear(). + +Data Structures +--------------- + +`struct argv_array`:: + + A single array. This should be initialized by assignment from + `ARGV_ARRAY_INIT`, or by calling `argv_array_init`. The `argv` + member contains the actual array; the `argc` member contains the + number of elements in the array, not including the terminating + NULL. + +Functions +--------- + +`argv_array_init`:: + Initialize an array. This is no different than assigning from + `ARGV_ARRAY_INIT`. + +`argv_array_push`:: + Push a copy of a string onto the end of the array. + +`argv_array_pushl`:: + Push a list of strings onto the end of the array. The arguments + should be a list of `const char *` strings, terminated by a NULL + argument. + +`argv_array_pushf`:: + Format a string and push it onto the end of the array. This is a + convenience wrapper combining `strbuf_addf` and `argv_array_push`. + +`argv_array_pushv`:: + Push a null-terminated array of strings onto the end of the array. + +`argv_array_pop`:: + Remove the final element from the array. If there are no + elements in the array, do nothing. + +`argv_array_clear`:: + Free all memory associated with the array and return it to the + initial, empty state. + +`argv_array_detach`:: + Disconnect the `argv` member from the `argv_array` struct and + return it. The caller is responsible for freeing the memory used + by the array, and by the strings it references. After detaching, + the `argv_array` is in a reinitialized state and can be pushed + into again. diff --git a/Documentation/technical/api-config.txt b/Documentation/technical/api-config.txt new file mode 100644 index 000000000000..7d20716c32a4 --- /dev/null +++ b/Documentation/technical/api-config.txt @@ -0,0 +1,319 @@ +config API +========== + +The config API gives callers a way to access Git configuration files +(and files which have the same syntax). See linkgit:git-config[1] for a +discussion of the config file syntax. + +General Usage +------------- + +Config files are parsed linearly, and each variable found is passed to a +caller-provided callback function. The callback function is responsible +for any actions to be taken on the config option, and is free to ignore +some options. It is not uncommon for the configuration to be parsed +several times during the run of a Git program, with different callbacks +picking out different variables useful to themselves. + +A config callback function takes three parameters: + +- the name of the parsed variable. This is in canonical "flat" form: the + section, subsection, and variable segments will be separated by dots, + and the section and variable segments will be all lowercase. E.g., + `core.ignorecase`, `diff.SomeType.textconv`. + +- the value of the found variable, as a string. If the variable had no + value specified, the value will be NULL (typically this means it + should be interpreted as boolean true). + +- a void pointer passed in by the caller of the config API; this can + contain callback-specific data + +A config callback should return 0 for success, or -1 if the variable +could not be parsed properly. + +Basic Config Querying +--------------------- + +Most programs will simply want to look up variables in all config files +that Git knows about, using the normal precedence rules. To do this, +call `git_config` with a callback function and void data pointer. + +`git_config` will read all config sources in order of increasing +priority. Thus a callback should typically overwrite previously-seen +entries with new ones (e.g., if both the user-wide `~/.gitconfig` and +repo-specific `.git/config` contain `color.ui`, the config machinery +will first feed the user-wide one to the callback, and then the +repo-specific one; by overwriting, the higher-priority repo-specific +value is left at the end). + +The `config_with_options` function lets the caller examine config +while adjusting some of the default behavior of `git_config`. It should +almost never be used by "regular" Git code that is looking up +configuration variables. It is intended for advanced callers like +`git-config`, which are intentionally tweaking the normal config-lookup +process. It takes two extra parameters: + +`config_source`:: +If this parameter is non-NULL, it specifies the source to parse for +configuration, rather than looking in the usual files. See `struct +git_config_source` in `config.h` for details. Regular `git_config` defaults +to `NULL`. + +`opts`:: +Specify options to adjust the behavior of parsing config files. See `struct +config_options` in `config.h` for details. As an example: regular `git_config` +sets `opts.respect_includes` to `1` by default. + +Reading Specific Files +---------------------- + +To read a specific file in git-config format, use +`git_config_from_file`. This takes the same callback and data parameters +as `git_config`. + +Querying For Specific Variables +------------------------------- + +For programs wanting to query for specific variables in a non-callback +manner, the config API provides two functions `git_config_get_value` +and `git_config_get_value_multi`. They both read values from an internal +cache generated previously from reading the config files. + +`int git_config_get_value(const char *key, const char **value)`:: + + Finds the highest-priority value for the configuration variable `key`, + stores the pointer to it in `value` and returns 0. When the + configuration variable `key` is not found, returns 1 without touching + `value`. The caller should not free or modify `value`, as it is owned + by the cache. + +`const struct string_list *git_config_get_value_multi(const char *key)`:: + + Finds and returns the value list, sorted in order of increasing priority + for the configuration variable `key`. When the configuration variable + `key` is not found, returns NULL. The caller should not free or modify + the returned pointer, as it is owned by the cache. + +`void git_config_clear(void)`:: + + Resets and invalidates the config cache. + +The config API also provides type specific API functions which do conversion +as well as retrieval for the queried variable, including: + +`int git_config_get_int(const char *key, int *dest)`:: + + Finds and parses the value to an integer for the configuration variable + `key`. Dies on error; otherwise, stores the value of the parsed integer in + `dest` and returns 0. When the configuration variable `key` is not found, + returns 1 without touching `dest`. + +`int git_config_get_ulong(const char *key, unsigned long *dest)`:: + + Similar to `git_config_get_int` but for unsigned longs. + +`int git_config_get_bool(const char *key, int *dest)`:: + + Finds and parses the value into a boolean value, for the configuration + variable `key` respecting keywords like "true" and "false". Integer + values are converted into true/false values (when they are non-zero or + zero, respectively). Other values cause a die(). If parsing is successful, + stores the value of the parsed result in `dest` and returns 0. When the + configuration variable `key` is not found, returns 1 without touching + `dest`. + +`int git_config_get_bool_or_int(const char *key, int *is_bool, int *dest)`:: + + Similar to `git_config_get_bool`, except that integers are copied as-is, + and `is_bool` flag is unset. + +`int git_config_get_maybe_bool(const char *key, int *dest)`:: + + Similar to `git_config_get_bool`, except that it returns -1 on error + rather than dying. + +`int git_config_get_string_const(const char *key, const char **dest)`:: + + Allocates and copies the retrieved string into the `dest` parameter for + the configuration variable `key`; if NULL string is given, prints an + error message and returns -1. When the configuration variable `key` is + not found, returns 1 without touching `dest`. + +`int git_config_get_string(const char *key, char **dest)`:: + + Similar to `git_config_get_string_const`, except that retrieved value + copied into the `dest` parameter is a mutable string. + +`int git_config_get_pathname(const char *key, const char **dest)`:: + + Similar to `git_config_get_string`, but expands `~` or `~user` into + the user's home directory when found at the beginning of the path. + +`git_die_config(const char *key, const char *err, ...)`:: + + First prints the error message specified by the caller in `err` and then + dies printing the line number and the file name of the highest priority + value for the configuration variable `key`. + +`void git_die_config_linenr(const char *key, const char *filename, int linenr)`:: + + Helper function which formats the die error message according to the + parameters entered. Used by `git_die_config()`. It can be used by callers + handling `git_config_get_value_multi()` to print the correct error message + for the desired value. + +See test-config.c for usage examples. + +Value Parsing Helpers +--------------------- + +To aid in parsing string values, the config API provides callbacks with +a number of helper functions, including: + +`git_config_int`:: +Parse the string to an integer, including unit factors. Dies on error; +otherwise, returns the parsed result. + +`git_config_ulong`:: +Identical to `git_config_int`, but for unsigned longs. + +`git_config_bool`:: +Parse a string into a boolean value, respecting keywords like "true" and +"false". Integer values are converted into true/false values (when they +are non-zero or zero, respectively). Other values cause a die(). If +parsing is successful, the return value is the result. + +`git_config_bool_or_int`:: +Same as `git_config_bool`, except that integers are returned as-is, and +an `is_bool` flag is unset. + +`git_parse_maybe_bool`:: +Same as `git_config_bool`, except that it returns -1 on error rather +than dying. + +`git_config_string`:: +Allocates and copies the value string into the `dest` parameter; if no +string is given, prints an error message and returns -1. + +`git_config_pathname`:: +Similar to `git_config_string`, but expands `~` or `~user` into the +user's home directory when found at the beginning of the path. + +Include Directives +------------------ + +By default, the config parser does not respect include directives. +However, a caller can use the special `git_config_include` wrapper +callback to support them. To do so, you simply wrap your "real" callback +function and data pointer in a `struct config_include_data`, and pass +the wrapper to the regular config-reading functions. For example: + +------------------------------------------- +int read_file_with_include(const char *file, config_fn_t fn, void *data) +{ + struct config_include_data inc = CONFIG_INCLUDE_INIT; + inc.fn = fn; + inc.data = data; + return git_config_from_file(git_config_include, file, &inc); +} +------------------------------------------- + +`git_config` respects includes automatically. The lower-level +`git_config_from_file` does not. + +Custom Configsets +----------------- + +A `config_set` can be used to construct an in-memory cache for +config-like files that the caller specifies (i.e., files like `.gitmodules`, +`~/.gitconfig` etc.). For example, + +---------------------------------------- +struct config_set gm_config; +git_configset_init(&gm_config); +int b; +/* we add config files to the config_set */ +git_configset_add_file(&gm_config, ".gitmodules"); +git_configset_add_file(&gm_config, ".gitmodules_alt"); + +if (!git_configset_get_bool(gm_config, "submodule.frotz.ignore", &b)) { + /* hack hack hack */ +} + +/* when we are done with the configset */ +git_configset_clear(&gm_config); +---------------------------------------- + +Configset API provides functions for the above mentioned work flow, including: + +`void git_configset_init(struct config_set *cs)`:: + + Initializes the config_set `cs`. + +`int git_configset_add_file(struct config_set *cs, const char *filename)`:: + + Parses the file and adds the variable-value pairs to the `config_set`, + dies if there is an error in parsing the file. Returns 0 on success, or + -1 if the file does not exist or is inaccessible. The user has to decide + if he wants to free the incomplete configset or continue using it when + the function returns -1. + +`int git_configset_get_value(struct config_set *cs, const char *key, const char **value)`:: + + Finds the highest-priority value for the configuration variable `key` + and config set `cs`, stores the pointer to it in `value` and returns 0. + When the configuration variable `key` is not found, returns 1 without + touching `value`. The caller should not free or modify `value`, as it + is owned by the cache. + +`const struct string_list *git_configset_get_value_multi(struct config_set *cs, const char *key)`:: + + Finds and returns the value list, sorted in order of increasing priority + for the configuration variable `key` and config set `cs`. When the + configuration variable `key` is not found, returns NULL. The caller + should not free or modify the returned pointer, as it is owned by the cache. + +`void git_configset_clear(struct config_set *cs)`:: + + Clears `config_set` structure, removes all saved variable-value pairs. + +In addition to above functions, the `config_set` API provides type specific +functions in the vein of `git_config_get_int` and family but with an extra +parameter, pointer to struct `config_set`. +They all behave similarly to the `git_config_get*()` family described in +"Querying For Specific Variables" above. + +Writing Config Files +-------------------- + +Git gives multiple entry points in the Config API to write config values to +files namely `git_config_set_in_file` and `git_config_set`, which write to +a specific config file or to `.git/config` respectively. They both take a +key/value pair as parameter. +In the end they both call `git_config_set_multivar_in_file` which takes four +parameters: + +- the name of the file, as a string, to which key/value pairs will be written. + +- the name of key, as a string. This is in canonical "flat" form: the section, + subsection, and variable segments will be separated by dots, and the section + and variable segments will be all lowercase. + E.g., `core.ignorecase`, `diff.SomeType.textconv`. + +- the value of the variable, as a string. If value is equal to NULL, it will + remove the matching key from the config file. + +- the value regex, as a string. It will disregard key/value pairs where value + does not match. + +- a multi_replace value, as an int. If value is equal to zero, nothing or only + one matching key/value is replaced, else all matching key/values (regardless + how many) are removed, before the new pair is written. + +It returns 0 on success. + +Also, there are functions `git_config_rename_section` and +`git_config_rename_section_in_file` with parameters `old_name` and `new_name` +for renaming or removing sections in the config files. If NULL is passed +through `new_name` parameter, the section will be removed from the config file. diff --git a/Documentation/technical/api-credentials.txt b/Documentation/technical/api-credentials.txt new file mode 100644 index 000000000000..75368f26ca28 --- /dev/null +++ b/Documentation/technical/api-credentials.txt @@ -0,0 +1,271 @@ +credentials API +=============== + +The credentials API provides an abstracted way of gathering username and +password credentials from the user (even though credentials in the wider +world can take many forms, in this document the word "credential" always +refers to a username and password pair). + +This document describes two interfaces: the C API that the credential +subsystem provides to the rest of Git, and the protocol that Git uses to +communicate with system-specific "credential helpers". If you are +writing Git code that wants to look up or prompt for credentials, see +the section "C API" below. If you want to write your own helper, see +the section on "Credential Helpers" below. + +Typical setup +------------- + +------------ ++-----------------------+ +| Git code (C) |--- to server requiring ---> +| | authentication +|.......................| +| C credential API |--- prompt ---> User ++-----------------------+ + ^ | + | pipe | + | v ++-----------------------+ +| Git credential helper | ++-----------------------+ +------------ + +The Git code (typically a remote-helper) will call the C API to obtain +credential data like a login/password pair (credential_fill). The +API will itself call a remote helper (e.g. "git credential-cache" or +"git credential-store") that may retrieve credential data from a +store. If the credential helper cannot find the information, the C API +will prompt the user. Then, the caller of the API takes care of +contacting the server, and does the actual authentication. + +C API +----- + +The credential C API is meant to be called by Git code which needs to +acquire or store a credential. It is centered around an object +representing a single credential and provides three basic operations: +fill (acquire credentials by calling helpers and/or prompting the user), +approve (mark a credential as successfully used so that it can be stored +for later use), and reject (mark a credential as unsuccessful so that it +can be erased from any persistent storage). + +Data Structures +~~~~~~~~~~~~~~~ + +`struct credential`:: + + This struct represents a single username/password combination + along with any associated context. All string fields should be + heap-allocated (or NULL if they are not known or not applicable). + The meaning of the individual context fields is the same as + their counterparts in the helper protocol; see the section below + for a description of each field. ++ +The `helpers` member of the struct is a `string_list` of helpers. Each +string specifies an external helper which will be run, in order, to +either acquire or store credentials. See the section on credential +helpers below. This list is filled-in by the API functions +according to the corresponding configuration variables before +consulting helpers, so there usually is no need for a caller to +modify the helpers field at all. ++ +This struct should always be initialized with `CREDENTIAL_INIT` or +`credential_init`. + + +Functions +~~~~~~~~~ + +`credential_init`:: + + Initialize a credential structure, setting all fields to empty. + +`credential_clear`:: + + Free any resources associated with the credential structure, + returning it to a pristine initialized state. + +`credential_fill`:: + + Instruct the credential subsystem to fill the username and + password fields of the passed credential struct by first + consulting helpers, then asking the user. After this function + returns, the username and password fields of the credential are + guaranteed to be non-NULL. If an error occurs, the function will + die(). + +`credential_reject`:: + + Inform the credential subsystem that the provided credentials + have been rejected. This will cause the credential subsystem to + notify any helpers of the rejection (which allows them, for + example, to purge the invalid credentials from storage). It + will also free() the username and password fields of the + credential and set them to NULL (readying the credential for + another call to `credential_fill`). Any errors from helpers are + ignored. + +`credential_approve`:: + + Inform the credential subsystem that the provided credentials + were successfully used for authentication. This will cause the + credential subsystem to notify any helpers of the approval, so + that they may store the result to be used again. Any errors + from helpers are ignored. + +`credential_from_url`:: + + Parse a URL into broken-down credential fields. + +Example +~~~~~~~ + +The example below shows how the functions of the credential API could be +used to login to a fictitious "foo" service on a remote host: + +----------------------------------------------------------------------- +int foo_login(struct foo_connection *f) +{ + int status; + /* + * Create a credential with some context; we don't yet know the + * username or password. + */ + + struct credential c = CREDENTIAL_INIT; + c.protocol = xstrdup("foo"); + c.host = xstrdup(f->hostname); + + /* + * Fill in the username and password fields by contacting + * helpers and/or asking the user. The function will die if it + * fails. + */ + credential_fill(&c); + + /* + * Otherwise, we have a username and password. Try to use it. + */ + status = send_foo_login(f, c.username, c.password); + switch (status) { + case FOO_OK: + /* It worked. Store the credential for later use. */ + credential_accept(&c); + break; + case FOO_BAD_LOGIN: + /* Erase the credential from storage so we don't try it + * again. */ + credential_reject(&c); + break; + default: + /* + * Some other error occurred. We don't know if the + * credential is good or bad, so report nothing to the + * credential subsystem. + */ + } + + /* Free any associated resources. */ + credential_clear(&c); + + return status; +} +----------------------------------------------------------------------- + + +Credential Helpers +------------------ + +Credential helpers are programs executed by Git to fetch or save +credentials from and to long-term storage (where "long-term" is simply +longer than a single Git process; e.g., credentials may be stored +in-memory for a few minutes, or indefinitely on disk). + +Each helper is specified by a single string in the configuration +variable `credential.helper` (and others, see linkgit:git-config[1]). +The string is transformed by Git into a command to be executed using +these rules: + + 1. If the helper string begins with "!", it is considered a shell + snippet, and everything after the "!" becomes the command. + + 2. Otherwise, if the helper string begins with an absolute path, the + verbatim helper string becomes the command. + + 3. Otherwise, the string "git credential-" is prepended to the helper + string, and the result becomes the command. + +The resulting command then has an "operation" argument appended to it +(see below for details), and the result is executed by the shell. + +Here are some example specifications: + +---------------------------------------------------- +# run "git credential-foo" +foo + +# same as above, but pass an argument to the helper +foo --bar=baz + +# the arguments are parsed by the shell, so use shell +# quoting if necessary +foo --bar="whitespace arg" + +# you can also use an absolute path, which will not use the git wrapper +/path/to/my/helper --with-arguments + +# or you can specify your own shell snippet +!f() { echo "password=`cat $HOME/.secret`"; }; f +---------------------------------------------------- + +Generally speaking, rule (3) above is the simplest for users to specify. +Authors of credential helpers should make an effort to assist their +users by naming their program "git-credential-$NAME", and putting it in +the $PATH or $GIT_EXEC_PATH during installation, which will allow a user +to enable it with `git config credential.helper $NAME`. + +When a helper is executed, it will have one "operation" argument +appended to its command line, which is one of: + +`get`:: + + Return a matching credential, if any exists. + +`store`:: + + Store the credential, if applicable to the helper. + +`erase`:: + + Remove a matching credential, if any, from the helper's storage. + +The details of the credential will be provided on the helper's stdin +stream. The exact format is the same as the input/output format of the +`git credential` plumbing command (see the section `INPUT/OUTPUT +FORMAT` in linkgit:git-credential[1] for a detailed specification). + +For a `get` operation, the helper should produce a list of attributes +on stdout in the same format. A helper is free to produce a subset, or +even no values at all if it has nothing useful to provide. Any provided +attributes will overwrite those already known about by Git. If a helper +outputs a `quit` attribute with a value of `true` or `1`, no further +helpers will be consulted, nor will the user be prompted (if no +credential has been provided, the operation will then fail). + +For a `store` or `erase` operation, the helper's output is ignored. +If it fails to perform the requested operation, it may complain to +stderr to inform the user. If it does not support the requested +operation (e.g., a read-only store), it should silently ignore the +request. + +If a helper receives any other operation, it should silently ignore the +request. This leaves room for future operations to be added (older +helpers will just ignore the new requests). + +See also +-------- + +linkgit:gitcredentials[7] + +linkgit:git-config[1] (See configuration variables `credential.*`) diff --git a/Documentation/technical/api-diff.txt b/Documentation/technical/api-diff.txt new file mode 100644 index 000000000000..30fc0e9c93b2 --- /dev/null +++ b/Documentation/technical/api-diff.txt @@ -0,0 +1,174 @@ +diff API +======== + +The diff API is for programs that compare two sets of files (e.g. two +trees, one tree and the index) and present the found difference in +various ways. The calling program is responsible for feeding the API +pairs of files, one from the "old" set and the corresponding one from +"new" set, that are different. The library called through this API is +called diffcore, and is responsible for two things. + +* finding total rewrites (`-B`), renames (`-M`) and copies (`-C`), and + changes that touch a string (`-S`), as specified by the caller. + +* outputting the differences in various formats, as specified by the + caller. + +Calling sequence +---------------- + +* Prepare `struct diff_options` to record the set of diff options, and + then call `repo_diff_setup()` to initialize this structure. This + sets up the vanilla default. + +* Fill in the options structure to specify desired output format, rename + detection, etc. `diff_opt_parse()` can be used to parse options given + from the command line in a way consistent with existing git-diff + family of programs. + +* Call `diff_setup_done()`; this inspects the options set up so far for + internal consistency and make necessary tweaking to it (e.g. if + textual patch output was asked, recursive behaviour is turned on); + the callback set_default in diff_options can be used to tweak this more. + +* As you find different pairs of files, call `diff_change()` to feed + modified files, `diff_addremove()` to feed created or deleted files, + or `diff_unmerge()` to feed a file whose state is 'unmerged' to the + API. These are thin wrappers to a lower-level `diff_queue()` function + that is flexible enough to record any of these kinds of changes. + +* Once you finish feeding the pairs of files, call `diffcore_std()`. + This will tell the diffcore library to go ahead and do its work. + +* Calling `diff_flush()` will produce the output. + + +Data structures +--------------- + +* `struct diff_filespec` + +This is the internal representation for a single file (blob). It +records the blob object name (if known -- for a work tree file it +typically is a NUL SHA-1), filemode and pathname. This is what the +`diff_addremove()`, `diff_change()` and `diff_unmerge()` synthesize and +feed `diff_queue()` function with. + +* `struct diff_filepair` + +This records a pair of `struct diff_filespec`; the filespec for a file +in the "old" set (i.e. preimage) is called `one`, and the filespec for a +file in the "new" set (i.e. postimage) is called `two`. A change that +represents file creation has NULL in `one`, and file deletion has NULL +in `two`. + +A `filepair` starts pointing at `one` and `two` that are from the same +filename, but `diffcore_std()` can break pairs and match component +filespecs with other filespecs from a different filepair to form new +filepair. This is called 'rename detection'. + +* `struct diff_queue` + +This is a collection of filepairs. Notable members are: + +`queue`:: + + An array of pointers to `struct diff_filepair`. This + dynamically grows as you add filepairs; + +`alloc`:: + + The allocated size of the `queue` array; + +`nr`:: + + The number of elements in the `queue` array. + + +* `struct diff_options` + +This describes the set of options the calling program wants to affect +the operation of diffcore library with. + +Notable members are: + +`output_format`:: + The output format used when `diff_flush()` is run. + +`context`:: + Number of context lines to generate in patch output. + +`break_opt`, `detect_rename`, `rename-score`, `rename_limit`:: + Affects the way detection logic for complete rewrites, renames + and copies. + +`abbrev`:: + Number of hexdigits to abbreviate raw format output to. + +`pickaxe`:: + A constant string (can and typically does contain newlines to + look for a block of text, not just a single line) to filter out + the filepairs that do not change the number of strings contained + in its preimage and postimage of the diff_queue. + +`flags`:: + This is mostly a collection of boolean options that affects the + operation, but some do not have anything to do with the diffcore + library. + +`touched_flags`:: + Records whether a flag has been changed due to user request + (rather than just set/unset by default). + +`set_default`:: + Callback which allows tweaking the options in diff_setup_done(). + +BINARY, TEXT;; + Affects the way how a file that is seemingly binary is treated. + +FULL_INDEX;; + Tells the patch output format not to use abbreviated object + names on the "index" lines. + +FIND_COPIES_HARDER;; + Tells the diffcore library that the caller is feeding unchanged + filepairs to allow copies from unmodified files be detected. + +COLOR_DIFF;; + Output should be colored. + +COLOR_DIFF_WORDS;; + Output is a colored word-diff. + +NO_INDEX;; + Tells diff-files that the input is not tracked files but files + in random locations on the filesystem. + +ALLOW_EXTERNAL;; + Tells output routine that it is Ok to call user specified patch + output routine. Plumbing disables this to ensure stable output. + +QUIET;; + Do not show any output. + +REVERSE_DIFF;; + Tells the library that the calling program is feeding the + filepairs reversed; `one` is two, and `two` is one. + +EXIT_WITH_STATUS;; + For communication between the calling program and the options + parser; tell the calling program to signal the presence of + difference using program exit code. + +HAS_CHANGES;; + Internal; used for optimization to see if there is any change. + +SILENT_ON_REMOVE;; + Affects if diff-files shows removed files. + +RECURSIVE, TREE_IN_RECURSIVE;; + Tells if tree traversal done by tree-diff should recursively + descend into a tree object pair that are different in preimage + and postimage set. + +(JC) diff --git a/Documentation/technical/api-directory-listing.txt b/Documentation/technical/api-directory-listing.txt new file mode 100644 index 000000000000..5abb8e8b1fdb --- /dev/null +++ b/Documentation/technical/api-directory-listing.txt @@ -0,0 +1,130 @@ +directory listing API +===================== + +The directory listing API is used to enumerate paths in the work tree, +optionally taking `.git/info/exclude` and `.gitignore` files per +directory into account. + +Data structure +-------------- + +`struct dir_struct` structure is used to pass directory traversal +options to the library and to record the paths discovered. A single +`struct dir_struct` is used regardless of whether or not the traversal +recursively descends into subdirectories. + +The notable options are: + +`exclude_per_dir`:: + + The name of the file to be read in each directory for excluded + files (typically `.gitignore`). + +`flags`:: + + A bit-field of options: + +`DIR_SHOW_IGNORED`::: + + Return just ignored files in `entries[]`, not untracked + files. This flag is mutually exclusive with + `DIR_SHOW_IGNORED_TOO`. + +`DIR_SHOW_IGNORED_TOO`::: + + Similar to `DIR_SHOW_IGNORED`, but return ignored files in + `ignored[]` in addition to untracked files in + `entries[]`. This flag is mutually exclusive with + `DIR_SHOW_IGNORED`. + +`DIR_KEEP_UNTRACKED_CONTENTS`::: + + Only has meaning if `DIR_SHOW_IGNORED_TOO` is also set; if this is set, the + untracked contents of untracked directories are also returned in + `entries[]`. + +`DIR_SHOW_IGNORED_TOO_MODE_MATCHING`::: + + Only has meaning if `DIR_SHOW_IGNORED_TOO` is also set; if + this is set, returns ignored files and directories that match + an exclude pattern. If a directory matches an exclude pattern, + then the directory is returned and the contained paths are + not. A directory that does not match an exclude pattern will + not be returned even if all of its contents are ignored. In + this case, the contents are returned as individual entries. ++ +If this is set, files and directories that explicitly match an ignore +pattern are reported. Implicitly ignored directories (directories that +do not match an ignore pattern, but whose contents are all ignored) +are not reported, instead all of the contents are reported. + +`DIR_COLLECT_IGNORED`::: + + Special mode for git-add. Return ignored files in `ignored[]` and + untracked files in `entries[]`. Only returns ignored files that match + pathspec exactly (no wildcards). Does not recurse into ignored + directories. + +`DIR_SHOW_OTHER_DIRECTORIES`::: + + Include a directory that is not tracked. + +`DIR_HIDE_EMPTY_DIRECTORIES`::: + + Do not include a directory that is not tracked and is empty. + +`DIR_NO_GITLINKS`::: + + If set, recurse into a directory that looks like a Git + directory. Otherwise it is shown as a directory. + +The result of the enumeration is left in these fields: + +`entries[]`:: + + An array of `struct dir_entry`, each element of which describes + a path. + +`nr`:: + + The number of members in `entries[]` array. + +`alloc`:: + + Internal use; keeps track of allocation of `entries[]` array. + +`ignored[]`:: + + An array of `struct dir_entry`, used for ignored paths with the + `DIR_SHOW_IGNORED_TOO` and `DIR_COLLECT_IGNORED` flags. + +`ignored_nr`:: + + The number of members in `ignored[]` array. + +Calling sequence +---------------- + +Note: index may be looked at for .gitignore files that are CE_SKIP_WORKTREE +marked. If you to exclude files, make sure you have loaded index first. + +* Prepare `struct dir_struct dir` and clear it with `memset(&dir, 0, + sizeof(dir))`. + +* To add single exclude pattern, call `add_exclude_list()` and then + `add_exclude()`. + +* To add patterns from a file (e.g. `.git/info/exclude`), call + `add_excludes_from_file()` , and/or set `dir.exclude_per_dir`. A + short-hand function `setup_standard_excludes()` can be used to set + up the standard set of exclude settings. + +* Set options described in the Data Structure section above. + +* Call `read_directory()`. + +* Use `dir.entries[]`. + +* Call `clear_directory()` when none of the contained elements are no longer in use. + +(JC) diff --git a/Documentation/technical/api-error-handling.txt b/Documentation/technical/api-error-handling.txt new file mode 100644 index 000000000000..ceeedd485c96 --- /dev/null +++ b/Documentation/technical/api-error-handling.txt @@ -0,0 +1,75 @@ +Error reporting in git +====================== + +`die`, `usage`, `error`, and `warning` report errors of various +kinds. + +- `die` is for fatal application errors. It prints a message to + the user and exits with status 128. + +- `usage` is for errors in command line usage. After printing its + message, it exits with status 129. (See also `usage_with_options` + in the link:api-parse-options.html[parse-options API].) + +- `error` is for non-fatal library errors. It prints a message + to the user and returns -1 for convenience in signaling the error + to the caller. + +- `warning` is for reporting situations that probably should not + occur but which the user (and Git) can continue to work around + without running into too many problems. Like `error`, it + returns -1 after reporting the situation to the caller. + +Customizable error handlers +--------------------------- + +The default behavior of `die` and `error` is to write a message to +stderr and then exit or return as appropriate. This behavior can be +overridden using `set_die_routine` and `set_error_routine`. For +example, "git daemon" uses set_die_routine to write the reason `die` +was called to syslog before exiting. + +Library errors +-------------- + +Functions return a negative integer on error. Details beyond that +vary from function to function: + +- Some functions return -1 for all errors. Others return a more + specific value depending on how the caller might want to react + to the error. + +- Some functions report the error to stderr with `error`, + while others leave that for the caller to do. + +- errno is not meaningful on return from most functions (except + for thin wrappers for system calls). + +Check the function's API documentation to be sure. + +Caller-handled errors +--------------------- + +An increasing number of functions take a parameter 'struct strbuf *err'. +On error, such functions append a message about what went wrong to the +'err' strbuf. The message is meant to be complete enough to be passed +to `die` or `error` as-is. For example: + + if (ref_transaction_commit(transaction, &err)) + die("%s", err.buf); + +The 'err' parameter will be untouched if no error occurred, so multiple +function calls can be chained: + + t = ref_transaction_begin(&err); + if (!t || + ref_transaction_update(t, "HEAD", ..., &err) || + ret_transaction_commit(t, &err)) + die("%s", err.buf); + +The 'err' parameter must be a pointer to a valid strbuf. To silence +a message, pass a strbuf that is explicitly ignored: + + if (thing_that_can_fail_in_an_ignorable_way(..., &err)) + /* This failure is okay. */ + strbuf_reset(&err); diff --git a/Documentation/technical/api-gitattributes.txt b/Documentation/technical/api-gitattributes.txt new file mode 100644 index 000000000000..45f0df600fab --- /dev/null +++ b/Documentation/technical/api-gitattributes.txt @@ -0,0 +1,154 @@ +gitattributes API +================= + +gitattributes mechanism gives a uniform way to associate various +attributes to set of paths. + + +Data Structure +-------------- + +`struct git_attr`:: + + An attribute is an opaque object that is identified by its name. + Pass the name to `git_attr()` function to obtain the object of + this type. The internal representation of this structure is + of no interest to the calling programs. The name of the + attribute can be retrieved by calling `git_attr_name()`. + +`struct attr_check_item`:: + + This structure represents one attribute and its value. + +`struct attr_check`:: + + This structure represents a collection of `attr_check_item`. + It is passed to `git_check_attr()` function, specifying the + attributes to check, and receives their values. + + +Attribute Values +---------------- + +An attribute for a path can be in one of four states: Set, Unset, +Unspecified or set to a string, and `.value` member of `struct +attr_check_item` records it. There are three macros to check these: + +`ATTR_TRUE()`:: + + Returns true if the attribute is Set for the path. + +`ATTR_FALSE()`:: + + Returns true if the attribute is Unset for the path. + +`ATTR_UNSET()`:: + + Returns true if the attribute is Unspecified for the path. + +If none of the above returns true, `.value` member points at a string +value of the attribute for the path. + + +Querying Specific Attributes +---------------------------- + +* Prepare `struct attr_check` using attr_check_initl() + function, enumerating the names of attributes whose values you are + interested in, terminated with a NULL pointer. Alternatively, an + empty `struct attr_check` can be prepared by calling + `attr_check_alloc()` function and then attributes you want to + ask about can be added to it with `attr_check_append()` + function. + +* Call `git_check_attr()` to check the attributes for the path. + +* Inspect `attr_check` structure to see how each of the + attribute in the array is defined for the path. + + +Example +------- + +To see how attributes "crlf" and "ident" are set for different paths. + +. Prepare a `struct attr_check` with two elements (because + we are checking two attributes): + +------------ +static struct attr_check *check; +static void setup_check(void) +{ + if (check) + return; /* already done */ + check = attr_check_initl("crlf", "ident", NULL); +} +------------ + +. Call `git_check_attr()` with the prepared `struct attr_check`: + +------------ + const char *path; + + setup_check(); + git_check_attr(path, check); +------------ + +. Act on `.value` member of the result, left in `check->items[]`: + +------------ + const char *value = check->items[0].value; + + if (ATTR_TRUE(value)) { + The attribute is Set, by listing only the name of the + attribute in the gitattributes file for the path. + } else if (ATTR_FALSE(value)) { + The attribute is Unset, by listing the name of the + attribute prefixed with a dash - for the path. + } else if (ATTR_UNSET(value)) { + The attribute is neither set nor unset for the path. + } else if (!strcmp(value, "input")) { + If none of ATTR_TRUE(), ATTR_FALSE(), or ATTR_UNSET() is + true, the value is a string set in the gitattributes + file for the path by saying "attr=value". + } else if (... other check using value as string ...) { + ... + } +------------ + +To see how attributes in argv[] are set for different paths, only +the first step in the above would be different. + +------------ +static struct attr_check *check; +static void setup_check(const char **argv) +{ + check = attr_check_alloc(); + while (*argv) { + struct git_attr *attr = git_attr(*argv); + attr_check_append(check, attr); + argv++; + } +} +------------ + + +Querying All Attributes +----------------------- + +To get the values of all attributes associated with a file: + +* Prepare an empty `attr_check` structure by calling + `attr_check_alloc()`. + +* Call `git_all_attrs()`, which populates the `attr_check` + with the attributes attached to the path. + +* Iterate over the `attr_check.items[]` array to examine + the attribute names and values. The name of the attribute + described by an `attr_check.items[]` object can be retrieved via + `git_attr_name(check->items[i].attr)`. (Please note that no items + will be returned for unset attributes, so `ATTR_UNSET()` will return + false for all returned `attr_check.items[]` objects.) + +* Free the `attr_check` struct by calling `attr_check_free()`. diff --git a/Documentation/technical/api-grep.txt b/Documentation/technical/api-grep.txt new file mode 100644 index 000000000000..a69cc8964d58 --- /dev/null +++ b/Documentation/technical/api-grep.txt @@ -0,0 +1,8 @@ +grep API +======== + +Talk about <grep.h>, things like: + +* grep_buffer() + +(JC) diff --git a/Documentation/technical/api-history-graph.txt b/Documentation/technical/api-history-graph.txt new file mode 100644 index 000000000000..d0d1707c8c88 --- /dev/null +++ b/Documentation/technical/api-history-graph.txt @@ -0,0 +1,173 @@ +history graph API +================= + +The graph API is used to draw a text-based representation of the commit +history. The API generates the graph in a line-by-line fashion. + +Functions +--------- + +Core functions: + +* `graph_init()` creates a new `struct git_graph` + +* `graph_update()` moves the graph to a new commit. + +* `graph_next_line()` outputs the next line of the graph into a strbuf. It + does not add a terminating newline. + +* `graph_padding_line()` outputs a line of vertical padding in the graph. It + is similar to `graph_next_line()`, but is guaranteed to never print the line + containing the current commit. Where `graph_next_line()` would print the + commit line next, `graph_padding_line()` prints a line that simply extends + all branch lines downwards one row, leaving their positions unchanged. + +* `graph_is_commit_finished()` determines if the graph has output all lines + necessary for the current commit. If `graph_update()` is called before all + lines for the current commit have been printed, the next call to + `graph_next_line()` will output an ellipsis, to indicate that a portion of + the graph was omitted. + +The following utility functions are wrappers around `graph_next_line()` and +`graph_is_commit_finished()`. They always print the output to stdout. +They can all be called with a NULL graph argument, in which case no graph +output will be printed. + +* `graph_show_commit()` calls `graph_next_line()` and + `graph_is_commit_finished()` until one of them return non-zero. This prints + all graph lines up to, and including, the line containing this commit. + Output is printed to stdout. The last line printed does not contain a + terminating newline. + +* `graph_show_oneline()` calls `graph_next_line()` and prints the result to + stdout. The line printed does not contain a terminating newline. + +* `graph_show_padding()` calls `graph_padding_line()` and prints the result to + stdout. The line printed does not contain a terminating newline. + +* `graph_show_remainder()` calls `graph_next_line()` until + `graph_is_commit_finished()` returns non-zero. Output is printed to stdout. + The last line printed does not contain a terminating newline. Returns 1 if + output was printed, and 0 if no output was necessary. + +* `graph_show_strbuf()` prints the specified strbuf to stdout, prefixing all + lines but the first with a graph line. The caller is responsible for + ensuring graph output for the first line has already been printed to stdout. + (This can be done with `graph_show_commit()` or `graph_show_oneline()`.) If + a NULL graph is supplied, the strbuf is printed as-is. + +* `graph_show_commit_msg()` is similar to `graph_show_strbuf()`, but it also + prints the remainder of the graph, if more lines are needed after the strbuf + ends. It is better than directly calling `graph_show_strbuf()` followed by + `graph_show_remainder()` since it properly handles buffers that do not end in + a terminating newline. The output printed by `graph_show_commit_msg()` will + end in a newline if and only if the strbuf ends in a newline. + +Data structure +-------------- +`struct git_graph` is an opaque data type used to store the current graph +state. + +Calling sequence +---------------- + +* Create a `struct git_graph` by calling `graph_init()`. When using the + revision walking API, this is done automatically by `setup_revisions()` if + the '--graph' option is supplied. + +* Use the revision walking API to walk through a group of contiguous commits. + The `get_revision()` function automatically calls `graph_update()` each time + it is invoked. + +* For each commit, call `graph_next_line()` repeatedly, until + `graph_is_commit_finished()` returns non-zero. Each call to + `graph_next_line()` will output a single line of the graph. The resulting + lines will not contain any newlines. `graph_next_line()` returns 1 if the + resulting line contains the current commit, or 0 if this is merely a line + needed to adjust the graph before or after the current commit. This return + value can be used to determine where to print the commit summary information + alongside the graph output. + +Limitations +----------- + +* `graph_update()` must be called with commits in topological order. It should + not be called on a commit if it has already been invoked with an ancestor of + that commit, or the graph output will be incorrect. + +* `graph_update()` must be called on a contiguous group of commits. If + `graph_update()` is called on a particular commit, it should later be called + on all parents of that commit. Parents must not be skipped, or the graph + output will appear incorrect. ++ +`graph_update()` may be used on a pruned set of commits only if the parent list +has been rewritten so as to include only ancestors from the pruned set. + +* The graph API does not currently support reverse commit ordering. In + order to implement reverse ordering, the graphing API needs an + (efficient) mechanism to find the children of a commit. + +Sample usage +------------ + +------------ +struct commit *commit; +struct git_graph *graph = graph_init(opts); + +while ((commit = get_revision(opts)) != NULL) { + while (!graph_is_commit_finished(graph)) + { + struct strbuf sb; + int is_commit_line; + + strbuf_init(&sb, 0); + is_commit_line = graph_next_line(graph, &sb); + fputs(sb.buf, stdout); + + if (is_commit_line) + log_tree_commit(opts, commit); + else + putchar(opts->diffopt.line_termination); + } +} +------------ + +Sample output +------------- + +The following is an example of the output from the graph API. This output does +not include any commit summary information--callers are responsible for +outputting that information, if desired. + +------------ +* +* +* +|\ +* | +| | * +| \ \ +| \ \ +*-. \ \ +|\ \ \ \ +| | * | | +| | | | | * +| | | | | * +| | | | | * +| | | | | |\ +| | | | | | * +| * | | | | | +| | | | | * \ +| | | | | |\ | +| | | | * | | | +| | | | * | | | +* | | | | | | | +| |/ / / / / / +|/| / / / / / +* | | | | | | +|/ / / / / / +* | | | | | +| | | | | * +| | | | |/ +| | | | * +------------ diff --git a/Documentation/technical/api-index-skel.txt b/Documentation/technical/api-index-skel.txt new file mode 100644 index 000000000000..eda8c195c196 --- /dev/null +++ b/Documentation/technical/api-index-skel.txt @@ -0,0 +1,13 @@ +Git API Documents +================= + +Git has grown a set of internal API over time. This collection +documents them. + +//////////////////////////////////////////////////////////////// +// table of contents begin +//////////////////////////////////////////////////////////////// + +//////////////////////////////////////////////////////////////// +// table of contents end +//////////////////////////////////////////////////////////////// diff --git a/Documentation/technical/api-index.sh b/Documentation/technical/api-index.sh new file mode 100755 index 000000000000..9c3f4131b858 --- /dev/null +++ b/Documentation/technical/api-index.sh @@ -0,0 +1,28 @@ +#!/bin/sh + +( + c=//////////////////////////////////////////////////////////////// + skel=api-index-skel.txt + sed -e '/^\/\/ table of contents begin/q' "$skel" + echo "$c" + + ls api-*.txt | + while read filename + do + case "$filename" in + api-index-skel.txt | api-index.txt) continue ;; + esac + title=$(sed -e 1q "$filename") + html=${filename%.txt}.html + echo "* link:$html[$title]" + done + echo "$c" + sed -n -e '/^\/\/ table of contents end/,$p' "$skel" +) >api-index.txt+ + +if test -f api-index.txt && cmp api-index.txt api-index.txt+ >/dev/null +then + rm -f api-index.txt+ +else + mv api-index.txt+ api-index.txt +fi diff --git a/Documentation/technical/api-merge.txt b/Documentation/technical/api-merge.txt new file mode 100644 index 000000000000..9dc1bed768a4 --- /dev/null +++ b/Documentation/technical/api-merge.txt @@ -0,0 +1,104 @@ +merge API +========= + +The merge API helps a program to reconcile two competing sets of +improvements to some files (e.g., unregistered changes from the work +tree versus changes involved in switching to a new branch), reporting +conflicts if found. The library called through this API is +responsible for a few things. + + * determining which trees to merge (recursive ancestor consolidation); + + * lining up corresponding files in the trees to be merged (rename + detection, subtree shifting), reporting edge cases like add/add + and rename/rename conflicts to the user; + + * performing a three-way merge of corresponding files, taking + path-specific merge drivers (specified in `.gitattributes`) + into account. + +Data structures +--------------- + +* `mmbuffer_t`, `mmfile_t` + +These store data usable for use by the xdiff backend, for writing and +for reading, respectively. See `xdiff/xdiff.h` for the definitions +and `diff.c` for examples. + +* `struct ll_merge_options` + +This describes the set of options the calling program wants to affect +the operation of a low-level (single file) merge. Some options: + +`virtual_ancestor`:: + Behave as though this were part of a merge between common + ancestors in a recursive merge. + If a helper program is specified by the + `[merge "<driver>"] recursive` configuration, it will + be used (see linkgit:gitattributes[5]). + +`variant`:: + Resolve local conflicts automatically in favor + of one side or the other (as in 'git merge-file' + `--ours`/`--theirs`/`--union`). Can be `0`, + `XDL_MERGE_FAVOR_OURS`, `XDL_MERGE_FAVOR_THEIRS`, or + `XDL_MERGE_FAVOR_UNION`. + +`renormalize`:: + Resmudge and clean the "base", "theirs" and "ours" files + before merging. Use this when the merge is likely to have + overlapped with a change in smudge/clean or end-of-line + normalization rules. + +Low-level (single file) merge +----------------------------- + +`ll_merge`:: + + Perform a three-way single-file merge in core. This is + a thin wrapper around `xdl_merge` that takes the path and + any merge backend specified in `.gitattributes` or + `.git/info/attributes` into account. Returns 0 for a + clean merge. + +Calling sequence: + +* Prepare a `struct ll_merge_options` to record options. + If you have no special requests, skip this and pass `NULL` + as the `opts` parameter to use the default options. + +* Allocate an mmbuffer_t variable for the result. + +* Allocate and fill variables with the file's original content + and two modified versions (using `read_mmfile`, for example). + +* Call `ll_merge()`. + +* Read the merged content from `result_buf.ptr` and `result_buf.size`. + +* Release buffers when finished. A simple + `free(ancestor.ptr); free(ours.ptr); free(theirs.ptr); + free(result_buf.ptr);` will do. + +If the modifications do not merge cleanly, `ll_merge` will return a +nonzero value and `result_buf` will generally include a description of +the conflict bracketed by markers such as the traditional `<<<<<<<` +and `>>>>>>>`. + +The `ancestor_label`, `our_label`, and `their_label` parameters are +used to label the different sides of a conflict if the merge driver +supports this. + +Everything else +--------------- + +Talk about <merge-recursive.h> and merge_file(): + + - merge_trees() to merge with rename detection + - merge_recursive() for ancestor consolidation + - try_merge_command() for other strategies + - conflict format + - merge options + +(Daniel, Miklos, Stephan, JC) diff --git a/Documentation/technical/api-object-access.txt b/Documentation/technical/api-object-access.txt new file mode 100644 index 000000000000..5b29622d00ea --- /dev/null +++ b/Documentation/technical/api-object-access.txt @@ -0,0 +1,15 @@ +object access API +================= + +Talk about <sha1-file.c> and <object.h> family, things like + +* read_sha1_file() +* read_object_with_reference() +* has_sha1_file() +* write_sha1_file() +* pretend_object_file() +* lookup_{object,commit,tag,blob,tree} +* parse_{object,commit,tag,blob,tree} +* Use of object flags + +(JC, Shawn, Daniel, Dscho, Linus) diff --git a/Documentation/technical/api-oid-array.txt b/Documentation/technical/api-oid-array.txt new file mode 100644 index 000000000000..c97428c2c349 --- /dev/null +++ b/Documentation/technical/api-oid-array.txt @@ -0,0 +1,90 @@ +oid-array API +============== + +The oid-array API provides storage and manipulation of sets of object +identifiers. The emphasis is on storage and processing efficiency, +making them suitable for large lists. Note that the ordering of items is +not preserved over some operations. + +Data Structures +--------------- + +`struct oid_array`:: + + A single array of object IDs. This should be initialized by + assignment from `OID_ARRAY_INIT`. The `oid` member contains + the actual data. The `nr` member contains the number of items in + the set. The `alloc` and `sorted` members are used internally, + and should not be needed by API callers. + +Functions +--------- + +`oid_array_append`:: + Add an item to the set. The object ID will be placed at the end of + the array (but note that some operations below may lose this + ordering). + +`oid_array_lookup`:: + Perform a binary search of the array for a specific object ID. + If found, returns the offset (in number of elements) of the + object ID. If not found, returns a negative integer. If the array + is not sorted, this function has the side effect of sorting it. + +`oid_array_clear`:: + Free all memory associated with the array and return it to the + initial, empty state. + +`oid_array_for_each`:: + Iterate over each element of the list, executing the callback + function for each one. Does not sort the list, so any custom + hash order is retained. If the callback returns a non-zero + value, the iteration ends immediately and the callback's + return is propagated; otherwise, 0 is returned. + +`oid_array_for_each_unique`:: + Iterate over each unique element of the list in sorted order, + but otherwise behave like `oid_array_for_each`. If the array + is not sorted, this function has the side effect of sorting + it. + +`oid_array_filter`:: + Apply the callback function `want` to each entry in the array, + retaining only the entries for which the function returns true. + Preserve the order of the entries that are retained. + +Examples +-------- + +----------------------------------------- +int print_callback(const struct object_id *oid, + void *data) +{ + printf("%s\n", oid_to_hex(oid)); + return 0; /* always continue */ +} + +void some_func(void) +{ + struct sha1_array hashes = OID_ARRAY_INIT; + struct object_id oid; + + /* Read objects into our set */ + while (read_object_from_stdin(oid.hash)) + oid_array_append(&hashes, &oid); + + /* Check if some objects are in our set */ + while (read_object_from_stdin(oid.hash)) { + if (oid_array_lookup(&hashes, &oid) >= 0) + printf("it's in there!\n"); + + /* + * Print the unique set of objects. We could also have + * avoided adding duplicate objects in the first place, + * but we would end up re-sorting the array repeatedly. + * Instead, this will sort once and then skip duplicates + * in linear time. + */ + oid_array_for_each_unique(&hashes, print_callback, NULL); +} +----------------------------------------- diff --git a/Documentation/technical/api-parse-options.txt b/Documentation/technical/api-parse-options.txt new file mode 100644 index 000000000000..2e2e7c10c620 --- /dev/null +++ b/Documentation/technical/api-parse-options.txt @@ -0,0 +1,313 @@ +parse-options API +================= + +The parse-options API is used to parse and massage options in Git +and to provide a usage help with consistent look. + +Basics +------ + +The argument vector `argv[]` may usually contain mandatory or optional +'non-option arguments', e.g. a filename or a branch, and 'options'. +Options are optional arguments that start with a dash and +that allow to change the behavior of a command. + +* There are basically three types of options: + 'boolean' options, + options with (mandatory) 'arguments' and + options with 'optional arguments' + (i.e. a boolean option that can be adjusted). + +* There are basically two forms of options: + 'Short options' consist of one dash (`-`) and one alphanumeric + character. + 'Long options' begin with two dashes (`--`) and some + alphanumeric characters. + +* Options are case-sensitive. + Please define 'lower-case long options' only. + +The parse-options API allows: + +* 'stuck' and 'separate form' of options with arguments. + `-oArg` is stuck, `-o Arg` is separate form. + `--option=Arg` is stuck, `--option Arg` is separate form. + +* Long options may be 'abbreviated', as long as the abbreviation + is unambiguous. + +* Short options may be bundled, e.g. `-a -b` can be specified as `-ab`. + +* Boolean long options can be 'negated' (or 'unset') by prepending + `no-`, e.g. `--no-abbrev` instead of `--abbrev`. Conversely, + options that begin with `no-` can be 'negated' by removing it. + Other long options can be unset (e.g., set string to NULL, set + integer to 0) by prepending `no-`. + +* Options and non-option arguments can clearly be separated using the `--` + option, e.g. `-a -b --option -- --this-is-a-file` indicates that + `--this-is-a-file` must not be processed as an option. + +Steps to parse options +---------------------- + +. `#include "parse-options.h"` + +. define a NULL-terminated + `static const char * const builtin_foo_usage[]` array + containing alternative usage strings + +. define `builtin_foo_options` array as described below + in section 'Data Structure'. + +. in `cmd_foo(int argc, const char **argv, const char *prefix)` + call + + argc = parse_options(argc, argv, prefix, builtin_foo_options, builtin_foo_usage, flags); ++ +`parse_options()` will filter out the processed options of `argv[]` and leave the +non-option arguments in `argv[]`. +`argc` is updated appropriately because of the assignment. ++ +You can also pass NULL instead of a usage array as the fifth parameter of +parse_options(), to avoid displaying a help screen with usage info and +option list. This should only be done if necessary, e.g. to implement +a limited parser for only a subset of the options that needs to be run +before the full parser, which in turn shows the full help message. ++ +Flags are the bitwise-or of: + +`PARSE_OPT_KEEP_DASHDASH`:: + Keep the `--` that usually separates options from + non-option arguments. + +`PARSE_OPT_STOP_AT_NON_OPTION`:: + Usually the whole argument vector is massaged and reordered. + Using this flag, processing is stopped at the first non-option + argument. + +`PARSE_OPT_KEEP_ARGV0`:: + Keep the first argument, which contains the program name. It's + removed from argv[] by default. + +`PARSE_OPT_KEEP_UNKNOWN`:: + Keep unknown arguments instead of erroring out. This doesn't + work for all combinations of arguments as users might expect + it to do. E.g. if the first argument in `--unknown --known` + takes a value (which we can't know), the second one is + mistakenly interpreted as a known option. Similarly, if + `PARSE_OPT_STOP_AT_NON_OPTION` is set, the second argument in + `--unknown value` will be mistakenly interpreted as a + non-option, not as a value belonging to the unknown option, + the parser early. That's why parse_options() errors out if + both options are set. + +`PARSE_OPT_NO_INTERNAL_HELP`:: + By default, parse_options() handles `-h`, `--help` and + `--help-all` internally, by showing a help screen. This option + turns it off and allows one to add custom handlers for these + options, or to just leave them unknown. + +Data Structure +-------------- + +The main data structure is an array of the `option` struct, +say `static struct option builtin_add_options[]`. +There are some macros to easily define options: + +`OPT__ABBREV(&int_var)`:: + Add `--abbrev[=<n>]`. + +`OPT__COLOR(&int_var, description)`:: + Add `--color[=<when>]` and `--no-color`. + +`OPT__DRY_RUN(&int_var, description)`:: + Add `-n, --dry-run`. + +`OPT__FORCE(&int_var, description)`:: + Add `-f, --force`. + +`OPT__QUIET(&int_var, description)`:: + Add `-q, --quiet`. + +`OPT__VERBOSE(&int_var, description)`:: + Add `-v, --verbose`. + +`OPT_GROUP(description)`:: + Start an option group. `description` is a short string that + describes the group or an empty string. + Start the description with an upper-case letter. + +`OPT_BOOL(short, long, &int_var, description)`:: + Introduce a boolean option. `int_var` is set to one with + `--option` and set to zero with `--no-option`. + +`OPT_COUNTUP(short, long, &int_var, description)`:: + Introduce a count-up option. + Each use of `--option` increments `int_var`, starting from zero + (even if initially negative), and `--no-option` resets it to + zero. To determine if `--option` or `--no-option` was encountered at + all, initialize `int_var` to a negative value, and if it is still + negative after parse_options(), then neither `--option` nor + `--no-option` was seen. + +`OPT_BIT(short, long, &int_var, description, mask)`:: + Introduce a boolean option. + If used, `int_var` is bitwise-ored with `mask`. + +`OPT_NEGBIT(short, long, &int_var, description, mask)`:: + Introduce a boolean option. + If used, `int_var` is bitwise-anded with the inverted `mask`. + +`OPT_SET_INT(short, long, &int_var, description, integer)`:: + Introduce an integer option. + `int_var` is set to `integer` with `--option`, and + reset to zero with `--no-option`. + +`OPT_STRING(short, long, &str_var, arg_str, description)`:: + Introduce an option with string argument. + The string argument is put into `str_var`. + +`OPT_STRING_LIST(short, long, &struct string_list, arg_str, description)`:: + Introduce an option with string argument. + The string argument is stored as an element in `string_list`. + Use of `--no-option` will clear the list of preceding values. + +`OPT_INTEGER(short, long, &int_var, description)`:: + Introduce an option with integer argument. + The integer is put into `int_var`. + +`OPT_MAGNITUDE(short, long, &unsigned_long_var, description)`:: + Introduce an option with a size argument. The argument must be a + non-negative integer and may include a suffix of 'k', 'm' or 'g' to + scale the provided value by 1024, 1024^2 or 1024^3 respectively. + The scaled value is put into `unsigned_long_var`. + +`OPT_EXPIRY_DATE(short, long, ×tamp_t_var, description)`:: + Introduce an option with expiry date argument, see `parse_expiry_date()`. + The timestamp is put into `timestamp_t_var`. + +`OPT_CALLBACK(short, long, &var, arg_str, description, func_ptr)`:: + Introduce an option with argument. + The argument will be fed into the function given by `func_ptr` + and the result will be put into `var`. + See 'Option Callbacks' below for a more elaborate description. + +`OPT_FILENAME(short, long, &var, description)`:: + Introduce an option with a filename argument. + The filename will be prefixed by passing the filename along with + the prefix argument of `parse_options()` to `prefix_filename()`. + +`OPT_ARGUMENT(long, &int_var, description)`:: + Introduce a long-option argument that will be kept in `argv[]`. + If this option was seen, `int_var` will be set to one (except + if a `NULL` pointer was passed). + +`OPT_NUMBER_CALLBACK(&var, description, func_ptr)`:: + Recognize numerical options like -123 and feed the integer as + if it was an argument to the function given by `func_ptr`. + The result will be put into `var`. There can be only one such + option definition. It cannot be negated and it takes no + arguments. Short options that happen to be digits take + precedence over it. + +`OPT_COLOR_FLAG(short, long, &int_var, description)`:: + Introduce an option that takes an optional argument that can + have one of three values: "always", "never", or "auto". If the + argument is not given, it defaults to "always". The `--no-` form + works like `--long=never`; it cannot take an argument. If + "always", set `int_var` to 1; if "never", set `int_var` to 0; if + "auto", set `int_var` to 1 if stdout is a tty or a pager, + 0 otherwise. + +`OPT_NOOP_NOARG(short, long)`:: + Introduce an option that has no effect and takes no arguments. + Use it to hide deprecated options that are still to be recognized + and ignored silently. + +`OPT_PASSTHRU(short, long, &char_var, arg_str, description, flags)`:: + Introduce an option that will be reconstructed into a char* string, + which must be initialized to NULL. This is useful when you need to + pass the command-line option to another command. Any previous value + will be overwritten, so this should only be used for options where + the last one specified on the command line wins. + +`OPT_PASSTHRU_ARGV(short, long, &argv_array_var, arg_str, description, flags)`:: + Introduce an option where all instances of it on the command-line will + be reconstructed into an argv_array. This is useful when you need to + pass the command-line option, which can be specified multiple times, + to another command. + +`OPT_CMDMODE(short, long, &int_var, description, enum_val)`:: + Define an "operation mode" option, only one of which in the same + group of "operating mode" options that share the same `int_var` + can be given by the user. `enum_val` is set to `int_var` when the + option is used, but an error is reported if other "operating mode" + option has already set its value to the same `int_var`. + + +The last element of the array must be `OPT_END()`. + +If not stated otherwise, interpret the arguments as follows: + +* `short` is a character for the short option + (e.g. `'e'` for `-e`, use `0` to omit), + +* `long` is a string for the long option + (e.g. `"example"` for `--example`, use `NULL` to omit), + +* `int_var` is an integer variable, + +* `str_var` is a string variable (`char *`), + +* `arg_str` is the string that is shown as argument + (e.g. `"branch"` will result in `<branch>`). + If set to `NULL`, three dots (`...`) will be displayed. + +* `description` is a short string to describe the effect of the option. + It shall begin with a lower-case letter and a full stop (`.`) shall be + omitted at the end. + +Option Callbacks +---------------- + +The function must be defined in this form: + + int func(const struct option *opt, const char *arg, int unset) + +The callback mechanism is as follows: + +* Inside `func`, the only interesting member of the structure + given by `opt` is the void pointer `opt->value`. + `*opt->value` will be the value that is saved into `var`, if you + use `OPT_CALLBACK()`. + For example, do `*(unsigned long *)opt->value = 42;` to get 42 + into an `unsigned long` variable. + +* Return value `0` indicates success and non-zero return + value will invoke `usage_with_options()` and, thus, die. + +* If the user negates the option, `arg` is `NULL` and `unset` is 1. + +Sophisticated option parsing +---------------------------- + +If you need, for example, option callbacks with optional arguments +or without arguments at all, or if you need other special cases, +that are not handled by the macros above, you need to specify the +members of the `option` structure manually. + +This is not covered in this document, but well documented +in `parse-options.h` itself. + +Examples +-------- + +See `test-parse-options.c` and +`builtin/add.c`, +`builtin/clone.c`, +`builtin/commit.c`, +`builtin/fetch.c`, +`builtin/fsck.c`, +`builtin/rm.c` +for real-world examples. diff --git a/Documentation/technical/api-quote.txt b/Documentation/technical/api-quote.txt new file mode 100644 index 000000000000..e8a1bce94e05 --- /dev/null +++ b/Documentation/technical/api-quote.txt @@ -0,0 +1,10 @@ +quote API +========= + +Talk about <quote.h>, things like + +* sq_quote and unquote +* c_style quote and unquote +* quoting for foreign languages + +(JC) diff --git a/Documentation/technical/api-ref-iteration.txt b/Documentation/technical/api-ref-iteration.txt new file mode 100644 index 000000000000..ad9d019ff9b4 --- /dev/null +++ b/Documentation/technical/api-ref-iteration.txt @@ -0,0 +1,78 @@ +ref iteration API +================= + + +Iteration of refs is done by using an iterate function which will call a +callback function for every ref. The callback function has this +signature: + + int handle_one_ref(const char *refname, const struct object_id *oid, + int flags, void *cb_data); + +There are different kinds of iterate functions which all take a +callback of this type. The callback is then called for each found ref +until the callback returns nonzero. The returned value is then also +returned by the iterate function. + +Iteration functions +------------------- + +* `head_ref()` just iterates the head ref. + +* `for_each_ref()` iterates all refs. + +* `for_each_ref_in()` iterates all refs which have a defined prefix and + strips that prefix from the passed variable refname. + +* `for_each_tag_ref()`, `for_each_branch_ref()`, `for_each_remote_ref()`, + `for_each_replace_ref()` iterate refs from the respective area. + +* `for_each_glob_ref()` iterates all refs that match the specified glob + pattern. + +* `for_each_glob_ref_in()` the previous and `for_each_ref_in()` combined. + +* Use `refs_` API for accessing submodules. The submodule ref store could + be obtained with `get_submodule_ref_store()`. + +* `for_each_rawref()` can be used to learn about broken ref and symref. + +* `for_each_reflog()` iterates each reflog file. + +Submodules +---------- + +If you want to iterate the refs of a submodule you first need to add the +submodules object database. You can do this by a code-snippet like +this: + + const char *path = "path/to/submodule" + if (add_submodule_odb(path)) + die("Error submodule '%s' not populated.", path); + +`add_submodule_odb()` will return zero on success. If you +do not do this you will get an error for each ref that it does not point +to a valid object. + +Note: As a side-effect of this you cannot safely assume that all +objects you lookup are available in superproject. All submodule objects +will be available the same way as the superprojects objects. + +Example: +-------- + +---- +static int handle_remote_ref(const char *refname, + const unsigned char *sha1, int flags, void *cb_data) +{ + struct strbuf *output = cb_data; + strbuf_addf(output, "%s\n", refname); + return 0; +} + +... + + struct strbuf output = STRBUF_INIT; + for_each_remote_ref(handle_remote_ref, &output); + printf("%s", output.buf); +---- diff --git a/Documentation/technical/api-remote.txt b/Documentation/technical/api-remote.txt new file mode 100644 index 000000000000..f10941b2e816 --- /dev/null +++ b/Documentation/technical/api-remote.txt @@ -0,0 +1,127 @@ +Remotes configuration API +========================= + +The API in remote.h gives access to the configuration related to +remotes. It handles all three configuration mechanisms historically +and currently used by Git, and presents the information in a uniform +fashion. Note that the code also handles plain URLs without any +configuration, giving them just the default information. + +struct remote +------------- + +`name`:: + + The user's nickname for the remote + +`url`:: + + An array of all of the url_nr URLs configured for the remote + +`pushurl`:: + + An array of all of the pushurl_nr push URLs configured for the remote + +`push`:: + + An array of refspecs configured for pushing, with + push_refspec being the literal strings, and push_refspec_nr + being the quantity. + +`fetch`:: + + An array of refspecs configured for fetching, with + fetch_refspec being the literal strings, and fetch_refspec_nr + being the quantity. + +`fetch_tags`:: + + The setting for whether to fetch tags (as a separate rule from + the configured refspecs); -1 means never to fetch tags, 0 + means to auto-follow tags based on the default heuristic, 1 + means to always auto-follow tags, and 2 means to fetch all + tags. + +`receivepack`, `uploadpack`:: + + The configured helper programs to run on the remote side, for + Git-native protocols. + +`http_proxy`:: + + The proxy to use for curl (http, https, ftp, etc.) URLs. + +`http_proxy_authmethod`:: + + The method used for authenticating against `http_proxy`. + +struct remotes can be found by name with remote_get(), and iterated +through with for_each_remote(). remote_get(NULL) will return the +default remote, given the current branch and configuration. + +struct refspec +-------------- + +A struct refspec holds the parsed interpretation of a refspec. If it +will force updates (starts with a '+'), force is true. If it is a +pattern (sides end with '*') pattern is true. src and dest are the +two sides (including '*' characters if present); if there is only one +side, it is src, and dst is NULL; if sides exist but are empty (i.e., +the refspec either starts or ends with ':'), the corresponding side is +"". + +An array of strings can be parsed into an array of struct refspecs +using parse_fetch_refspec() or parse_push_refspec(). + +remote_find_tracking(), given a remote and a struct refspec with +either src or dst filled out, will fill out the other such that the +result is in the "fetch" specification for the remote (note that this +evaluates patterns and returns a single result). + +struct branch +------------- + +Note that this may end up moving to branch.h + +struct branch holds the configuration for a branch. It can be looked +up with branch_get(name) for "refs/heads/{name}", or with +branch_get(NULL) for HEAD. + +It contains: + +`name`:: + + The short name of the branch. + +`refname`:: + + The full path for the branch ref. + +`remote_name`:: + + The name of the remote listed in the configuration. + +`merge_name`:: + + An array of the "merge" lines in the configuration. + +`merge`:: + + An array of the struct refspecs used for the merge lines. That + is, merge[i]->dst is a local tracking ref which should be + merged into this branch by default. + +`merge_nr`:: + + The number of merge configurations + +branch_has_merge_config() returns true if the given branch has merge +configuration given. + +Other stuff +----------- + +There is other stuff in remote.h that is related, in general, to the +process of interacting with remotes. + +(Daniel Barkalow) diff --git a/Documentation/technical/api-revision-walking.txt b/Documentation/technical/api-revision-walking.txt new file mode 100644 index 000000000000..03f9ea6ac4ba --- /dev/null +++ b/Documentation/technical/api-revision-walking.txt @@ -0,0 +1,72 @@ +revision walking API +==================== + +The revision walking API offers functions to build a list of revisions +and then iterate over that list. + +Calling sequence +---------------- + +The walking API has a given calling sequence: first you need to +initialize a rev_info structure, then add revisions to control what kind +of revision list do you want to get, finally you can iterate over the +revision list. + +Functions +--------- + +`repo_init_revisions`:: + + Initialize a rev_info structure with default values. The third + parameter may be NULL or can be prefix path, and then the `.prefix` + variable will be set to it. This is typically the first function you + want to call when you want to deal with a revision list. After calling + this function, you are free to customize options, like set + `.ignore_merges` to 0 if you don't want to ignore merges, and so on. See + `revision.h` for a complete list of available options. + +`add_pending_object`:: + + This function can be used if you want to add commit objects as revision + information. You can use the `UNINTERESTING` object flag to indicate if + you want to include or exclude the given commit (and commits reachable + from the given commit) from the revision list. ++ +NOTE: If you have the commits as a string list then you probably want to +use setup_revisions(), instead of parsing each string and using this +function. + +`setup_revisions`:: + + Parse revision information, filling in the `rev_info` structure, and + removing the used arguments from the argument list. Returns the number + of arguments left that weren't recognized, which are also moved to the + head of the argument list. The last parameter is used in case no + parameter given by the first two arguments. + +`prepare_revision_walk`:: + + Prepares the rev_info structure for a walk. You should check if it + returns any error (non-zero return code) and if it does not, you can + start using get_revision() to do the iteration. + +`get_revision`:: + + Takes a pointer to a `rev_info` structure and iterates over it, + returning a `struct commit *` each time you call it. The end of the + revision list is indicated by returning a NULL pointer. + +`reset_revision_walk`:: + + Reset the flags used by the revision walking api. You can use + this to do multiple sequential revision walks. + +Data structures +--------------- + +Talk about <revision.h>, things like: + +* two diff_options, one for path limiting, another for output; +* remaining functions; + +(Linus, JC, Dscho) diff --git a/Documentation/technical/api-run-command.txt b/Documentation/technical/api-run-command.txt new file mode 100644 index 000000000000..8bf3e37f5375 --- /dev/null +++ b/Documentation/technical/api-run-command.txt @@ -0,0 +1,264 @@ +run-command API +=============== + +The run-command API offers a versatile tool to run sub-processes with +redirected input and output as well as with a modified environment +and an alternate current directory. + +A similar API offers the capability to run a function asynchronously, +which is primarily used to capture the output that the function +produces in the caller in order to process it. + + +Functions +--------- + +`child_process_init`:: + + Initialize a struct child_process variable. + +`start_command`:: + + Start a sub-process. Takes a pointer to a `struct child_process` + that specifies the details and returns pipe FDs (if requested). + See below for details. + +`finish_command`:: + + Wait for the completion of a sub-process that was started with + start_command(). + +`run_command`:: + + A convenience function that encapsulates a sequence of + start_command() followed by finish_command(). Takes a pointer + to a `struct child_process` that specifies the details. + +`run_command_v_opt`, `run_command_v_opt_cd_env`:: + + Convenience functions that encapsulate a sequence of + start_command() followed by finish_command(). The argument argv + specifies the program and its arguments. The argument opt is zero + or more of the flags `RUN_COMMAND_NO_STDIN`, `RUN_GIT_CMD`, + `RUN_COMMAND_STDOUT_TO_STDERR`, or `RUN_SILENT_EXEC_FAILURE` + that correspond to the members .no_stdin, .git_cmd, + .stdout_to_stderr, .silent_exec_failure of `struct child_process`. + The argument dir corresponds the member .dir. The argument env + corresponds to the member .env. + +`child_process_clear`:: + + Release the memory associated with the struct child_process. + Most users of the run-command API don't need to call this + function explicitly because `start_command` invokes it on + failure and `finish_command` calls it automatically already. + +The functions above do the following: + +. If a system call failed, errno is set and -1 is returned. A diagnostic + is printed. + +. If the program was not found, then -1 is returned and errno is set to + ENOENT; a diagnostic is printed only if .silent_exec_failure is 0. + +. Otherwise, the program is run. If it terminates regularly, its exit + code is returned. No diagnostic is printed, even if the exit code is + non-zero. + +. If the program terminated due to a signal, then the return value is the + signal number + 128, ie. the same value that a POSIX shell's $? would + report. A diagnostic is printed. + + +`start_async`:: + + Run a function asynchronously. Takes a pointer to a `struct + async` that specifies the details and returns a set of pipe FDs + for communication with the function. See below for details. + +`finish_async`:: + + Wait for the completion of an asynchronous function that was + started with start_async(). + +`run_hook`:: + + Run a hook. + The first argument is a pathname to an index file, or NULL + if the hook uses the default index file or no index is needed. + The second argument is the name of the hook. + The further arguments correspond to the hook arguments. + The last argument has to be NULL to terminate the arguments list. + If the hook does not exist or is not executable, the return + value will be zero. + If it is executable, the hook will be executed and the exit + status of the hook is returned. + On execution, .stdout_to_stderr and .no_stdin will be set. + (See below.) + + +Data structures +--------------- + +* `struct child_process` + +This describes the arguments, redirections, and environment of a +command to run in a sub-process. + +The caller: + +1. allocates and clears (using child_process_init() or + CHILD_PROCESS_INIT) a struct child_process variable; +2. initializes the members; +3. calls start_command(); +4. processes the data; +5. closes file descriptors (if necessary; see below); +6. calls finish_command(). + +The .argv member is set up as an array of string pointers (NULL +terminated), of which .argv[0] is the program name to run (usually +without a path). If the command to run is a git command, set argv[0] to +the command name without the 'git-' prefix and set .git_cmd = 1. + +Note that the ownership of the memory pointed to by .argv stays with the +caller, but it should survive until `finish_command` completes. If the +.argv member is NULL, `start_command` will point it at the .args +`argv_array` (so you may use one or the other, but you must use exactly +one). The memory in .args will be cleaned up automatically during +`finish_command` (or during `start_command` when it is unsuccessful). + +The members .in, .out, .err are used to redirect stdin, stdout, +stderr as follows: + +. Specify 0 to request no special redirection. No new file descriptor + is allocated. The child process simply inherits the channel from the + parent. + +. Specify -1 to have a pipe allocated; start_command() replaces -1 + by the pipe FD in the following way: + + .in: Returns the writable pipe end into which the caller writes; + the readable end of the pipe becomes the child's stdin. + + .out, .err: Returns the readable pipe end from which the caller + reads; the writable end of the pipe end becomes child's + stdout/stderr. + + The caller of start_command() must close the so returned FDs + after it has completed reading from/writing to it! + +. Specify a file descriptor > 0 to be used by the child: + + .in: The FD must be readable; it becomes child's stdin. + .out: The FD must be writable; it becomes child's stdout. + .err: The FD must be writable; it becomes child's stderr. + + The specified FD is closed by start_command(), even if it fails to + run the sub-process! + +. Special forms of redirection are available by setting these members + to 1: + + .no_stdin, .no_stdout, .no_stderr: The respective channel is + redirected to /dev/null. + + .stdout_to_stderr: stdout of the child is redirected to its + stderr. This happens after stderr is itself redirected. + So stdout will follow stderr to wherever it is + redirected. + +To modify the environment of the sub-process, specify an array of +string pointers (NULL terminated) in .env: + +. If the string is of the form "VAR=value", i.e. it contains '=' + the variable is added to the child process's environment. + +. If the string does not contain '=', it names an environment + variable that will be removed from the child process's environment. + +If the .env member is NULL, `start_command` will point it at the +.env_array `argv_array` (so you may use one or the other, but not both). +The memory in .env_array will be cleaned up automatically during +`finish_command` (or during `start_command` when it is unsuccessful). + +To specify a new initial working directory for the sub-process, +specify it in the .dir member. + +If the program cannot be found, the functions return -1 and set +errno to ENOENT. Normally, an error message is printed, but if +.silent_exec_failure is set to 1, no message is printed for this +special error condition. + + +* `struct async` + +This describes a function to run asynchronously, whose purpose is +to produce output that the caller reads. + +The caller: + +1. allocates and clears (memset(&asy, 0, sizeof(asy));) a + struct async variable; +2. initializes .proc and .data; +3. calls start_async(); +4. processes communicates with proc through .in and .out; +5. closes .in and .out; +6. calls finish_async(). + +The members .in, .out are used to provide a set of fd's for +communication between the caller and the callee as follows: + +. Specify 0 to have no file descriptor passed. The callee will + receive -1 in the corresponding argument. + +. Specify < 0 to have a pipe allocated; start_async() replaces + with the pipe FD in the following way: + + .in: Returns the writable pipe end into which the caller + writes; the readable end of the pipe becomes the function's + in argument. + + .out: Returns the readable pipe end from which the caller + reads; the writable end of the pipe becomes the function's + out argument. + + The caller of start_async() must close the returned FDs after it + has completed reading from/writing from them. + +. Specify a file descriptor > 0 to be used by the function: + + .in: The FD must be readable; it becomes the function's in. + .out: The FD must be writable; it becomes the function's out. + + The specified FD is closed by start_async(), even if it fails to + run the function. + +The function pointer in .proc has the following signature: + + int proc(int in, int out, void *data); + +. in, out specifies a set of file descriptors to which the function + must read/write the data that it needs/produces. The function + *must* close these descriptors before it returns. A descriptor + may be -1 if the caller did not configure a descriptor for that + direction. + +. data is the value that the caller has specified in the .data member + of struct async. + +. The return value of the function is 0 on success and non-zero + on failure. If the function indicates failure, finish_async() will + report failure as well. + + +There are serious restrictions on what the asynchronous function can do +because this facility is implemented by a thread in the same address +space on most platforms (when pthreads is available), but by a pipe to +a forked process otherwise: + +. It cannot change the program's state (global variables, environment, + etc.) in a way that the caller notices; in other words, .in and .out + are the only communication channels to the caller. + +. It must not change the program's state that the caller of the + facility also uses. diff --git a/Documentation/technical/api-setup.txt b/Documentation/technical/api-setup.txt new file mode 100644 index 000000000000..eb1fa9853ef6 --- /dev/null +++ b/Documentation/technical/api-setup.txt @@ -0,0 +1,47 @@ +setup API +========= + +Talk about + +* setup_git_directory() +* setup_git_directory_gently() +* is_inside_git_dir() +* is_inside_work_tree() +* setup_work_tree() + +(Dscho) + +Pathspec +-------- + +See glossary-context.txt for the syntax of pathspec. In memory, a +pathspec set is represented by "struct pathspec" and is prepared by +parse_pathspec(). This function takes several arguments: + +- magic_mask specifies what features that are NOT supported by the + following code. If a user attempts to use such a feature, + parse_pathspec() can reject it early. + +- flags specifies other things that the caller wants parse_pathspec to + perform. + +- prefix and args come from cmd_* functions + +parse_pathspec() helps catch unsupported features and reject them +politely. At a lower level, different pathspec-related functions may +not support the same set of features. Such pathspec-sensitive +functions are guarded with GUARD_PATHSPEC(), which will die in an +unfriendly way when an unsupported feature is requested. + +The command designers are supposed to make sure that GUARD_PATHSPEC() +never dies. They have to make sure all unsupported features are caught +by parse_pathspec(), not by GUARD_PATHSPEC. grepping GUARD_PATHSPEC() +should give the designers all pathspec-sensitive codepaths and what +features they support. + +A similar process is applied when a new pathspec magic is added. The +designer lifts the GUARD_PATHSPEC restriction in the functions that +support the new magic. At the same time (s)he has to make sure this +new feature will be caught at parse_pathspec() in commands that cannot +handle the new magic in some cases. grepping parse_pathspec() should +help. diff --git a/Documentation/technical/api-sigchain.txt b/Documentation/technical/api-sigchain.txt new file mode 100644 index 000000000000..9e1189ef01df --- /dev/null +++ b/Documentation/technical/api-sigchain.txt @@ -0,0 +1,41 @@ +sigchain API +============ + +Code often wants to set a signal handler to clean up temporary files or +other work-in-progress when we die unexpectedly. For multiple pieces of +code to do this without conflicting, each piece of code must remember +the old value of the handler and restore it either when: + + 1. The work-in-progress is finished, and the handler is no longer + necessary. The handler should revert to the original behavior + (either another handler, SIG_DFL, or SIG_IGN). + + 2. The signal is received. We should then do our cleanup, then chain + to the next handler (or die if it is SIG_DFL). + +Sigchain is a tiny library for keeping a stack of handlers. Your handler +and installation code should look something like: + +------------------------------------------ + void clean_foo_on_signal(int sig) + { + clean_foo(); + sigchain_pop(sig); + raise(sig); + } + + void other_func() + { + sigchain_push_common(clean_foo_on_signal); + mess_up_foo(); + clean_foo(); + } +------------------------------------------ + +Handlers are given the typedef of sigchain_fun. This is the same type +that is given to signal() or sigaction(). It is perfectly reasonable to +push SIG_DFL or SIG_IGN onto the stack. + +You can sigchain_push and sigchain_pop individual signals. For +convenience, sigchain_push_common will push the handler onto the stack +for many common signals. diff --git a/Documentation/technical/api-submodule-config.txt b/Documentation/technical/api-submodule-config.txt new file mode 100644 index 000000000000..fb060893931f --- /dev/null +++ b/Documentation/technical/api-submodule-config.txt @@ -0,0 +1,66 @@ +submodule config cache API +========================== + +The submodule config cache API allows to read submodule +configurations/information from specified revisions. Internally +information is lazily read into a cache that is used to avoid +unnecessary parsing of the same .gitmodules files. Lookups can be done by +submodule path or name. + +Usage +----- + +To initialize the cache with configurations from the worktree the caller +typically first calls `gitmodules_config()` to read values from the +worktree .gitmodules and then to overlay the local git config values +`parse_submodule_config_option()` from the config parsing +infrastructure. + +The caller can look up information about submodules by using the +`submodule_from_path()` or `submodule_from_name()` functions. They return +a `struct submodule` which contains the values. The API automatically +initializes and allocates the needed infrastructure on-demand. If the +caller does only want to lookup values from revisions the initialization +can be skipped. + +If the internal cache might grow too big or when the caller is done with +the API, all internally cached values can be freed with submodule_free(). + +Data Structures +--------------- + +`struct submodule`:: + + This structure is used to return the information about one + submodule for a certain revision. It is returned by the lookup + functions. + +Functions +--------- + +`void submodule_free(struct repository *r)`:: + + Use these to free the internally cached values. + +`int parse_submodule_config_option(const char *var, const char *value)`:: + + Can be passed to the config parsing infrastructure to parse + local (worktree) submodule configurations. + +`const struct submodule *submodule_from_path(const unsigned char *treeish_name, const char *path)`:: + + Given a tree-ish in the superproject and a path, return the + submodule that is bound at the path in the named tree. + +`const struct submodule *submodule_from_name(const unsigned char *treeish_name, const char *name)`:: + + The same as above but lookup by name. + +Whenever a submodule configuration is parsed in `parse_submodule_config_option` +via e.g. `gitmodules_config()`, it will overwrite the null_sha1 entry. +So in the normal case, when HEAD:.gitmodules is parsed first and then overlayed +with the repository configuration, the null_sha1 entry contains the local +configuration of a submodule (e.g. consolidated values from local git +configuration and the .gitmodules file in the worktree). + +For an example usage see test-submodule-config.c. diff --git a/Documentation/technical/api-trace.txt b/Documentation/technical/api-trace.txt new file mode 100644 index 000000000000..fadb5979c48b --- /dev/null +++ b/Documentation/technical/api-trace.txt @@ -0,0 +1,140 @@ +trace API +========= + +The trace API can be used to print debug messages to stderr or a file. Trace +code is inactive unless explicitly enabled by setting `GIT_TRACE*` environment +variables. + +The trace implementation automatically adds `timestamp file:line ... \n` to +all trace messages. E.g.: + +------------ +23:59:59.123456 git.c:312 trace: built-in: git 'foo' +00:00:00.000001 builtin/foo.c:99 foo: some message +------------ + +Data Structures +--------------- + +`struct trace_key`:: + + Defines a trace key (or category). The default (for API functions that + don't take a key) is `GIT_TRACE`. ++ +E.g. to define a trace key controlled by environment variable `GIT_TRACE_FOO`: ++ +------------ +static struct trace_key trace_foo = TRACE_KEY_INIT(FOO); + +static void trace_print_foo(const char *message) +{ + trace_printf_key(&trace_foo, "%s", message); +} +------------ ++ +Note: don't use `const` as the trace implementation stores internal state in +the `trace_key` structure. + +Functions +--------- + +`int trace_want(struct trace_key *key)`:: + + Checks whether the trace key is enabled. Used to prevent expensive + string formatting before calling one of the printing APIs. + +`void trace_disable(struct trace_key *key)`:: + + Disables tracing for the specified key, even if the environment + variable was set. + +`void trace_printf(const char *format, ...)`:: +`void trace_printf_key(struct trace_key *key, const char *format, ...)`:: + + Prints a formatted message, similar to printf. + +`void trace_argv_printf(const char **argv, const char *format, ...)``:: + + Prints a formatted message, followed by a quoted list of arguments. + +`void trace_strbuf(struct trace_key *key, const struct strbuf *data)`:: + + Prints the strbuf, without additional formatting (i.e. doesn't + choke on `%` or even `\0`). + +`uint64_t getnanotime(void)`:: + + Returns nanoseconds since the epoch (01/01/1970), typically used + for performance measurements. ++ +Currently there are high precision timer implementations for Linux (using +`clock_gettime(CLOCK_MONOTONIC)`) and Windows (`QueryPerformanceCounter`). +Other platforms use `gettimeofday` as time source. + +`void trace_performance(uint64_t nanos, const char *format, ...)`:: +`void trace_performance_since(uint64_t start, const char *format, ...)`:: + + Prints the elapsed time (in nanoseconds), or elapsed time since + `start`, followed by a formatted message. Enabled via environment + variable `GIT_TRACE_PERFORMANCE`. Used for manual profiling, e.g.: ++ +------------ +uint64_t start = getnanotime(); +/* code section to measure */ +trace_performance_since(start, "foobar"); +------------ ++ +------------ +uint64_t t = 0; +for (;;) { + /* ignore */ + t -= getnanotime(); + /* code section to measure */ + t += getnanotime(); + /* ignore */ +} +trace_performance(t, "frotz"); +------------ + +Bugs & Caveats +-------------- + +GIT_TRACE_* environment variables can be used to tell Git to show +trace output to its standard error stream. Git can often spawn a pager +internally to run its subcommand and send its standard output and +standard error to it. + +Because GIT_TRACE_PERFORMANCE trace is generated only at the very end +of the program with atexit(), which happens after the pager exits, it +would not work well if you send its log to the standard error output +and let Git spawn the pager at the same time. + +As a work around, you can for example use '--no-pager', or set +GIT_TRACE_PERFORMANCE to another file descriptor which is redirected +to stderr, or set GIT_TRACE_PERFORMANCE to a file specified by its +absolute path. + +For example instead of the following command which by default may not +print any performance information: + +------------ +GIT_TRACE_PERFORMANCE=2 git log -1 +------------ + +you may want to use: + +------------ +GIT_TRACE_PERFORMANCE=2 git --no-pager log -1 +------------ + +or: + +------------ +GIT_TRACE_PERFORMANCE=3 3>&2 git log -1 +------------ + +or: + +------------ +GIT_TRACE_PERFORMANCE=/path/to/log/file git log -1 +------------ diff --git a/Documentation/technical/api-trace2.txt b/Documentation/technical/api-trace2.txt new file mode 100644 index 000000000000..71eb081fed25 --- /dev/null +++ b/Documentation/technical/api-trace2.txt @@ -0,0 +1,1378 @@ += Trace2 API + +The Trace2 API can be used to print debug, performance, and telemetry +information to stderr or a file. The Trace2 feature is inactive unless +explicitly enabled by enabling one or more Trace2 Targets. + +The Trace2 API is intended to replace the existing (Trace1) +printf-style tracing provided by the existing `GIT_TRACE` and +`GIT_TRACE_PERFORMANCE` facilities. During initial implementation, +Trace2 and Trace1 may operate in parallel. + +The Trace2 API defines a set of high-level messages with known fields, +such as (`start`: `argv`) and (`exit`: {`exit-code`, `elapsed-time`}). + +Trace2 instrumentation throughout the Git code base sends Trace2 +messages to the enabled Trace2 Targets. Targets transform these +messages content into purpose-specific formats and write events to +their data streams. In this manner, the Trace2 API can drive +many different types of analysis. + +Targets are defined using a VTable allowing easy extension to other +formats in the future. This might be used to define a binary format, +for example. + +Trace2 is controlled using `trace2.*` config values in the system and +global config files and `GIT_TRACE2*` environment variables. Trace2 does +not read from repo local or worktree config files or respect `-c` +command line config settings. + +== Trace2 Targets + +Trace2 defines the following set of Trace2 Targets. +Format details are given in a later section. + +=== The Normal Format Target + +The normal format target is a tradition printf format and similar +to GIT_TRACE format. This format is enabled with the `GIT_TRACE2` +environment variable or the `trace2.normalTarget` system or global +config setting. + +For example + +------------ +$ export GIT_TRACE2=~/log.normal +$ git version +git version 2.20.1.155.g426c96fcdb +------------ + +or + +------------ +$ git config --global trace2.normalTarget ~/log.normal +$ git version +git version 2.20.1.155.g426c96fcdb +------------ + +yields + +------------ +$ cat ~/log.normal +12:28:42.620009 common-main.c:38 version 2.20.1.155.g426c96fcdb +12:28:42.620989 common-main.c:39 start git version +12:28:42.621101 git.c:432 cmd_name version (version) +12:28:42.621215 git.c:662 exit elapsed:0.001227 code:0 +12:28:42.621250 trace2/tr2_tgt_normal.c:124 atexit elapsed:0.001265 code:0 +------------ + +=== The Performance Format Target + +The performance format target (PERF) is a column-based format to +replace GIT_TRACE_PERFORMANCE and is suitable for development and +testing, possibly to complement tools like gprof. This format is +enabled with the `GIT_TRACE2_PERF` environment variable or the +`trace2.perfTarget` system or global config setting. + +For example + +------------ +$ export GIT_TRACE2_PERF=~/log.perf +$ git version +git version 2.20.1.155.g426c96fcdb +------------ + +or + +------------ +$ git config --global trace2.perfTarget ~/log.perf +$ git version +git version 2.20.1.155.g426c96fcdb +------------ + +yields + +------------ +$ cat ~/log.perf +12:28:42.620675 common-main.c:38 | d0 | main | version | | | | | 2.20.1.155.g426c96fcdb +12:28:42.621001 common-main.c:39 | d0 | main | start | | 0.001173 | | | git version +12:28:42.621111 git.c:432 | d0 | main | cmd_name | | | | | version (version) +12:28:42.621225 git.c:662 | d0 | main | exit | | 0.001227 | | | code:0 +12:28:42.621259 trace2/tr2_tgt_perf.c:211 | d0 | main | atexit | | 0.001265 | | | code:0 +------------ + +=== The Event Format Target + +The event format target is a JSON-based format of event data suitable +for telemetry analysis. This format is enabled with the `GIT_TRACE2_EVENT` +environment variable or the `trace2.eventTarget` system or global config +setting. + +For example + +------------ +$ export GIT_TRACE2_EVENT=~/log.event +$ git version +git version 2.20.1.155.g426c96fcdb +------------ + +or + +------------ +$ git config --global trace2.eventTarget ~/log.event +$ git version +git version 2.20.1.155.g426c96fcdb +------------ + +yields + +------------ +$ cat ~/log.event +{"event":"version","sid":"sid":"20190408T191610.507018Z-H9b68c35f-P000059a8","thread":"main","time":"2019-01-16T17:28:42.620713Z","file":"common-main.c","line":38,"evt":"1","exe":"2.20.1.155.g426c96fcdb"} +{"event":"start","sid":"20190408T191610.507018Z-H9b68c35f-P000059a8","thread":"main","time":"2019-01-16T17:28:42.621027Z","file":"common-main.c","line":39,"t_abs":0.001173,"argv":["git","version"]} +{"event":"cmd_name","sid":"20190408T191610.507018Z-H9b68c35f-P000059a8","thread":"main","time":"2019-01-16T17:28:42.621122Z","file":"git.c","line":432,"name":"version","hierarchy":"version"} +{"event":"exit","sid":"20190408T191610.507018Z-H9b68c35f-P000059a8","thread":"main","time":"2019-01-16T17:28:42.621236Z","file":"git.c","line":662,"t_abs":0.001227,"code":0} +{"event":"atexit","sid":"20190408T191610.507018Z-H9b68c35f-P000059a8","thread":"main","time":"2019-01-16T17:28:42.621268Z","file":"trace2/tr2_tgt_event.c","line":163,"t_abs":0.001265,"code":0} +------------ + +=== Enabling a Target + +To enable a target, set the corresponding environment variable or +system or global config value to one of the following: + +include::../trace2-target-values.txt[] + +If the target already exists and is a directory, the traces will be +written to files (one per process) underneath the given directory. They +will be named according to the last component of the SID (optionally +followed by a counter to avoid filename collisions). + +== Trace2 API + +All public Trace2 functions and macros are defined in `trace2.h` and +`trace2.c`. All public symbols are prefixed with `trace2_`. + +There are no public Trace2 data structures. + +The Trace2 code also defines a set of private functions and data types +in the `trace2/` directory. These symbols are prefixed with `tr2_` +and should only be used by functions in `trace2.c`. + +== Conventions for Public Functions and Macros + +The functions defined by the Trace2 API are declared and documented +in `trace2.h`. It defines the API functions and wrapper macros for +Trace2. + +Some functions have a `_fl()` suffix to indicate that they take `file` +and `line-number` arguments. + +Some functions have a `_va_fl()` suffix to indicate that they also +take a `va_list` argument. + +Some functions have a `_printf_fl()` suffix to indicate that they also +take a varargs argument. + +There are CPP wrapper macros and ifdefs to hide most of these details. +See `trace2.h` for more details. The following discussion will only +describe the simplified forms. + +== Public API + +All Trace2 API functions send a messsage to all of the active +Trace2 Targets. This section describes the set of available +messages. + +It helps to divide these functions into groups for discussion +purposes. + +=== Basic Command Messages + +These are concerned with the lifetime of the overall git process. + +`void trace2_initialize_clock()`:: + + Initialize the Trace2 start clock and nothing else. This should + be called at the very top of main() to capture the process start + time and reduce startup order dependencies. + +`void trace2_initialize()`:: + + Determines if any Trace2 Targets should be enabled and + initializes the Trace2 facility. This includes setting up the + Trace2 thread local storage (TLS). ++ +This function emits a "version" message containing the version of git +and the Trace2 protocol. ++ +This function should be called from `main()` as early as possible in +the life of the process after essential process initialization. + +`int trace2_is_enabled()`:: + + Returns 1 if Trace2 is enabled (at least one target is + active). + +`void trace2_cmd_start(int argc, const char **argv)`:: + + Emits a "start" message containing the process command line + arguments. + +`int trace2_cmd_exit(int exit_code)`:: + + Emits an "exit" message containing the process exit-code and + elapsed time. ++ +Returns the exit-code. + +`void trace2_cmd_error(const char *fmt, va_list ap)`:: + + Emits an "error" message containing a formatted error message. + +`void trace2_cmd_path(const char *pathname)`:: + + Emits a "cmd_path" message with the full pathname of the + current process. + +=== Command Detail Messages + +These are concerned with describing the specific Git command +after the command line, config, and environment are inspected. + +`void trace2_cmd_name(const char *name)`:: + + Emits a "cmd_name" message with the canonical name of the + command, for example "status" or "checkout". + +`void trace2_cmd_mode(const char *mode)`:: + + Emits a "cmd_mode" message with a qualifier name to further + describe the current git command. ++ +This message is intended to be used with git commands having multiple +major modes. For example, a "checkout" command can checkout a new +branch or it can checkout a single file, so the checkout code could +emit a cmd_mode message of "branch" or "file". + +`void trace2_cmd_alias(const char *alias, const char **argv_expansion)`:: + + Emits an "alias" message containing the alias used and the + argument expansion. + +`void trace2_def_param(const char *parameter, const char *value)`:: + + Emits a "def_param" message containing a key/value pair. ++ +This message is intended to report some global aspect of the current +command, such as a configuration setting or command line switch that +significantly affects program performance or behavior, such as +`core.abbrev`, `status.showUntrackedFiles`, or `--no-ahead-behind`. + +`void trace2_cmd_list_config()`:: + + Emits a "def_param" messages for "important" configuration + settings. ++ +The environment variable `GIT_TRACE2_CONFIG_PARAMS` or the `trace2.configParams` +config value can be set to a +list of patterns of important configuration settings, for example: +`core.*,remote.*.url`. This function will iterate over all config +settings and emit a "def_param" message for each match. + +`void trace2_cmd_set_config(const char *key, const char *value)`:: + + Emits a "def_param" message for a new or updated key/value + pair IF `key` is considered important. ++ +This is used to hook into `git_config_set()` and catch any +configuration changes and update a value previously reported by +`trace2_cmd_list_config()`. + +`void trace2_def_repo(struct repository *repo)`:: + + Registers a repository with the Trace2 layer. Assigns a + unique "repo-id" to `repo->trace2_repo_id`. ++ +Emits a "worktree" messages containing the repo-id and the worktree +pathname. ++ +Region and data messages (described later) may refer to this repo-id. ++ +The main/top-level repository will have repo-id value 1 (aka "r1"). ++ +The repo-id field is in anticipation of future in-proc submodule +repositories. + +=== Child Process Messages + +These are concerned with the various spawned child processes, +including shell scripts, git commands, editors, pagers, and hooks. + +`void trace2_child_start(struct child_process *cmd)`:: + + Emits a "child_start" message containing the "child-id", + "child-argv", and "child-classification". ++ +Before calling this, set `cmd->trace2_child_class` to a name +describing the type of child process, for example "editor". ++ +This function assigns a unique "child-id" to `cmd->trace2_child_id`. +This field is used later during the "child_exit" message to associate +it with the "child_start" message. ++ +This function should be called before spawning the child process. + +`void trace2_child_exit(struct child_proess *cmd, int child_exit_code)`:: + + Emits a "child_exit" message containing the "child-id", + the child's elapsed time and exit-code. ++ +The reported elapsed time includes the process creation overhead and +time spend waiting for it to exit, so it may be slightly longer than +the time reported by the child itself. ++ +This function should be called after reaping the child process. + +`int trace2_exec(const char *exe, const char **argv)`:: + + Emits a "exec" message containing the "exec-id" and the + argv of the new process. ++ +This function should be called before calling one of the `exec()` +variants, such as `execvp()`. ++ +This function returns a unique "exec-id". This value is used later +if the exec() fails and a "exec-result" message is necessary. + +`void trace2_exec_result(int exec_id, int error_code)`:: + + Emits a "exec_result" message containing the "exec-id" + and the error code. ++ +On Unix-based systems, `exec()` does not return if successful. +This message is used to indicate that the `exec()` failed and +that the current program is continuing. + +=== Git Thread Messages + +These messages are concerned with Git thread usage. + +`void trace2_thread_start(const char *thread_name)`:: + + Emits a "thread_start" message. ++ +The `thread_name` field should be a descriptive name, such as the +unique name of the thread-proc. A unique "thread-id" will be added +to the name to uniquely identify thread instances. ++ +Region and data messages (described later) may refer to this thread +name. ++ +This function must be called by the thread-proc of the new thread +(so that TLS data is properly initialized) and not by the caller +of `pthread_create()`. + +`void trace2_thread_exit()`:: + + Emits a "thread_exit" message containing the thread name + and the thread elapsed time. ++ +This function must be called by the thread-proc before it returns +(so that the coorect TLS data is used and cleaned up. It should +not be called by the caller of `pthread_join()`. + +=== Region and Data Messages + +These are concerned with recording performance data +over regions or spans of code. + +`void trace2_region_enter(const char *category, const char *label, const struct repository *repo)`:: + +`void trace2_region_enter_printf(const char *category, const char *label, const struct repository *repo, const char *fmt, ...)`:: + +`void trace2_region_enter_printf_va(const char *category, const char *label, const struct repository *repo, const char *fmt, va_list ap)`:: + + Emits a thread-relative "region_enter" message with optional + printf string. ++ +This function pushes a new region nesting stack level on the current +thread and starts a clock for the new stack frame. ++ +The `category` field is an arbitrary category name used to classify +regions by feature area, such as "status" or "index". At this time +it is only just printed along with the rest of the message. It may +be used in the future to filter messages. ++ +The `label` field is an arbitrary label used to describe the activity +being started, such as "read_recursive" or "do_read_index". ++ +The `repo` field, if set, will be used to get the "repo-id", so that +recursive oerations can be attributed to the correct repository. + +`void trace2_region_leave(const char *category, const char *label, const struct repository *repo)`:: + +`void trace2_region_leave_printf(const char *category, const char *label, const struct repository *repo, const char *fmt, ...)`:: + +`void trace2_region_leave_printf_va(const char *category, const char *label, const struct repository *repo, const char *fmt, va_list ap)`:: + + Emits a thread-relative "region_leave" message with optional + printf string. ++ +This function pops the region nesting stack on the current thread +and reports the elapsed time of the stack frame. ++ +The `category`, `label`, and `repo` fields are the same as above. +The `category` and `label` do not need to match the correpsonding +"region_enter" message, but it makes the data stream easier to +understand. + +`void trace2_data_string(const char *category, const struct repository *repo, const char *key, const char * value)`:: + +`void trace2_data_intmax(const char *category, const struct repository *repo, const char *key, intmax value)`:: + +`void trace2_data_json(const char *category, const struct repository *repo, const char *key, const struct json_writer *jw)`:: + + Emits a region- and thread-relative "data" or "data_json" message. ++ +This is a key/value pair message containing information about the +current thread, region stack, and repository. This could be used +to print the number of files in a directory during a multi-threaded +recursive tree walk. + +`void trace2_printf(const char *fmt, ...)`:: + +`void trace2_printf_va(const char *fmt, va_list ap)`:: + + Emits a region- and thread-relative "printf" message. + +== Trace2 Target Formats + +=== NORMAL Format + +Events are written as lines of the form: + +------------ +[<time> SP <filename>:<line> SP+] <event-name> [[SP] <event-message>] LF +------------ + +`<event-name>`:: + + is the event name. + +`<event-message>`:: + is a free-form printf message intended for human consumption. ++ +Note that this may contain embedded LF or CRLF characters that are +not escaped, so the event may spill across multiple lines. + +If `GIT_TRACE2_BRIEF` or `trace2.normalBrief` is true, the `time`, `filename`, +and `line` fields are omitted. + +This target is intended to be more of a summary (like GIT_TRACE) and +less detailed than the other targets. It ignores thread, region, and +data messages, for example. + +=== PERF Format + +Events are written as lines of the form: + +------------ +[<time> SP <filename>:<line> SP+ + BAR SP] d<depth> SP + BAR SP <thread-name> SP+ + BAR SP <event-name> SP+ + BAR SP [r<repo-id>] SP+ + BAR SP [<t_abs>] SP+ + BAR SP [<t_rel>] SP+ + BAR SP [<category>] SP+ + BAR SP DOTS* <perf-event-message> + LF +------------ + +`<depth>`:: + is the git process depth. This is the number of parent + git processes. A top-level git command has depth value "d0". + A child of it has depth value "d1". A second level child + has depth value "d2" and so on. + +`<thread-name>`:: + is a unique name for the thread. The primary thread + is called "main". Other thread names are of the form "th%d:%s" + and include a unique number and the name of the thread-proc. + +`<event-name>`:: + is the event name. + +`<repo-id>`:: + when present, is a number indicating the repository + in use. A `def_repo` event is emitted when a repository is + opened. This defines the repo-id and associated worktree. + Subsequent repo-specific events will reference this repo-id. ++ +Currently, this is always "r1" for the main repository. +This field is in anticipation of in-proc submodules in the future. + +`<t_abs>`:: + when present, is the absolute time in seconds since the + program started. + +`<t_rel>`:: + when present, is time in seconds relative to the start of + the current region. For a thread-exit event, it is the elapsed + time of the thread. + +`<category>`:: + is present on region and data events and is used to + indicate a broad category, such as "index" or "status". + +`<perf-event-message>`:: + is a free-form printf message intended for human consumption. + +------------ +15:33:33.532712 wt-status.c:2310 | d0 | main | region_enter | r1 | 0.126064 | | status | label:print +15:33:33.532712 wt-status.c:2331 | d0 | main | region_leave | r1 | 0.127568 | 0.001504 | status | label:print +------------ + +If `GIT_TRACE2_PERF_BRIEF` or `trace2.perfBrief` is true, the `time`, `file`, +and `line` fields are omitted. + +------------ +d0 | main | region_leave | r1 | 0.011717 | 0.009122 | index | label:preload +------------ + +The PERF target is intended for interactive performance analysis +during development and is quite noisy. + +=== EVENT Format + +Each event is a JSON-object containing multiple key/value pairs +written as a single line and followed by a LF. + +------------ +'{' <key> ':' <value> [',' <key> ':' <value>]* '}' LF +------------ + +Some key/value pairs are common to all events and some are +event-specific. + +==== Common Key/Value Pairs + +The following key/value pairs are common to all events: + +------------ +{ + "event":"version", + "sid":"20190408T191827.272759Z-H9b68c35f-P00003510", + "thread":"main", + "time":"2019-04-08T19:18:27.282761Z", + "file":"common-main.c", + "line":42, + ... +} +------------ + +`"event":<event>`:: + is the event name. + +`"sid":<sid>`:: + is the session-id. This is a unique string to identify the + process instance to allow all events emitted by a process to + be identified. A session-id is used instead of a PID because + PIDs are recycled by the OS. For child git processes, the + session-id is prepended with the session-id of the parent git + process to allow parent-child relationships to be identified + during post-processing. + +`"thread":<thread>`:: + is the thread name. + +`"time":<time>`:: + is the UTC time of the event. + +`"file":<filename>`:: + is source file generating the event. + +`"line":<line-number>`:: + is the integer source line number generating the event. + +`"repo":<repo-id>`:: + when present, is the integer repo-id as described previously. + +If `GIT_TRACE2_EVENT_BRIEF` or `trace2.eventBrief` is true, the `file` +and `line` fields are omitted from all events and the `time` field is +only present on the "start" and "atexit" events. + +==== Event-Specific Key/Value Pairs + +`"version"`:: + This event gives the version of the executable and the EVENT format. ++ +------------ +{ + "event":"version", + ... + "evt":"1", # EVENT format version + "exe":"2.20.1.155.g426c96fcdb" # git version +} +------------ + +`"start"`:: + This event contains the complete argv received by main(). ++ +------------ +{ + "event":"start", + ... + "t_abs":0.001227, # elapsed time in seconds + "argv":["git","version"] +} +------------ + +`"exit"`:: + This event is emitted when git calls `exit()`. ++ +------------ +{ + "event":"exit", + ... + "t_abs":0.001227, # elapsed time in seconds + "code":0 # exit code +} +------------ + +`"atexit"`:: + This event is emitted by the Trace2 `atexit` routine during + final shutdown. It should be the last event emitted by the + process. ++ +(The elapsed time reported here is greater than the time reported in +the "exit" event because it runs after all other atexit tasks have +completed.) ++ +------------ +{ + "event":"atexit", + ... + "t_abs":0.001227, # elapsed time in seconds + "code":0 # exit code +} +------------ + +`"signal"`:: + This event is emitted when the program is terminated by a user + signal. Depending on the platform, the signal event may + prevent the "atexit" event from being generated. ++ +------------ +{ + "event":"signal", + ... + "t_abs":0.001227, # elapsed time in seconds + "signo":13 # SIGTERM, SIGINT, etc. +} +------------ + +`"error"`:: + This event is emitted when one of the `error()`, `die()`, + or `usage()` functions are called. ++ +------------ +{ + "event":"error", + ... + "msg":"invalid option: --cahced", # formatted error message + "fmt":"invalid option: %s" # error format string +} +------------ ++ +The error event may be emitted more than once. The format string +allows post-processors to group errors by type without worrying +about specific error arguments. + +`"cmd_path"`:: + This event contains the discovered full path of the git + executable (on platforms that are configured to resolve it). ++ +------------ +{ + "event":"cmd_path", + ... + "path":"C:/work/gfw/git.exe" +} +------------ + +`"cmd_name"`:: + This event contains the command name for this git process + and the hierarchy of commands from parent git processes. ++ +------------ +{ + "event":"cmd_name", + ... + "name":"pack-objects", + "hierarchy":"push/pack-objects" +} +------------ ++ +Normally, the "name" field contains the canonical name of the +command. When a canonical name is not available, one of +these special values are used: ++ +------------ +"_query_" # "git --html-path" +"_run_dashed_" # when "git foo" tries to run "git-foo" +"_run_shell_alias_" # alias expansion to a shell command +"_run_git_alias_" # alias expansion to a git command +"_usage_" # usage error +------------ + +`"cmd_mode"`:: + This event, when present, describes the command variant This + event may be emitted more than once. ++ +------------ +{ + "event":"cmd_mode", + ... + "name":"branch" +} +------------ ++ +The "name" field is an arbitrary string to describe the command mode. +For example, checkout can checkout a branch or an individual file. +And these variations typically have different performance +characteristics that are not comparable. + +`"alias"`:: + This event is present when an alias is expanded. ++ +------------ +{ + "event":"alias", + ... + "alias":"l", # registered alias + "argv":["log","--graph"] # alias expansion +} +------------ + +`"child_start"`:: + This event describes a child process that is about to be + spawned. ++ +------------ +{ + "event":"child_start", + ... + "child_id":2, + "child_class":"?", + "use_shell":false, + "argv":["git","rev-list","--objects","--stdin","--not","--all","--quiet"] + + "hook_name":"<hook_name>" # present when child_class is "hook" + "cd":"<path>" # present when cd is required +} +------------ ++ +The "child_id" field can be used to match this child_start with the +corresponding child_exit event. ++ +The "child_class" field is a rough classification, such as "editor", +"pager", "transport/*", and "hook". Unclassified children are classified +with "?". + +`"child_exit"`:: + This event is generated after the current process has returned + from the waitpid() and collected the exit information from the + child. ++ +------------ +{ + "event":"child_exit", + ... + "child_id":2, + "pid":14708, # child PID + "code":0, # child exit-code + "t_rel":0.110605 # observed run-time of child process +} +------------ ++ +Note that the session-id of the child process is not available to +the current/spawning process, so the child's PID is reported here as +a hint for post-processing. (But it is only a hint because the child +proces may be a shell script which doesn't have a session-id.) ++ +Note that the `t_rel` field contains the observed run time in seconds +for the child process (starting before the fork/exec/spawn and +stopping after the waitpid() and includes OS process creation overhead). +So this time will be slightly larger than the atexit time reported by +the child process itself. + +`"exec"`:: + This event is generated before git attempts to `exec()` + another command rather than starting a child process. ++ +------------ +{ + "event":"exec", + ... + "exec_id":0, + "exe":"git", + "argv":["foo", "bar"] +} +------------ ++ +The "exec_id" field is a command-unique id and is only useful if the +`exec()` fails and a corresponding exec_result event is generated. + +`"exec_result"`:: + This event is generated if the `exec()` fails and control + returns to the current git command. ++ +------------ +{ + "event":"exec_result", + ... + "exec_id":0, + "code":1 # error code (errno) from exec() +} +------------ + +`"thread_start"`:: + This event is generated when a thread is started. It is + generated from *within* the new thread's thread-proc (for TLS + reasons). ++ +------------ +{ + "event":"thread_start", + ... + "thread":"th02:preload_thread" # thread name +} +------------ + +`"thread_exit"`:: + This event is generated when a thread exits. It is generated + from *within* the thread's thread-proc (for TLS reasons). ++ +------------ +{ + "event":"thread_exit", + ... + "thread":"th02:preload_thread", # thread name + "t_rel":0.007328 # thread elapsed time +} +------------ + +`"def_param"`:: + This event is generated to log a global parameter. ++ +------------ +{ + "event":"def_param", + ... + "param":"core.abbrev", + "value":"7" +} +------------ + +`"def_repo"`:: + This event defines a repo-id and associates it with the root + of the worktree. ++ +------------ +{ + "event":"def_repo", + ... + "repo":1, + "worktree":"/Users/jeffhost/work/gfw" +} +------------ ++ +As stated earlier, the repo-id is currently always 1, so there will +only be one def_repo event. Later, if in-proc submodules are +supported, a def_repo event should be emitted for each submodule +visited. + +`"region_enter"`:: + This event is generated when entering a region. ++ +------------ +{ + "event":"region_enter", + ... + "repo":1, # optional + "nesting":1, # current region stack depth + "category":"index", # optional + "label":"do_read_index", # optional + "msg":".git/index" # optional +} +------------ ++ +The `category` field may be used in a future enhancement to +do category-based filtering. ++ +`GIT_TRACE2_EVENT_NESTING` or `trace2.eventNesting` can be used to +filter deeply nested regions and data events. It defaults to "2". + +`"region_leave"`:: + This event is generated when leaving a region. ++ +------------ +{ + "event":"region_leave", + ... + "repo":1, # optional + "t_rel":0.002876, # time spent in region in seconds + "nesting":1, # region stack depth + "category":"index", # optional + "label":"do_read_index", # optional + "msg":".git/index" # optional +} +------------ + +`"data"`:: + This event is generated to log a thread- and region-local + key/value pair. ++ +------------ +{ + "event":"data", + ... + "repo":1, # optional + "t_abs":0.024107, # absolute elapsed time + "t_rel":0.001031, # elapsed time in region/thread + "nesting":2, # region stack depth + "category":"index", + "key":"read/cache_nr", + "value":"3552" +} +------------ ++ +The "value" field may be an integer or a string. + +`"data-json"`:: + This event is generated to log a pre-formatted JSON string + containing structured data. ++ +------------ +{ + "event":"data_json", + ... + "repo":1, # optional + "t_abs":0.015905, + "t_rel":0.015905, + "nesting":1, + "category":"process", + "key":"windows/ancestry", + "value":["bash.exe","bash.exe"] +} +------------ + +== Example Trace2 API Usage + +Here is a hypothetical usage of the Trace2 API showing the intended +usage (without worrying about the actual Git details). + +Initialization:: + + Initialization happens in `main()`. Behind the scenes, an + `atexit` and `signal` handler are registered. ++ +---------------- +int main(int argc, const char **argv) +{ + int exit_code; + + trace2_initialize(); + trace2_cmd_start(argv); + + exit_code = cmd_main(argc, argv); + + trace2_cmd_exit(exit_code); + + return exit_code; +} +---------------- + +Command Details:: + + After the basics are established, additional command + information can be sent to Trace2 as it is discovered. ++ +---------------- +int cmd_checkout(int argc, const char **argv) +{ + trace2_cmd_name("checkout"); + trace2_cmd_mode("branch"); + trace2_def_repo(the_repository); + + // emit "def_param" messages for "interesting" config settings. + trace2_cmd_list_config(); + + if (do_something()) + trace2_cmd_error("Path '%s': cannot do something", path); + + return 0; +} +---------------- + +Child Processes:: + + Wrap code spawning child processes. ++ +---------------- +void run_child(...) +{ + int child_exit_code; + struct child_process cmd = CHILD_PROCESS_INIT; + ... + cmd.trace2_child_class = "editor"; + + trace2_child_start(&cmd); + child_exit_code = spawn_child_and_wait_for_it(); + trace2_child_exit(&cmd, child_exit_code); +} +---------------- ++ +For example, the following fetch command spawned ssh, index-pack, +rev-list, and gc. This example also shows that fetch took +5.199 seconds and of that 4.932 was in ssh. ++ +---------------- +$ export GIT_TRACE2_BRIEF=1 +$ export GIT_TRACE2=~/log.normal +$ git fetch origin +... +---------------- ++ +---------------- +$ cat ~/log.normal +version 2.20.1.vfs.1.1.47.g534dbe1ad1 +start git fetch origin +worktree /Users/jeffhost/work/gfw +cmd_name fetch (fetch) +child_start[0] ssh git@github.com ... +child_start[1] git index-pack ... +... (Trace2 events from child processes omitted) +child_exit[1] pid:14707 code:0 elapsed:0.076353 +child_exit[0] pid:14706 code:0 elapsed:4.931869 +child_start[2] git rev-list ... +... (Trace2 events from child process omitted) +child_exit[2] pid:14708 code:0 elapsed:0.110605 +child_start[3] git gc --auto +... (Trace2 events from child process omitted) +child_exit[3] pid:14709 code:0 elapsed:0.006240 +exit elapsed:5.198503 code:0 +atexit elapsed:5.198541 code:0 +---------------- ++ +When a git process is a (direct or indirect) child of another +git process, it inherits Trace2 context information. This +allows the child to print the command hierarchy. This example +shows gc as child[3] of fetch. When the gc process reports +its name as "gc", it also reports the hierarchy as "fetch/gc". +(In this example, trace2 messages from the child process is +indented for clarity.) ++ +---------------- +$ export GIT_TRACE2_BRIEF=1 +$ export GIT_TRACE2=~/log.normal +$ git fetch origin +... +---------------- ++ +---------------- +$ cat ~/log.normal +version 2.20.1.160.g5676107ecd.dirty +start git fetch official +worktree /Users/jeffhost/work/gfw +cmd_name fetch (fetch) +... +child_start[3] git gc --auto + version 2.20.1.160.g5676107ecd.dirty + start /Users/jeffhost/work/gfw/git gc --auto + worktree /Users/jeffhost/work/gfw + cmd_name gc (fetch/gc) + exit elapsed:0.001959 code:0 + atexit elapsed:0.001997 code:0 +child_exit[3] pid:20303 code:0 elapsed:0.007564 +exit elapsed:3.868938 code:0 +atexit elapsed:3.868970 code:0 +---------------- + +Regions:: + + Regions can be use to time an interesting section of code. ++ +---------------- +void wt_status_collect(struct wt_status *s) +{ + trace2_region_enter("status", "worktrees", s->repo); + wt_status_collect_changes_worktree(s); + trace2_region_leave("status", "worktrees", s->repo); + + trace2_region_enter("status", "index", s->repo); + wt_status_collect_changes_index(s); + trace2_region_leave("status", "index", s->repo); + + trace2_region_enter("status", "untracked", s->repo); + wt_status_collect_untracked(s); + trace2_region_leave("status", "untracked", s->repo); +} + +void wt_status_print(struct wt_status *s) +{ + trace2_region_enter("status", "print", s->repo); + switch (s->status_format) { + ... + } + trace2_region_leave("status", "print", s->repo); +} +---------------- ++ +In this example, scanning for untracked files ran from +0.012568 to ++0.027149 (since the process started) and took 0.014581 seconds. ++ +---------------- +$ export GIT_TRACE2_PERF_BRIEF=1 +$ export GIT_TRACE2_PERF=~/log.perf +$ git status +... + +$ cat ~/log.perf +d0 | main | version | | | | | 2.20.1.160.g5676107ecd.dirty +d0 | main | start | | 0.001173 | | | git status +d0 | main | def_repo | r1 | | | | worktree:/Users/jeffhost/work/gfw +d0 | main | cmd_name | | | | | status (status) +... +d0 | main | region_enter | r1 | 0.010988 | | status | label:worktrees +d0 | main | region_leave | r1 | 0.011236 | 0.000248 | status | label:worktrees +d0 | main | region_enter | r1 | 0.011260 | | status | label:index +d0 | main | region_leave | r1 | 0.012542 | 0.001282 | status | label:index +d0 | main | region_enter | r1 | 0.012568 | | status | label:untracked +d0 | main | region_leave | r1 | 0.027149 | 0.014581 | status | label:untracked +d0 | main | region_enter | r1 | 0.027411 | | status | label:print +d0 | main | region_leave | r1 | 0.028741 | 0.001330 | status | label:print +d0 | main | exit | | 0.028778 | | | code:0 +d0 | main | atexit | | 0.028809 | | | code:0 +---------------- ++ +Regions may be nested. This causes messages to be indented in the +PERF target, for example. +Elapsed times are relative to the start of the correpsonding nesting +level as expected. For example, if we add region message to: ++ +---------------- +static enum path_treatment read_directory_recursive(struct dir_struct *dir, + struct index_state *istate, const char *base, int baselen, + struct untracked_cache_dir *untracked, int check_only, + int stop_at_first_file, const struct pathspec *pathspec) +{ + enum path_treatment state, subdir_state, dir_state = path_none; + + trace2_region_enter_printf("dir", "read_recursive", NULL, "%.*s", baselen, base); + ... + trace2_region_leave_printf("dir", "read_recursive", NULL, "%.*s", baselen, base); + return dir_state; +} +---------------- ++ +We can further investigate the time spent scanning for untracked files. ++ +---------------- +$ export GIT_TRACE2_PERF_BRIEF=1 +$ export GIT_TRACE2_PERF=~/log.perf +$ git status +... +$ cat ~/log.perf +d0 | main | version | | | | | 2.20.1.162.gb4ccea44db.dirty +d0 | main | start | | 0.001173 | | | git status +d0 | main | def_repo | r1 | | | | worktree:/Users/jeffhost/work/gfw +d0 | main | cmd_name | | | | | status (status) +... +d0 | main | region_enter | r1 | 0.015047 | | status | label:untracked +d0 | main | region_enter | | 0.015132 | | dir | ..label:read_recursive +d0 | main | region_enter | | 0.016341 | | dir | ....label:read_recursive vcs-svn/ +d0 | main | region_leave | | 0.016422 | 0.000081 | dir | ....label:read_recursive vcs-svn/ +d0 | main | region_enter | | 0.016446 | | dir | ....label:read_recursive xdiff/ +d0 | main | region_leave | | 0.016522 | 0.000076 | dir | ....label:read_recursive xdiff/ +d0 | main | region_enter | | 0.016612 | | dir | ....label:read_recursive git-gui/ +d0 | main | region_enter | | 0.016698 | | dir | ......label:read_recursive git-gui/po/ +d0 | main | region_enter | | 0.016810 | | dir | ........label:read_recursive git-gui/po/glossary/ +d0 | main | region_leave | | 0.016863 | 0.000053 | dir | ........label:read_recursive git-gui/po/glossary/ +... +d0 | main | region_enter | | 0.031876 | | dir | ....label:read_recursive builtin/ +d0 | main | region_leave | | 0.032270 | 0.000394 | dir | ....label:read_recursive builtin/ +d0 | main | region_leave | | 0.032414 | 0.017282 | dir | ..label:read_recursive +d0 | main | region_leave | r1 | 0.032454 | 0.017407 | status | label:untracked +... +d0 | main | exit | | 0.034279 | | | code:0 +d0 | main | atexit | | 0.034322 | | | code:0 +---------------- ++ +Trace2 regions are similar to the existing trace_performance_enter() +and trace_performance_leave() routines, but are thread safe and +maintain per-thread stacks of timers. + +Data Messages:: + + Data messages added to a region. ++ +---------------- +int read_index_from(struct index_state *istate, const char *path, + const char *gitdir) +{ + trace2_region_enter_printf("index", "do_read_index", the_repository, "%s", path); + + ... + + trace2_data_intmax("index", the_repository, "read/version", istate->version); + trace2_data_intmax("index", the_repository, "read/cache_nr", istate->cache_nr); + + trace2_region_leave_printf("index", "do_read_index", the_repository, "%s", path); +} +---------------- ++ +This example shows that the index contained 3552 entries. ++ +---------------- +$ export GIT_TRACE2_PERF_BRIEF=1 +$ export GIT_TRACE2_PERF=~/log.perf +$ git status +... +$ cat ~/log.perf +d0 | main | version | | | | | 2.20.1.156.gf9916ae094.dirty +d0 | main | start | | 0.001173 | | | git status +d0 | main | def_repo | r1 | | | | worktree:/Users/jeffhost/work/gfw +d0 | main | cmd_name | | | | | status (status) +d0 | main | region_enter | r1 | 0.001791 | | index | label:do_read_index .git/index +d0 | main | data | r1 | 0.002494 | 0.000703 | index | ..read/version:2 +d0 | main | data | r1 | 0.002520 | 0.000729 | index | ..read/cache_nr:3552 +d0 | main | region_leave | r1 | 0.002539 | 0.000748 | index | label:do_read_index .git/index +... +---------------- + +Thread Events:: + + Thread messages added to a thread-proc. ++ +For example, the multithreaded preload-index code can be +instrumented with a region around the thread pool and then +per-thread start and exit events within the threadproc. ++ +---------------- +static void *preload_thread(void *_data) +{ + // start the per-thread clock and emit a message. + trace2_thread_start("preload_thread"); + + // report which chunk of the array this thread was assigned. + trace2_data_intmax("index", the_repository, "offset", p->offset); + trace2_data_intmax("index", the_repository, "count", nr); + + do { + ... + } while (--nr > 0); + ... + + // report elapsed time taken by this thread. + trace2_thread_exit(); + return NULL; +} + +void preload_index(struct index_state *index, + const struct pathspec *pathspec, + unsigned int refresh_flags) +{ + trace2_region_enter("index", "preload", the_repository); + + for (i = 0; i < threads; i++) { + ... /* create thread */ + } + + for (i = 0; i < threads; i++) { + ... /* join thread */ + } + + trace2_region_leave("index", "preload", the_repository); +} +---------------- ++ +In this example preload_index() was executed by the `main` thread +and started the `preload` region. Seven threads, named +`th01:preload_thread` through `th07:preload_thread`, were started. +Events from each thread are atomically appended to the shared target +stream as they occur so they may appear in random order with respect +other threads. Finally, the main thread waits for the threads to +finish and leaves the region. ++ +Data events are tagged with the active thread name. They are used +to report the per-thread parameters. ++ +---------------- +$ export GIT_TRACE2_PERF_BRIEF=1 +$ export GIT_TRACE2_PERF=~/log.perf +$ git status +... +$ cat ~/log.perf +... +d0 | main | region_enter | r1 | 0.002595 | | index | label:preload +d0 | th01:preload_thread | thread_start | | 0.002699 | | | +d0 | th02:preload_thread | thread_start | | 0.002721 | | | +d0 | th01:preload_thread | data | r1 | 0.002736 | 0.000037 | index | offset:0 +d0 | th02:preload_thread | data | r1 | 0.002751 | 0.000030 | index | offset:2032 +d0 | th03:preload_thread | thread_start | | 0.002711 | | | +d0 | th06:preload_thread | thread_start | | 0.002739 | | | +d0 | th01:preload_thread | data | r1 | 0.002766 | 0.000067 | index | count:508 +d0 | th06:preload_thread | data | r1 | 0.002856 | 0.000117 | index | offset:2540 +d0 | th03:preload_thread | data | r1 | 0.002824 | 0.000113 | index | offset:1016 +d0 | th04:preload_thread | thread_start | | 0.002710 | | | +d0 | th02:preload_thread | data | r1 | 0.002779 | 0.000058 | index | count:508 +d0 | th06:preload_thread | data | r1 | 0.002966 | 0.000227 | index | count:508 +d0 | th07:preload_thread | thread_start | | 0.002741 | | | +d0 | th07:preload_thread | data | r1 | 0.003017 | 0.000276 | index | offset:3048 +d0 | th05:preload_thread | thread_start | | 0.002712 | | | +d0 | th05:preload_thread | data | r1 | 0.003067 | 0.000355 | index | offset:1524 +d0 | th05:preload_thread | data | r1 | 0.003090 | 0.000378 | index | count:508 +d0 | th07:preload_thread | data | r1 | 0.003037 | 0.000296 | index | count:504 +d0 | th03:preload_thread | data | r1 | 0.002971 | 0.000260 | index | count:508 +d0 | th04:preload_thread | data | r1 | 0.002983 | 0.000273 | index | offset:508 +d0 | th04:preload_thread | data | r1 | 0.007311 | 0.004601 | index | count:508 +d0 | th05:preload_thread | thread_exit | | 0.008781 | 0.006069 | | +d0 | th01:preload_thread | thread_exit | | 0.009561 | 0.006862 | | +d0 | th03:preload_thread | thread_exit | | 0.009742 | 0.007031 | | +d0 | th06:preload_thread | thread_exit | | 0.009820 | 0.007081 | | +d0 | th02:preload_thread | thread_exit | | 0.010274 | 0.007553 | | +d0 | th07:preload_thread | thread_exit | | 0.010477 | 0.007736 | | +d0 | th04:preload_thread | thread_exit | | 0.011657 | 0.008947 | | +d0 | main | region_leave | r1 | 0.011717 | 0.009122 | index | label:preload +... +d0 | main | exit | | 0.029996 | | | code:0 +d0 | main | atexit | | 0.030027 | | | code:0 +---------------- ++ +In this example, the preload region took 0.009122 seconds. The 7 threads +took between 0.006069 and 0.008947 seconds to work on their portion of +the index. Thread "th01" worked on 508 items at offset 0. Thread "th02" +worked on 508 items at offset 2032. Thread "th04" worked on 508 itemts +at offset 508. ++ +This example also shows that thread names are assigned in a racy manner +as each thread starts and allocates TLS storage. + +== Future Work + +=== Relationship to the Existing Trace Api (api-trace.txt) + +There are a few issues to resolve before we can completely +switch to Trace2. + +* Updating existing tests that assume GIT_TRACE format messages. + +* How to best handle custom GIT_TRACE_<key> messages? + +** The GIT_TRACE_<key> mechanism allows each <key> to write to a +different file (in addition to just stderr). + +** Do we want to maintain that ability or simply write to the existing +Trace2 targets (and convert <key> to a "category"). diff --git a/Documentation/technical/api-tree-walking.txt b/Documentation/technical/api-tree-walking.txt new file mode 100644 index 000000000000..bde18622a874 --- /dev/null +++ b/Documentation/technical/api-tree-walking.txt @@ -0,0 +1,147 @@ +tree walking API +================ + +The tree walking API is used to traverse and inspect trees. + +Data Structures +--------------- + +`struct name_entry`:: + + An entry in a tree. Each entry has a sha1 identifier, pathname, and + mode. + +`struct tree_desc`:: + + A semi-opaque data structure used to maintain the current state of the + walk. ++ +* `buffer` is a pointer into the memory representation of the tree. It always +points at the current entry being visited. + +* `size` counts the number of bytes left in the `buffer`. + +* `entry` points to the current entry being visited. + +`struct traverse_info`:: + + A structure used to maintain the state of a traversal. ++ +* `prev` points to the traverse_info which was used to descend into the +current tree. If this is the top-level tree `prev` will point to +a dummy traverse_info. + +* `name` is the entry for the current tree (if the tree is a subtree). + +* `pathlen` is the length of the full path for the current tree. + +* `conflicts` can be used by callbacks to maintain directory-file conflicts. + +* `fn` is a callback called for each entry in the tree. See Traversing for more +information. + +* `data` can be anything the `fn` callback would want to use. + +* `show_all_errors` tells whether to stop at the first error or not. + +Initializing +------------ + +`init_tree_desc`:: + + Initialize a `tree_desc` and decode its first entry. The buffer and + size parameters are assumed to be the same as the buffer and size + members of `struct tree`. + +`fill_tree_descriptor`:: + + Initialize a `tree_desc` and decode its first entry given the + object ID of a tree. Returns the `buffer` member if the latter + is a valid tree identifier and NULL otherwise. + +`setup_traverse_info`:: + + Initialize a `traverse_info` given the pathname of the tree to start + traversing from. The `base` argument is assumed to be the `path` + member of the `name_entry` being recursed into unless the tree is a + top-level tree in which case the empty string ("") is used. + +Walking +------- + +`tree_entry`:: + + Visit the next entry in a tree. Returns 1 when there are more entries + left to visit and 0 when all entries have been visited. This is + commonly used in the test of a while loop. + +`tree_entry_len`:: + + Calculate the length of a tree entry's pathname. This utilizes the + memory structure of a tree entry to avoid the overhead of using a + generic strlen(). + +`update_tree_entry`:: + + Walk to the next entry in a tree. This is commonly used in conjunction + with `tree_entry_extract` to inspect the current entry. + +`tree_entry_extract`:: + + Decode the entry currently being visited (the one pointed to by + `tree_desc's` `entry` member) and return the sha1 of the entry. The + `pathp` and `modep` arguments are set to the entry's pathname and mode + respectively. + +`get_tree_entry`:: + + Find an entry in a tree given a pathname and the sha1 of a tree to + search. Returns 0 if the entry is found and -1 otherwise. The third + and fourth parameters are set to the entry's sha1 and mode + respectively. + +Traversing +---------- + +`traverse_trees`:: + + Traverse `n` number of trees in parallel. The `fn` callback member of + `traverse_info` is called once for each tree entry. + +`traverse_callback_t`:: + The arguments passed to the traverse callback are as follows: ++ +* `n` counts the number of trees being traversed. + +* `mask` has its nth bit set if something exists in the nth entry. + +* `dirmask` has its nth bit set if the nth tree's entry is a directory. + +* `entry` is an array of size `n` where the nth entry is from the nth tree. + +* `info` maintains the state of the traversal. + ++ +Returning a negative value will terminate the traversal. Otherwise the +return value is treated as an update mask. If the nth bit is set the nth tree +will be updated and if the bit is not set the nth tree entry will be the +same in the next callback invocation. + +`make_traverse_path`:: + + Generate the full pathname of a tree entry based from the root of the + traversal. For example, if the traversal has recursed into another + tree named "bar" the pathname of an entry "baz" in the "bar" + tree would be "bar/baz". + +`traverse_path_len`:: + + Calculate the length of a pathname returned by `make_traverse_path`. + This utilizes the memory structure of a tree entry to avoid the + overhead of using a generic strlen(). + +Authors +------- + +Written by Junio C Hamano <gitster@pobox.com> and Linus Torvalds +<torvalds@linux-foundation.org> diff --git a/Documentation/technical/api-xdiff-interface.txt b/Documentation/technical/api-xdiff-interface.txt new file mode 100644 index 000000000000..6296ecad1d65 --- /dev/null +++ b/Documentation/technical/api-xdiff-interface.txt @@ -0,0 +1,7 @@ +xdiff interface API +=================== + +Talk about our calling convention to xdiff library, including +xdiff_emit_consume_fn. + +(Dscho, JC) diff --git a/Documentation/technical/bitmap-format.txt b/Documentation/technical/bitmap-format.txt new file mode 100644 index 000000000000..f8c18a0f7aec --- /dev/null +++ b/Documentation/technical/bitmap-format.txt @@ -0,0 +1,164 @@ +GIT bitmap v1 format +==================== + + - A header appears at the beginning: + + 4-byte signature: {'B', 'I', 'T', 'M'} + + 2-byte version number (network byte order) + The current implementation only supports version 1 + of the bitmap index (the same one as JGit). + + 2-byte flags (network byte order) + + The following flags are supported: + + - BITMAP_OPT_FULL_DAG (0x1) REQUIRED + This flag must always be present. It implies that the bitmap + index has been generated for a packfile with full closure + (i.e. where every single object in the packfile can find + its parent links inside the same packfile). This is a + requirement for the bitmap index format, also present in JGit, + that greatly reduces the complexity of the implementation. + + - BITMAP_OPT_HASH_CACHE (0x4) + If present, the end of the bitmap file contains + `N` 32-bit name-hash values, one per object in the + pack. The format and meaning of the name-hash is + described below. + + 4-byte entry count (network byte order) + + The total count of entries (bitmapped commits) in this bitmap index. + + 20-byte checksum + + The SHA1 checksum of the pack this bitmap index belongs to. + + - 4 EWAH bitmaps that act as type indexes + + Type indexes are serialized after the hash cache in the shape + of four EWAH bitmaps stored consecutively (see Appendix A for + the serialization format of an EWAH bitmap). + + There is a bitmap for each Git object type, stored in the following + order: + + - Commits + - Trees + - Blobs + - Tags + + In each bitmap, the `n`th bit is set to true if the `n`th object + in the packfile is of that type. + + The obvious consequence is that the OR of all 4 bitmaps will result + in a full set (all bits set), and the AND of all 4 bitmaps will + result in an empty bitmap (no bits set). + + - N entries with compressed bitmaps, one for each indexed commit + + Where `N` is the total amount of entries in this bitmap index. + Each entry contains the following: + + - 4-byte object position (network byte order) + The position **in the index for the packfile** where the + bitmap for this commit is found. + + - 1-byte XOR-offset + The xor offset used to compress this bitmap. For an entry + in position `x`, a XOR offset of `y` means that the actual + bitmap representing this commit is composed by XORing the + bitmap for this entry with the bitmap in entry `x-y` (i.e. + the bitmap `y` entries before this one). + + Note that this compression can be recursive. In order to + XOR this entry with a previous one, the previous entry needs + to be decompressed first, and so on. + + The hard-limit for this offset is 160 (an entry can only be + xor'ed against one of the 160 entries preceding it). This + number is always positive, and hence entries are always xor'ed + with **previous** bitmaps, not bitmaps that will come afterwards + in the index. + + - 1-byte flags for this bitmap + At the moment the only available flag is `0x1`, which hints + that this bitmap can be re-used when rebuilding bitmap indexes + for the repository. + + - The compressed bitmap itself, see Appendix A. + +== Appendix A: Serialization format for an EWAH bitmap + +Ewah bitmaps are serialized in the same protocol as the JAVAEWAH +library, making them backwards compatible with the JGit +implementation: + + - 4-byte number of bits of the resulting UNCOMPRESSED bitmap + + - 4-byte number of words of the COMPRESSED bitmap, when stored + + - N x 8-byte words, as specified by the previous field + + This is the actual content of the compressed bitmap. + + - 4-byte position of the current RLW for the compressed + bitmap + +All words are stored in network byte order for their corresponding +sizes. + +The compressed bitmap is stored in a form of run-length encoding, as +follows. It consists of a concatenation of an arbitrary number of +chunks. Each chunk consists of one or more 64-bit words + + H L_1 L_2 L_3 .... L_M + +H is called RLW (run length word). It consists of (from lower to higher +order bits): + + - 1 bit: the repeated bit B + + - 32 bits: repetition count K (unsigned) + + - 31 bits: literal word count M (unsigned) + +The bitstream represented by the above chunk is then: + + - K repetitions of B + + - The bits stored in `L_1` through `L_M`. Within a word, bits at + lower order come earlier in the stream than those at higher + order. + +The next word after `L_M` (if any) must again be a RLW, for the next +chunk. For efficient appending to the bitstream, the EWAH stores a +pointer to the last RLW in the stream. + + +== Appendix B: Optional Bitmap Sections + +These sections may or may not be present in the `.bitmap` file; their +presence is indicated by the header flags section described above. + +Name-hash cache +--------------- + +If the BITMAP_OPT_HASH_CACHE flag is set, the end of the bitmap contains +a cache of 32-bit values, one per object in the pack. The value at +position `i` is the hash of the pathname at which the `i`th object +(counting in index order) in the pack can be found. This can be fed +into the delta heuristics to compare objects with similar pathnames. + +The hash algorithm used is: + + hash = 0; + while ((c = *name++)) + if (!isspace(c)) + hash = (hash >> 2) + (c << 24); + +Note that this hashing scheme is tied to the BITMAP_OPT_HASH_CACHE flag. +If implementations want to choose a different hashing scheme, they are +free to do so, but MUST allocate a new header flag (because comparing +hashes made under two different schemes would be pointless). diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt new file mode 100644 index 000000000000..a4f17441aed3 --- /dev/null +++ b/Documentation/technical/commit-graph-format.txt @@ -0,0 +1,104 @@ +Git commit graph format +======================= + +The Git commit graph stores a list of commit OIDs and some associated +metadata, including: + +- The generation number of the commit. Commits with no parents have + generation number 1; commits with parents have generation number + one more than the maximum generation number of its parents. We + reserve zero as special, and can be used to mark a generation + number invalid or as "not computed". + +- The root tree OID. + +- The commit date. + +- The parents of the commit, stored using positional references within + the graph file. + +These positional references are stored as unsigned 32-bit integers +corresponding to the array position within the list of commit OIDs. Due +to some special constants we use to track parents, we can store at most +(1 << 30) + (1 << 29) + (1 << 28) - 1 (around 1.8 billion) commits. + +== Commit graph files have the following format: + +In order to allow extensions that add extra data to the graph, we organize +the body into "chunks" and provide a binary lookup table at the beginning +of the body. The header includes certain values, such as number of chunks +and hash type. + +All 4-byte numbers are in network order. + +HEADER: + + 4-byte signature: + The signature is: {'C', 'G', 'P', 'H'} + + 1-byte version number: + Currently, the only valid version is 1. + + 1-byte Hash Version (1 = SHA-1) + We infer the hash length (H) from this value. + + 1-byte number (C) of "chunks" + + 1-byte number (B) of base commit-graphs + We infer the length (H*B) of the Base Graphs chunk + from this value. + +CHUNK LOOKUP: + + (C + 1) * 12 bytes listing the table of contents for the chunks: + First 4 bytes describe the chunk id. Value 0 is a terminating label. + Other 8 bytes provide the byte-offset in current file for chunk to + start. (Chunks are ordered contiguously in the file, so you can infer + the length using the next chunk position if necessary.) Each chunk + ID appears at most once. + + The remaining data in the body is described one chunk at a time, and + these chunks may be given in any order. Chunks are required unless + otherwise specified. + +CHUNK DATA: + + OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes) + The ith entry, F[i], stores the number of OIDs with first + byte at most i. Thus F[255] stores the total + number of commits (N). + + OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes) + The OIDs for all commits in the graph, sorted in ascending order. + + Commit Data (ID: {'C', 'D', 'A', 'T' }) (N * (H + 16) bytes) + * The first H bytes are for the OID of the root tree. + * The next 8 bytes are for the positions of the first two parents + of the ith commit. Stores value 0x7000000 if no parent in that + position. If there are more than two parents, the second value + has its most-significant bit on and the other bits store an array + position into the Extra Edge List chunk. + * The next 8 bytes store the generation number of the commit and + the commit time in seconds since EPOCH. The generation number + uses the higher 30 bits of the first 4 bytes, while the commit + time uses the 32 bits of the second 4 bytes, along with the lowest + 2 bits of the lowest byte, storing the 33rd and 34th bit of the + commit time. + + Extra Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional] + This list of 4-byte values store the second through nth parents for + all octopus merges. The second parent value in the commit data stores + an array position within this list along with the most-significant bit + on. Starting at that array position, iterate through this list of commit + positions for the parents until reaching a value with the most-significant + bit on. The other bits correspond to the position of the last parent. + + Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional] + This list of H-byte hashes describe a set of B commit-graph files that + form a commit-graph chain. The graph position for the ith commit in this + file's OID Lookup chunk is equal to i plus the number of commits in all + base graphs. If B is non-zero, this chunk must exist. + +TRAILER: + + H-byte HASH-checksum of all of the above. diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt new file mode 100644 index 000000000000..729fbcb32f87 --- /dev/null +++ b/Documentation/technical/commit-graph.txt @@ -0,0 +1,350 @@ +Git Commit Graph Design Notes +============================= + +Git walks the commit graph for many reasons, including: + +1. Listing and filtering commit history. +2. Computing merge bases. + +These operations can become slow as the commit count grows. The merge +base calculation shows up in many user-facing commands, such as 'merge-base' +or 'status' and can take minutes to compute depending on history shape. + +There are two main costs here: + +1. Decompressing and parsing commits. +2. Walking the entire graph to satisfy topological order constraints. + +The commit-graph file is a supplemental data structure that accelerates +commit graph walks. If a user downgrades or disables the 'core.commitGraph' +config setting, then the existing ODB is sufficient. The file is stored +as "commit-graph" either in the .git/objects/info directory or in the info +directory of an alternate. + +The commit-graph file stores the commit graph structure along with some +extra metadata to speed up graph walks. By listing commit OIDs in lexi- +cographic order, we can identify an integer position for each commit and +refer to the parents of a commit using those integer positions. We use +binary search to find initial commits and then use the integer positions +for fast lookups during the walk. + +A consumer may load the following info for a commit from the graph: + +1. The commit OID. +2. The list of parents, along with their integer position. +3. The commit date. +4. The root tree OID. +5. The generation number (see definition below). + +Values 1-4 satisfy the requirements of parse_commit_gently(). + +Define the "generation number" of a commit recursively as follows: + + * A commit with no parents (a root commit) has generation number one. + + * A commit with at least one parent has generation number one more than + the largest generation number among its parents. + +Equivalently, the generation number of a commit A is one more than the +length of a longest path from A to a root commit. The recursive definition +is easier to use for computation and observing the following property: + + If A and B are commits with generation numbers N and M, respectively, + and N <= M, then A cannot reach B. That is, we know without searching + that B is not an ancestor of A because it is further from a root commit + than A. + + Conversely, when checking if A is an ancestor of B, then we only need + to walk commits until all commits on the walk boundary have generation + number at most N. If we walk commits using a priority queue seeded by + generation numbers, then we always expand the boundary commit with highest + generation number and can easily detect the stopping condition. + +This property can be used to significantly reduce the time it takes to +walk commits and determine topological relationships. Without generation +numbers, the general heuristic is the following: + + If A and B are commits with commit time X and Y, respectively, and + X < Y, then A _probably_ cannot reach B. + +This heuristic is currently used whenever the computation is allowed to +violate topological relationships due to clock skew (such as "git log" +with default order), but is not used when the topological order is +required (such as merge base calculations, "git log --graph"). + +In practice, we expect some commits to be created recently and not stored +in the commit graph. We can treat these commits as having "infinite" +generation number and walk until reaching commits with known generation +number. + +We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not +in the commit-graph file. If a commit-graph file was written by a version +of Git that did not compute generation numbers, then those commits will +have generation number represented by the macro GENERATION_NUMBER_ZERO = 0. + +Since the commit-graph file is closed under reachability, we can guarantee +the following weaker condition on all commits: + + If A and B are commits with generation numbers N amd M, respectively, + and N < M, then A cannot reach B. + +Note how the strict inequality differs from the inequality when we have +fully-computed generation numbers. Using strict inequality may result in +walking a few extra commits, but the simplicity in dealing with commits +with generation number *_INFINITY or *_ZERO is valuable. + +We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose +generation numbers are computed to be at least this value. We limit at +this value since it is the largest value that can be stored in the +commit-graph file using the 30 bits available to generation numbers. This +presents another case where a commit can have generation number equal to +that of a parent. + +Design Details +-------------- + +- The commit-graph file is stored in a file named 'commit-graph' in the + .git/objects/info directory. This could be stored in the info directory + of an alternate. + +- The core.commitGraph config setting must be on to consume graph files. + +- The file format includes parameters for the object ID hash function, + so a future change of hash algorithm does not require a change in format. + +- Commit grafts and replace objects can change the shape of the commit + history. The latter can also be enabled/disabled on the fly using + `--no-replace-objects`. This leads to difficultly storing both possible + interpretations of a commit id, especially when computing generation + numbers. The commit-graph will not be read or written when + replace-objects or grafts are present. + +- Shallow clones create grafts of commits by dropping their parents. This + leads the commit-graph to think those commits have generation number 1. + If and when those commits are made unshallow, those generation numbers + become invalid. Since shallow clones are intended to restrict the commit + history to a very small set of commits, the commit-graph feature is less + helpful for these clones, anyway. The commit-graph will not be read or + written when shallow commits are present. + +Commit Graphs Chains +-------------------- + +Typically, repos grow with near-constant velocity (commits per day). Over time, +the number of commits added by a fetch operation is much smaller than the +number of commits in the full history. By creating a "chain" of commit-graphs, +we enable fast writes of new commit data without rewriting the entire commit +history -- at least, most of the time. + +## File Layout + +A commit-graph chain uses multiple files, and we use a fixed naming convention +to organize these files. Each commit-graph file has a name +`$OBJDIR/info/commit-graphs/graph-{hash}.graph` where `{hash}` is the hex- +valued hash stored in the footer of that file (which is a hash of the file's +contents before that hash). For a chain of commit-graph files, a plain-text +file at `$OBJDIR/info/commit-graphs/commit-graph-chain` contains the +hashes for the files in order from "lowest" to "highest". + +For example, if the `commit-graph-chain` file contains the lines + +``` + {hash0} + {hash1} + {hash2} +``` + +then the commit-graph chain looks like the following diagram: + + +-----------------------+ + | graph-{hash2}.graph | + +-----------------------+ + | + +-----------------------+ + | | + | graph-{hash1}.graph | + | | + +-----------------------+ + | + +-----------------------+ + | | + | | + | | + | graph-{hash0}.graph | + | | + | | + | | + +-----------------------+ + +Let X0 be the number of commits in `graph-{hash0}.graph`, X1 be the number of +commits in `graph-{hash1}.graph`, and X2 be the number of commits in +`graph-{hash2}.graph`. If a commit appears in position i in `graph-{hash2}.graph`, +then we interpret this as being the commit in position (X0 + X1 + i), and that +will be used as its "graph position". The commits in `graph-{hash2}.graph` use these +positions to refer to their parents, which may be in `graph-{hash1}.graph` or +`graph-{hash0}.graph`. We can navigate to an arbitrary commit in position j by checking +its containment in the intervals [0, X0), [X0, X0 + X1), [X0 + X1, X0 + X1 + +X2). + +Each commit-graph file (except the base, `graph-{hash0}.graph`) contains data +specifying the hashes of all files in the lower layers. In the above example, +`graph-{hash1}.graph` contains `{hash0}` while `graph-{hash2}.graph` contains +`{hash0}` and `{hash1}`. + +## Merging commit-graph files + +If we only added a new commit-graph file on every write, we would run into a +linear search problem through many commit-graph files. Instead, we use a merge +strategy to decide when the stack should collapse some number of levels. + +The diagram below shows such a collapse. As a set of new commits are added, it +is determined by the merge strategy that the files should collapse to +`graph-{hash1}`. Thus, the new commits, the commits in `graph-{hash2}` and +the commits in `graph-{hash1}` should be combined into a new `graph-{hash3}` +file. + + +---------------------+ + | | + | (new commits) | + | | + +---------------------+ + | | + +-----------------------+ +---------------------+ + | graph-{hash2} |->| | + +-----------------------+ +---------------------+ + | | | + +-----------------------+ +---------------------+ + | | | | + | graph-{hash1} |->| | + | | | | + +-----------------------+ +---------------------+ + | tmp_graphXXX + +-----------------------+ + | | + | | + | | + | graph-{hash0} | + | | + | | + | | + +-----------------------+ + +During this process, the commits to write are combined, sorted and we write the +contents to a temporary file, all while holding a `commit-graph-chain.lock` +lock-file. When the file is flushed, we rename it to `graph-{hash3}` +according to the computed `{hash3}`. Finally, we write the new chain data to +`commit-graph-chain.lock`: + +``` + {hash3} + {hash0} +``` + +We then close the lock-file. + +## Merge Strategy + +When writing a set of commits that do not exist in the commit-graph stack of +height N, we default to creating a new file at level N + 1. We then decide to +merge with the Nth level if one of two conditions hold: + + 1. `--size-multiple=<X>` is specified or X = 2, and the number of commits in + level N is less than X times the number of commits in level N + 1. + + 2. `--max-commits=<C>` is specified with non-zero C and the number of commits + in level N + 1 is more than C commits. + +This decision cascades down the levels: when we merge a level we create a new +set of commits that then compares to the next level. + +The first condition bounds the number of levels to be logarithmic in the total +number of commits. The second condition bounds the total number of commits in +a `graph-{hashN}` file and not in the `commit-graph` file, preventing +significant performance issues when the stack merges and another process only +partially reads the previous stack. + +The merge strategy values (2 for the size multiple, 64,000 for the maximum +number of commits) could be extracted into config settings for full +flexibility. + +## Deleting graph-{hash} files + +After a new tip file is written, some `graph-{hash}` files may no longer +be part of a chain. It is important to remove these files from disk, eventually. +The main reason to delay removal is that another process could read the +`commit-graph-chain` file before it is rewritten, but then look for the +`graph-{hash}` files after they are deleted. + +To allow holding old split commit-graphs for a while after they are unreferenced, +we update the modified times of the files when they become unreferenced. Then, +we scan the `$OBJDIR/info/commit-graphs/` directory for `graph-{hash}` +files whose modified times are older than a given expiry window. This window +defaults to zero, but can be changed using command-line arguments or a config +setting. + +## Chains across multiple object directories + +In a repo with alternates, we look for the `commit-graph-chain` file starting +in the local object directory and then in each alternate. The first file that +exists defines our chain. As we look for the `graph-{hash}` files for +each `{hash}` in the chain file, we follow the same pattern for the host +directories. + +This allows commit-graphs to be split across multiple forks in a fork network. +The typical case is a large "base" repo with many smaller forks. + +As the base repo advances, it will likely update and merge its commit-graph +chain more frequently than the forks. If a fork updates their commit-graph after +the base repo, then it should "reparent" the commit-graph chain onto the new +chain in the base repo. When reading each `graph-{hash}` file, we track +the object directory containing it. During a write of a new commit-graph file, +we check for any changes in the source object directory and read the +`commit-graph-chain` file for that source and create a new file based on those +files. During this "reparent" operation, we necessarily need to collapse all +levels in the fork, as all of the files are invalid against the new base file. + +It is crucial to be careful when cleaning up "unreferenced" `graph-{hash}.graph` +files in this scenario. It falls to the user to define the proper settings for +their custom environment: + + 1. When merging levels in the base repo, the unreferenced files may still be + referenced by chains from fork repos. + + 2. The expiry time should be set to a length of time such that every fork has + time to recompute their commit-graph chain to "reparent" onto the new base + file(s). + + 3. If the commit-graph chain is updated in the base, the fork will not have + access to the new chain until its chain is updated to reference those files. + (This may change in the future [5].) + +Related Links +------------- +[0] https://bugs.chromium.org/p/git/issues/detail?id=8 + Chromium work item for: Serialized Commit Graph + +[1] https://public-inbox.org/git/20110713070517.GC18566@sigill.intra.peff.net/ + An abandoned patch that introduced generation numbers. + +[2] https://public-inbox.org/git/20170908033403.q7e6dj7benasrjes@sigill.intra.peff.net/ + Discussion about generation numbers on commits and how they interact + with fsck. + +[3] https://public-inbox.org/git/20170908034739.4op3w4f2ma5s65ku@sigill.intra.peff.net/ + More discussion about generation numbers and not storing them inside + commit objects. A valuable quote: + + "I think we should be moving more in the direction of keeping + repo-local caches for optimizations. Reachability bitmaps have been + a big performance win. I think we should be doing the same with our + properties of commits. Not just generation numbers, but making it + cheap to access the graph structure without zlib-inflating whole + commit objects (i.e., packv4 or something like the "metapacks" I + proposed a few years ago)." + +[4] https://public-inbox.org/git/20180108154822.54829-1-git@jeffhostetler.com/T/#u + A patch to remove the ahead-behind calculation from 'status'. + +[5] https://public-inbox.org/git/f27db281-abad-5043-6d71-cbb083b1c877@gmail.com/ + A discussion of a "two-dimensional graph position" that can allow reading + multiple commit-graph chains at the same time. diff --git a/Documentation/technical/directory-rename-detection.txt b/Documentation/technical/directory-rename-detection.txt new file mode 100644 index 000000000000..844629c8c441 --- /dev/null +++ b/Documentation/technical/directory-rename-detection.txt @@ -0,0 +1,115 @@ +Directory rename detection +========================== + +Rename detection logic in diffcore-rename that checks for renames of +individual files is aggregated and analyzed in merge-recursive for cases +where combinations of renames indicate that a full directory has been +renamed. + +Scope of abilities +------------------ + +It is perhaps easiest to start with an example: + + * When all of x/a, x/b and x/c have moved to z/a, z/b and z/c, it is + likely that x/d added in the meantime would also want to move to z/d by + taking the hint that the entire directory 'x' moved to 'z'. + +More interesting possibilities exist, though, such as: + + * one side of history renames x -> z, and the other renames some file to + x/e, causing the need for the merge to do a transitive rename. + + * one side of history renames x -> z, but also renames all files within x. + For example, x/a -> z/alpha, x/b -> z/bravo, etc. + + * both 'x' and 'y' being merged into a single directory 'z', with a + directory rename being detected for both x->z and y->z. + + * not all files in a directory being renamed to the same location; + i.e. perhaps most the files in 'x' are now found under 'z', but a few + are found under 'w'. + + * a directory being renamed, which also contained a subdirectory that was + renamed to some entirely different location. (And perhaps the inner + directory itself contained inner directories that were renamed to yet + other locations). + + * combinations of the above; see t/t6043-merge-rename-directories.sh for + various interesting cases. + +Limitations -- applicability of directory renames +------------------------------------------------- + +In order to prevent edge and corner cases resulting in either conflicts +that cannot be represented in the index or which might be too complex for +users to try to understand and resolve, a couple basic rules limit when +directory rename detection applies: + + 1) If a given directory still exists on both sides of a merge, we do + not consider it to have been renamed. + + 2) If a subset of to-be-renamed files have a file or directory in the + way (or would be in the way of each other), "turn off" the directory + rename for those specific sub-paths and report the conflict to the + user. + + 3) If the other side of history did a directory rename to a path that + your side of history renamed away, then ignore that particular + rename from the other side of history for any implicit directory + renames (but warn the user). + +Limitations -- detailed rules and testcases +------------------------------------------- + +t/t6043-merge-rename-directories.sh contains extensive tests and commentary +which generate and explore the rules listed above. It also lists a few +additional rules: + + a) If renames split a directory into two or more others, the directory + with the most renames, "wins". + + b) Avoid directory-rename-detection for a path, if that path is the + source of a rename on either side of a merge. + + c) Only apply implicit directory renames to directories if the other side + of history is the one doing the renaming. + +Limitations -- support in different commands +-------------------------------------------- + +Directory rename detection is supported by 'merge' and 'cherry-pick'. +Other git commands which users might be surprised to see limited or no +directory rename detection support in: + + * diff + + Folks have requested in the past that `git diff` detect directory + renames and somehow simplify its output. It is not clear whether this + would be desirable or how the output should be simplified, so this was + simply not implemented. Further, to implement this, directory rename + detection logic would need to move from merge-recursive to + diffcore-rename. + + * am + + git-am tries to avoid a full three way merge, instead calling + git-apply. That prevents us from detecting renames at all, which may + defeat the directory rename detection. There is a fallback, though; if + the initial git-apply fails and the user has specified the -3 option, + git-am will fall back to a three way merge. However, git-am lacks the + necessary information to do a "real" three way merge. Instead, it has + to use build_fake_ancestor() to get a merge base that is missing files + whose rename may have been important to detect for directory rename + detection to function. + + * rebase + + Since am-based rebases work by first generating a bunch of patches + (which no longer record what the original commits were and thus don't + have the necessary info from which we can find a real merge-base), and + then calling git-am, this implies that am-based rebases will not always + successfully detect directory renames either (see the 'am' section + above). merged-based rebases (rebase -m) and cherry-pick-based rebases + (rebase -i) are not affected by this shortcoming, and fully support + directory rename detection. diff --git a/Documentation/technical/hash-function-transition.txt b/Documentation/technical/hash-function-transition.txt new file mode 100644 index 000000000000..2ae8fa470ada --- /dev/null +++ b/Documentation/technical/hash-function-transition.txt @@ -0,0 +1,827 @@ +Git hash function transition +============================ + +Objective +--------- +Migrate Git from SHA-1 to a stronger hash function. + +Background +---------- +At its core, the Git version control system is a content addressable +filesystem. It uses the SHA-1 hash function to name content. For +example, files, directories, and revisions are referred to by hash +values unlike in other traditional version control systems where files +or versions are referred to via sequential numbers. The use of a hash +function to address its content delivers a few advantages: + +* Integrity checking is easy. Bit flips, for example, are easily + detected, as the hash of corrupted content does not match its name. +* Lookup of objects is fast. + +Using a cryptographically secure hash function brings additional +advantages: + +* Object names can be signed and third parties can trust the hash to + address the signed object and all objects it references. +* Communication using Git protocol and out of band communication + methods have a short reliable string that can be used to reliably + address stored content. + +Over time some flaws in SHA-1 have been discovered by security +researchers. On 23 February 2017 the SHAttered attack +(https://shattered.io) demonstrated a practical SHA-1 hash collision. + +Git v2.13.0 and later subsequently moved to a hardened SHA-1 +implementation by default, which isn't vulnerable to the SHAttered +attack. + +Thus Git has in effect already migrated to a new hash that isn't SHA-1 +and doesn't share its vulnerabilities, its new hash function just +happens to produce exactly the same output for all known inputs, +except two PDFs published by the SHAttered researchers, and the new +implementation (written by those researchers) claims to detect future +cryptanalytic collision attacks. + +Regardless, it's considered prudent to move past any variant of SHA-1 +to a new hash. There's no guarantee that future attacks on SHA-1 won't +be published in the future, and those attacks may not have viable +mitigations. + +If SHA-1 and its variants were to be truly broken, Git's hash function +could not be considered cryptographically secure any more. This would +impact the communication of hash values because we could not trust +that a given hash value represented the known good version of content +that the speaker intended. + +SHA-1 still possesses the other properties such as fast object lookup +and safe error checking, but other hash functions are equally suitable +that are believed to be cryptographically secure. + +Goals +----- +1. The transition to SHA-256 can be done one local repository at a time. + a. Requiring no action by any other party. + b. A SHA-256 repository can communicate with SHA-1 Git servers + (push/fetch). + c. Users can use SHA-1 and SHA-256 identifiers for objects + interchangeably (see "Object names on the command line", below). + d. New signed objects make use of a stronger hash function than + SHA-1 for their security guarantees. +2. Allow a complete transition away from SHA-1. + a. Local metadata for SHA-1 compatibility can be removed from a + repository if compatibility with SHA-1 is no longer needed. +3. Maintainability throughout the process. + a. The object format is kept simple and consistent. + b. Creation of a generalized repository conversion tool. + +Non-Goals +--------- +1. Add SHA-256 support to Git protocol. This is valuable and the + logical next step but it is out of scope for this initial design. +2. Transparently improving the security of existing SHA-1 signed + objects. +3. Intermixing objects using multiple hash functions in a single + repository. +4. Taking the opportunity to fix other bugs in Git's formats and + protocols. +5. Shallow clones and fetches into a SHA-256 repository. (This will + change when we add SHA-256 support to Git protocol.) +6. Skip fetching some submodules of a project into a SHA-256 + repository. (This also depends on SHA-256 support in Git + protocol.) + +Overview +-------- +We introduce a new repository format extension. Repositories with this +extension enabled use SHA-256 instead of SHA-1 to name their objects. +This affects both object names and object content --- both the names +of objects and all references to other objects within an object are +switched to the new hash function. + +SHA-256 repositories cannot be read by older versions of Git. + +Alongside the packfile, a SHA-256 repository stores a bidirectional +mapping between SHA-256 and SHA-1 object names. The mapping is generated +locally and can be verified using "git fsck". Object lookups use this +mapping to allow naming objects using either their SHA-1 and SHA-256 names +interchangeably. + +"git cat-file" and "git hash-object" gain options to display an object +in its sha1 form and write an object given its sha1 form. This +requires all objects referenced by that object to be present in the +object database so that they can be named using the appropriate name +(using the bidirectional hash mapping). + +Fetches from a SHA-1 based server convert the fetched objects into +SHA-256 form and record the mapping in the bidirectional mapping table +(see below for details). Pushes to a SHA-1 based server convert the +objects being pushed into sha1 form so the server does not have to be +aware of the hash function the client is using. + +Detailed Design +--------------- +Repository format extension +~~~~~~~~~~~~~~~~~~~~~~~~~~~ +A SHA-256 repository uses repository format version `1` (see +Documentation/technical/repository-version.txt) with extensions +`objectFormat` and `compatObjectFormat`: + + [core] + repositoryFormatVersion = 1 + [extensions] + objectFormat = sha256 + compatObjectFormat = sha1 + +The combination of setting `core.repositoryFormatVersion=1` and +populating `extensions.*` ensures that all versions of Git later than +`v0.99.9l` will die instead of trying to operate on the SHA-256 +repository, instead producing an error message. + + # Between v0.99.9l and v2.7.0 + $ git status + fatal: Expected git repo version <= 0, found 1 + # After v2.7.0 + $ git status + fatal: unknown repository extensions found: + objectformat + compatobjectformat + +See the "Transition plan" section below for more details on these +repository extensions. + +Object names +~~~~~~~~~~~~ +Objects can be named by their 40 hexadecimal digit sha1-name or 64 +hexadecimal digit sha256-name, plus names derived from those (see +gitrevisions(7)). + +The sha1-name of an object is the SHA-1 of the concatenation of its +type, length, a nul byte, and the object's sha1-content. This is the +traditional <sha1> used in Git to name objects. + +The sha256-name of an object is the SHA-256 of the concatenation of its +type, length, a nul byte, and the object's sha256-content. + +Object format +~~~~~~~~~~~~~ +The content as a byte sequence of a tag, commit, or tree object named +by sha1 and sha256 differ because an object named by sha256-name refers to +other objects by their sha256-names and an object named by sha1-name +refers to other objects by their sha1-names. + +The sha256-content of an object is the same as its sha1-content, except +that objects referenced by the object are named using their sha256-names +instead of sha1-names. Because a blob object does not refer to any +other object, its sha1-content and sha256-content are the same. + +The format allows round-trip conversion between sha256-content and +sha1-content. + +Object storage +~~~~~~~~~~~~~~ +Loose objects use zlib compression and packed objects use the packed +format described in Documentation/technical/pack-format.txt, just like +today. The content that is compressed and stored uses sha256-content +instead of sha1-content. + +Pack index +~~~~~~~~~~ +Pack index (.idx) files use a new v3 format that supports multiple +hash functions. They have the following format (all integers are in +network byte order): + +- A header appears at the beginning and consists of the following: + - The 4-byte pack index signature: '\377t0c' + - 4-byte version number: 3 + - 4-byte length of the header section, including the signature and + version number + - 4-byte number of objects contained in the pack + - 4-byte number of object formats in this pack index: 2 + - For each object format: + - 4-byte format identifier (e.g., 'sha1' for SHA-1) + - 4-byte length in bytes of shortened object names. This is the + shortest possible length needed to make names in the shortened + object name table unambiguous. + - 4-byte integer, recording where tables relating to this format + are stored in this index file, as an offset from the beginning. + - 4-byte offset to the trailer from the beginning of this file. + - Zero or more additional key/value pairs (4-byte key, 4-byte + value). Only one key is supported: 'PSRC'. See the "Loose objects + and unreachable objects" section for supported values and how this + is used. All other keys are reserved. Readers must ignore + unrecognized keys. +- Zero or more NUL bytes. This can optionally be used to improve the + alignment of the full object name table below. +- Tables for the first object format: + - A sorted table of shortened object names. These are prefixes of + the names of all objects in this pack file, packed together + without offset values to reduce the cache footprint of the binary + search for a specific object name. + + - A table of full object names in pack order. This allows resolving + a reference to "the nth object in the pack file" (from a + reachability bitmap or from the next table of another object + format) to its object name. + + - A table of 4-byte values mapping object name order to pack order. + For an object in the table of sorted shortened object names, the + value at the corresponding index in this table is the index in the + previous table for that same object. + + This can be used to look up the object in reachability bitmaps or + to look up its name in another object format. + + - A table of 4-byte CRC32 values of the packed object data, in the + order that the objects appear in the pack file. This is to allow + compressed data to be copied directly from pack to pack during + repacking without undetected data corruption. + + - A table of 4-byte offset values. For an object in the table of + sorted shortened object names, the value at the corresponding + index in this table indicates where that object can be found in + the pack file. These are usually 31-bit pack file offsets, but + large offsets are encoded as an index into the next table with the + most significant bit set. + + - A table of 8-byte offset entries (empty for pack files less than + 2 GiB). Pack files are organized with heavily used objects toward + the front, so most object references should not need to refer to + this table. +- Zero or more NUL bytes. +- Tables for the second object format, with the same layout as above, + up to and not including the table of CRC32 values. +- Zero or more NUL bytes. +- The trailer consists of the following: + - A copy of the 20-byte SHA-256 checksum at the end of the + corresponding packfile. + + - 20-byte SHA-256 checksum of all of the above. + +Loose object index +~~~~~~~~~~~~~~~~~~ +A new file $GIT_OBJECT_DIR/loose-object-idx contains information about +all loose objects. Its format is + + # loose-object-idx + (sha256-name SP sha1-name LF)* + +where the object names are in hexadecimal format. The file is not +sorted. + +The loose object index is protected against concurrent writes by a +lock file $GIT_OBJECT_DIR/loose-object-idx.lock. To add a new loose +object: + +1. Write the loose object to a temporary file, like today. +2. Open loose-object-idx.lock with O_CREAT | O_EXCL to acquire the lock. +3. Rename the loose object into place. +4. Open loose-object-idx with O_APPEND and write the new object +5. Unlink loose-object-idx.lock to release the lock. + +To remove entries (e.g. in "git pack-refs" or "git-prune"): + +1. Open loose-object-idx.lock with O_CREAT | O_EXCL to acquire the + lock. +2. Write the new content to loose-object-idx.lock. +3. Unlink any loose objects being removed. +4. Rename to replace loose-object-idx, releasing the lock. + +Translation table +~~~~~~~~~~~~~~~~~ +The index files support a bidirectional mapping between sha1-names +and sha256-names. The lookup proceeds similarly to ordinary object +lookups. For example, to convert a sha1-name to a sha256-name: + + 1. Look for the object in idx files. If a match is present in the + idx's sorted list of truncated sha1-names, then: + a. Read the corresponding entry in the sha1-name order to pack + name order mapping. + b. Read the corresponding entry in the full sha1-name table to + verify we found the right object. If it is, then + c. Read the corresponding entry in the full sha256-name table. + That is the object's sha256-name. + 2. Check for a loose object. Read lines from loose-object-idx until + we find a match. + +Step (1) takes the same amount of time as an ordinary object lookup: +O(number of packs * log(objects per pack)). Step (2) takes O(number of +loose objects) time. To maintain good performance it will be necessary +to keep the number of loose objects low. See the "Loose objects and +unreachable objects" section below for more details. + +Since all operations that make new objects (e.g., "git commit") add +the new objects to the corresponding index, this mapping is possible +for all objects in the object store. + +Reading an object's sha1-content +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +The sha1-content of an object can be read by converting all sha256-names +its sha256-content references to sha1-names using the translation table. + +Fetch +~~~~~ +Fetching from a SHA-1 based server requires translating between SHA-1 +and SHA-256 based representations on the fly. + +SHA-1s named in the ref advertisement that are present on the client +can be translated to SHA-256 and looked up as local objects using the +translation table. + +Negotiation proceeds as today. Any "have"s generated locally are +converted to SHA-1 before being sent to the server, and SHA-1s +mentioned by the server are converted to SHA-256 when looking them up +locally. + +After negotiation, the server sends a packfile containing the +requested objects. We convert the packfile to SHA-256 format using +the following steps: + +1. index-pack: inflate each object in the packfile and compute its + SHA-1. Objects can contain deltas in OBJ_REF_DELTA format against + objects the client has locally. These objects can be looked up + using the translation table and their sha1-content read as + described above to resolve the deltas. +2. topological sort: starting at the "want"s from the negotiation + phase, walk through objects in the pack and emit a list of them, + excluding blobs, in reverse topologically sorted order, with each + object coming later in the list than all objects it references. + (This list only contains objects reachable from the "wants". If the + pack from the server contained additional extraneous objects, then + they will be discarded.) +3. convert to sha256: open a new (sha256) packfile. Read the topologically + sorted list just generated. For each object, inflate its + sha1-content, convert to sha256-content, and write it to the sha256 + pack. Record the new sha1<->sha256 mapping entry for use in the idx. +4. sort: reorder entries in the new pack to match the order of objects + in the pack the server generated and include blobs. Write a sha256 idx + file +5. clean up: remove the SHA-1 based pack file, index, and + topologically sorted list obtained from the server in steps 1 + and 2. + +Step 3 requires every object referenced by the new object to be in the +translation table. This is why the topological sort step is necessary. + +As an optimization, step 1 could write a file describing what non-blob +objects each object it has inflated from the packfile references. This +makes the topological sort in step 2 possible without inflating the +objects in the packfile for a second time. The objects need to be +inflated again in step 3, for a total of two inflations. + +Step 4 is probably necessary for good read-time performance. "git +pack-objects" on the server optimizes the pack file for good data +locality (see Documentation/technical/pack-heuristics.txt). + +Details of this process are likely to change. It will take some +experimenting to get this to perform well. + +Push +~~~~ +Push is simpler than fetch because the objects referenced by the +pushed objects are already in the translation table. The sha1-content +of each object being pushed can be read as described in the "Reading +an object's sha1-content" section to generate the pack written by git +send-pack. + +Signed Commits +~~~~~~~~~~~~~~ +We add a new field "gpgsig-sha256" to the commit object format to allow +signing commits without relying on SHA-1. It is similar to the +existing "gpgsig" field. Its signed payload is the sha256-content of the +commit object with any "gpgsig" and "gpgsig-sha256" fields removed. + +This means commits can be signed +1. using SHA-1 only, as in existing signed commit objects +2. using both SHA-1 and SHA-256, by using both gpgsig-sha256 and gpgsig + fields. +3. using only SHA-256, by only using the gpgsig-sha256 field. + +Old versions of "git verify-commit" can verify the gpgsig signature in +cases (1) and (2) without modifications and view case (3) as an +ordinary unsigned commit. + +Signed Tags +~~~~~~~~~~~ +We add a new field "gpgsig-sha256" to the tag object format to allow +signing tags without relying on SHA-1. Its signed payload is the +sha256-content of the tag with its gpgsig-sha256 field and "-----BEGIN PGP +SIGNATURE-----" delimited in-body signature removed. + +This means tags can be signed +1. using SHA-1 only, as in existing signed tag objects +2. using both SHA-1 and SHA-256, by using gpgsig-sha256 and an in-body + signature. +3. using only SHA-256, by only using the gpgsig-sha256 field. + +Mergetag embedding +~~~~~~~~~~~~~~~~~~ +The mergetag field in the sha1-content of a commit contains the +sha1-content of a tag that was merged by that commit. + +The mergetag field in the sha256-content of the same commit contains the +sha256-content of the same tag. + +Submodules +~~~~~~~~~~ +To convert recorded submodule pointers, you need to have the converted +submodule repository in place. The translation table of the submodule +can be used to look up the new hash. + +Loose objects and unreachable objects +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Fast lookups in the loose-object-idx require that the number of loose +objects not grow too high. + +"git gc --auto" currently waits for there to be 6700 loose objects +present before consolidating them into a packfile. We will need to +measure to find a more appropriate threshold for it to use. + +"git gc --auto" currently waits for there to be 50 packs present +before combining packfiles. Packing loose objects more aggressively +may cause the number of pack files to grow too quickly. This can be +mitigated by using a strategy similar to Martin Fick's exponential +rolling garbage collection script: +https://gerrit-review.googlesource.com/c/gerrit/+/35215 + +"git gc" currently expels any unreachable objects it encounters in +pack files to loose objects in an attempt to prevent a race when +pruning them (in case another process is simultaneously writing a new +object that refers to the about-to-be-deleted object). This leads to +an explosion in the number of loose objects present and disk space +usage due to the objects in delta form being replaced with independent +loose objects. Worse, the race is still present for loose objects. + +Instead, "git gc" will need to move unreachable objects to a new +packfile marked as UNREACHABLE_GARBAGE (using the PSRC field; see +below). To avoid the race when writing new objects referring to an +about-to-be-deleted object, code paths that write new objects will +need to copy any objects from UNREACHABLE_GARBAGE packs that they +refer to new, non-UNREACHABLE_GARBAGE packs (or loose objects). +UNREACHABLE_GARBAGE are then safe to delete if their creation time (as +indicated by the file's mtime) is long enough ago. + +To avoid a proliferation of UNREACHABLE_GARBAGE packs, they can be +combined under certain circumstances. If "gc.garbageTtl" is set to +greater than one day, then packs created within a single calendar day, +UTC, can be coalesced together. The resulting packfile would have an +mtime before midnight on that day, so this makes the effective maximum +ttl the garbageTtl + 1 day. If "gc.garbageTtl" is less than one day, +then we divide the calendar day into intervals one-third of that ttl +in duration. Packs created within the same interval can be coalesced +together. The resulting packfile would have an mtime before the end of +the interval, so this makes the effective maximum ttl equal to the +garbageTtl * 4/3. + +This rule comes from Thirumala Reddy Mutchukota's JGit change +https://git.eclipse.org/r/90465. + +The UNREACHABLE_GARBAGE setting goes in the PSRC field of the pack +index. More generally, that field indicates where a pack came from: + + - 1 (PACK_SOURCE_RECEIVE) for a pack received over the network + - 2 (PACK_SOURCE_AUTO) for a pack created by a lightweight + "gc --auto" operation + - 3 (PACK_SOURCE_GC) for a pack created by a full gc + - 4 (PACK_SOURCE_UNREACHABLE_GARBAGE) for potential garbage + discovered by gc + - 5 (PACK_SOURCE_INSERT) for locally created objects that were + written directly to a pack file, e.g. from "git add ." + +This information can be useful for debugging and for "gc --auto" to +make appropriate choices about which packs to coalesce. + +Caveats +------- +Invalid objects +~~~~~~~~~~~~~~~ +The conversion from sha1-content to sha256-content retains any +brokenness in the original object (e.g., tree entry modes encoded with +leading 0, tree objects whose paths are not sorted correctly, and +commit objects without an author or committer). This is a deliberate +feature of the design to allow the conversion to round-trip. + +More profoundly broken objects (e.g., a commit with a truncated "tree" +header line) cannot be converted but were not usable by current Git +anyway. + +Shallow clone and submodules +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Because it requires all referenced objects to be available in the +locally generated translation table, this design does not support +shallow clone or unfetched submodules. Protocol improvements might +allow lifting this restriction. + +Alternates +~~~~~~~~~~ +For the same reason, a sha256 repository cannot borrow objects from a +sha1 repository using objects/info/alternates or +$GIT_ALTERNATE_OBJECT_REPOSITORIES. + +git notes +~~~~~~~~~ +The "git notes" tool annotates objects using their sha1-name as key. +This design does not describe a way to migrate notes trees to use +sha256-names. That migration is expected to happen separately (for +example using a file at the root of the notes tree to describe which +hash it uses). + +Server-side cost +~~~~~~~~~~~~~~~~ +Until Git protocol gains SHA-256 support, using SHA-256 based storage +on public-facing Git servers is strongly discouraged. Once Git +protocol gains SHA-256 support, SHA-256 based servers are likely not +to support SHA-1 compatibility, to avoid what may be a very expensive +hash reencode during clone and to encourage peers to modernize. + +The design described here allows fetches by SHA-1 clients of a +personal SHA-256 repository because it's not much more difficult than +allowing pushes from that repository. This support needs to be guarded +by a configuration option --- servers like git.kernel.org that serve a +large number of clients would not be expected to bear that cost. + +Meaning of signatures +~~~~~~~~~~~~~~~~~~~~~ +The signed payload for signed commits and tags does not explicitly +name the hash used to identify objects. If some day Git adopts a new +hash function with the same length as the current SHA-1 (40 +hexadecimal digit) or SHA-256 (64 hexadecimal digit) objects then the +intent behind the PGP signed payload in an object signature is +unclear: + + object e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7 + type commit + tag v2.12.0 + tagger Junio C Hamano <gitster@pobox.com> 1487962205 -0800 + + Git 2.12 + +Does this mean Git v2.12.0 is the commit with sha1-name +e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7 or the commit with +new-40-digit-hash-name e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7? + +Fortunately SHA-256 and SHA-1 have different lengths. If Git starts +using another hash with the same length to name objects, then it will +need to change the format of signed payloads using that hash to +address this issue. + +Object names on the command line +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +To support the transition (see Transition plan below), this design +supports four different modes of operation: + + 1. ("dark launch") Treat object names input by the user as SHA-1 and + convert any object names written to output to SHA-1, but store + objects using SHA-256. This allows users to test the code with no + visible behavior change except for performance. This allows + allows running even tests that assume the SHA-1 hash function, to + sanity-check the behavior of the new mode. + + 2. ("early transition") Allow both SHA-1 and SHA-256 object names in + input. Any object names written to output use SHA-1. This allows + users to continue to make use of SHA-1 to communicate with peers + (e.g. by email) that have not migrated yet and prepares for mode 3. + + 3. ("late transition") Allow both SHA-1 and SHA-256 object names in + input. Any object names written to output use SHA-256. In this + mode, users are using a more secure object naming method by + default. The disruption is minimal as long as most of their peers + are in mode 2 or mode 3. + + 4. ("post-transition") Treat object names input by the user as + SHA-256 and write output using SHA-256. This is safer than mode 3 + because there is less risk that input is incorrectly interpreted + using the wrong hash function. + +The mode is specified in configuration. + +The user can also explicitly specify which format to use for a +particular revision specifier and for output, overriding the mode. For +example: + +git --output-format=sha1 log abac87a^{sha1}..f787cac^{sha256} + +Choice of Hash +-------------- +In early 2005, around the time that Git was written, Xiaoyun Wang, +Yiqun Lisa Yin, and Hongbo Yu announced an attack finding SHA-1 +collisions in 2^69 operations. In August they published details. +Luckily, no practical demonstrations of a collision in full SHA-1 were +published until 10 years later, in 2017. + +Git v2.13.0 and later subsequently moved to a hardened SHA-1 +implementation by default that mitigates the SHAttered attack, but +SHA-1 is still believed to be weak. + +The hash to replace this hardened SHA-1 should be stronger than SHA-1 +was: we would like it to be trustworthy and useful in practice for at +least 10 years. + +Some other relevant properties: + +1. A 256-bit hash (long enough to match common security practice; not + excessively long to hurt performance and disk usage). + +2. High quality implementations should be widely available (e.g., in + OpenSSL and Apple CommonCrypto). + +3. The hash function's properties should match Git's needs (e.g. Git + requires collision and 2nd preimage resistance and does not require + length extension resistance). + +4. As a tiebreaker, the hash should be fast to compute (fortunately + many contenders are faster than SHA-1). + +We choose SHA-256. + +Transition plan +--------------- +Some initial steps can be implemented independently of one another: +- adding a hash function API (vtable) +- teaching fsck to tolerate the gpgsig-sha256 field +- excluding gpgsig-* from the fields copied by "git commit --amend" +- annotating tests that depend on SHA-1 values with a SHA1 test + prerequisite +- using "struct object_id", GIT_MAX_RAWSZ, and GIT_MAX_HEXSZ + consistently instead of "unsigned char *" and the hardcoded + constants 20 and 40. +- introducing index v3 +- adding support for the PSRC field and safer object pruning + + +The first user-visible change is the introduction of the objectFormat +extension (without compatObjectFormat). This requires: +- implementing the loose-object-idx +- teaching fsck about this mode of operation +- using the hash function API (vtable) when computing object names +- signing objects and verifying signatures +- rejecting attempts to fetch from or push to an incompatible + repository + +Next comes introduction of compatObjectFormat: +- translating object names between object formats +- translating object content between object formats +- generating and verifying signatures in the compat format +- adding appropriate index entries when adding a new object to the + object store +- --output-format option +- ^{sha1} and ^{sha256} revision notation +- configuration to specify default input and output format (see + "Object names on the command line" above) + +The next step is supporting fetches and pushes to SHA-1 repositories: +- allow pushes to a repository using the compat format +- generate a topologically sorted list of the SHA-1 names of fetched + objects +- convert the fetched packfile to sha256 format and generate an idx + file +- re-sort to match the order of objects in the fetched packfile + +The infrastructure supporting fetch also allows converting an existing +repository. In converted repositories and new clones, end users can +gain support for the new hash function without any visible change in +behavior (see "dark launch" in the "Object names on the command line" +section). In particular this allows users to verify SHA-256 signatures +on objects in the repository, and it should ensure the transition code +is stable in production in preparation for using it more widely. + +Over time projects would encourage their users to adopt the "early +transition" and then "late transition" modes to take advantage of the +new, more futureproof SHA-256 object names. + +When objectFormat and compatObjectFormat are both set, commands +generating signatures would generate both SHA-1 and SHA-256 signatures +by default to support both new and old users. + +In projects using SHA-256 heavily, users could be encouraged to adopt +the "post-transition" mode to avoid accidentally making implicit use +of SHA-1 object names. + +Once a critical mass of users have upgraded to a version of Git that +can verify SHA-256 signatures and have converted their existing +repositories to support verifying them, we can add support for a +setting to generate only SHA-256 signatures. This is expected to be at +least a year later. + +That is also a good moment to advertise the ability to convert +repositories to use SHA-256 only, stripping out all SHA-1 related +metadata. This improves performance by eliminating translation +overhead and security by avoiding the possibility of accidentally +relying on the safety of SHA-1. + +Updating Git's protocols to allow a server to specify which hash +functions it supports is also an important part of this transition. It +is not discussed in detail in this document but this transition plan +assumes it happens. :) + +Alternatives considered +----------------------- +Upgrading everyone working on a particular project on a flag day +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Projects like the Linux kernel are large and complex enough that +flipping the switch for all projects based on the repository at once +is infeasible. + +Not only would all developers and server operators supporting +developers have to switch on the same flag day, but supporting tooling +(continuous integration, code review, bug trackers, etc) would have to +be adapted as well. This also makes it difficult to get early feedback +from some project participants testing before it is time for mass +adoption. + +Using hash functions in parallel +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +(e.g. https://public-inbox.org/git/22708.8913.864049.452252@chiark.greenend.org.uk/ ) +Objects newly created would be addressed by the new hash, but inside +such an object (e.g. commit) it is still possible to address objects +using the old hash function. +* You cannot trust its history (needed for bisectability) in the + future without further work +* Maintenance burden as the number of supported hash functions grows + (they will never go away, so they accumulate). In this proposal, by + comparison, converted objects lose all references to SHA-1. + +Signed objects with multiple hashes +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Instead of introducing the gpgsig-sha256 field in commit and tag objects +for sha256-content based signatures, an earlier version of this design +added "hash sha256 <sha256-name>" fields to strengthen the existing +sha1-content based signatures. + +In other words, a single signature was used to attest to the object +content using both hash functions. This had some advantages: +* Using one signature instead of two speeds up the signing process. +* Having one signed payload with both hashes allows the signer to + attest to the sha1-name and sha256-name referring to the same object. +* All users consume the same signature. Broken signatures are likely + to be detected quickly using current versions of git. + +However, it also came with disadvantages: +* Verifying a signed object requires access to the sha1-names of all + objects it references, even after the transition is complete and + translation table is no longer needed for anything else. To support + this, the design added fields such as "hash sha1 tree <sha1-name>" + and "hash sha1 parent <sha1-name>" to the sha256-content of a signed + commit, complicating the conversion process. +* Allowing signed objects without a sha1 (for after the transition is + complete) complicated the design further, requiring a "nohash sha1" + field to suppress including "hash sha1" fields in the sha256-content + and signed payload. + +Lazily populated translation table +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Some of the work of building the translation table could be deferred to +push time, but that would significantly complicate and slow down pushes. +Calculating the sha1-name at object creation time at the same time it is +being streamed to disk and having its sha256-name calculated should be +an acceptable cost. + +Document History +---------------- + +2017-03-03 +bmwill@google.com, jonathantanmy@google.com, jrnieder@gmail.com, +sbeller@google.com + +Initial version sent to +http://public-inbox.org/git/20170304011251.GA26789@aiede.mtv.corp.google.com + +2017-03-03 jrnieder@gmail.com +Incorporated suggestions from jonathantanmy and sbeller: +* describe purpose of signed objects with each hash type +* redefine signed object verification using object content under the + first hash function + +2017-03-06 jrnieder@gmail.com +* Use SHA3-256 instead of SHA2 (thanks, Linus and brian m. carlson).[1][2] +* Make sha3-based signatures a separate field, avoiding the need for + "hash" and "nohash" fields (thanks to peff[3]). +* Add a sorting phase to fetch (thanks to Junio for noticing the need + for this). +* Omit blobs from the topological sort during fetch (thanks to peff). +* Discuss alternates, git notes, and git servers in the caveats + section (thanks to Junio Hamano, brian m. carlson[4], and Shawn + Pearce). +* Clarify language throughout (thanks to various commenters, + especially Junio). + +2017-09-27 jrnieder@gmail.com, sbeller@google.com +* use placeholder NewHash instead of SHA3-256 +* describe criteria for picking a hash function. +* include a transition plan (thanks especially to Brandon Williams + for fleshing these ideas out) +* define the translation table (thanks, Shawn Pearce[5], Jonathan + Tan, and Masaya Suzuki) +* avoid loose object overhead by packing more aggressively in + "git gc --auto" + +Later history: + + See the history of this file in git.git for the history of subsequent + edits. This document history is no longer being maintained as it + would now be superfluous to the commit log + +[1] http://public-inbox.org/git/CA+55aFzJtejiCjV0e43+9oR3QuJK2PiFiLQemytoLpyJWe6P9w@mail.gmail.com/ +[2] http://public-inbox.org/git/CA+55aFz+gkAsDZ24zmePQuEs1XPS9BP_s8O7Q4wQ7LV7X5-oDA@mail.gmail.com/ +[3] http://public-inbox.org/git/20170306084353.nrns455dvkdsfgo5@sigill.intra.peff.net/ +[4] http://public-inbox.org/git/20170304224936.rqqtkdvfjgyezsht@genre.crustytoothpaste.net +[5] https://public-inbox.org/git/CAJo=hJtoX9=AyLHHpUJS7fueV9ciZ_MNpnEPHUz8Whui6g9F0A@mail.gmail.com/ diff --git a/Documentation/technical/http-protocol.txt b/Documentation/technical/http-protocol.txt new file mode 100644 index 000000000000..9c5b6f0facbf --- /dev/null +++ b/Documentation/technical/http-protocol.txt @@ -0,0 +1,518 @@ +HTTP transfer protocols +======================= + +Git supports two HTTP based transfer protocols. A "dumb" protocol +which requires only a standard HTTP server on the server end of the +connection, and a "smart" protocol which requires a Git aware CGI +(or server module). This document describes both protocols. + +As a design feature smart clients can automatically upgrade "dumb" +protocol URLs to smart URLs. This permits all users to have the +same published URL, and the peers automatically select the most +efficient transport available to them. + + +URL Format +---------- + +URLs for Git repositories accessed by HTTP use the standard HTTP +URL syntax documented by RFC 1738, so they are of the form: + + http://<host>:<port>/<path>?<searchpart> + +Within this documentation the placeholder `$GIT_URL` will stand for +the http:// repository URL entered by the end-user. + +Servers SHOULD handle all requests to locations matching `$GIT_URL`, as +both the "smart" and "dumb" HTTP protocols used by Git operate +by appending additional path components onto the end of the user +supplied `$GIT_URL` string. + +An example of a dumb client requesting for a loose object: + + $GIT_URL: http://example.com:8080/git/repo.git + URL request: http://example.com:8080/git/repo.git/objects/d0/49f6c27a2244e12041955e262a404c7faba355 + +An example of a smart request to a catch-all gateway: + + $GIT_URL: http://example.com/daemon.cgi?svc=git&q= + URL request: http://example.com/daemon.cgi?svc=git&q=/info/refs&service=git-receive-pack + +An example of a request to a submodule: + + $GIT_URL: http://example.com/git/repo.git/path/submodule.git + URL request: http://example.com/git/repo.git/path/submodule.git/info/refs + +Clients MUST strip a trailing `/`, if present, from the user supplied +`$GIT_URL` string to prevent empty path tokens (`//`) from appearing +in any URL sent to a server. Compatible clients MUST expand +`$GIT_URL/info/refs` as `foo/info/refs` and not `foo//info/refs`. + + +Authentication +-------------- + +Standard HTTP authentication is used if authentication is required +to access a repository, and MAY be configured and enforced by the +HTTP server software. + +Because Git repositories are accessed by standard path components +server administrators MAY use directory based permissions within +their HTTP server to control repository access. + +Clients SHOULD support Basic authentication as described by RFC 2617. +Servers SHOULD support Basic authentication by relying upon the +HTTP server placed in front of the Git server software. + +Servers SHOULD NOT require HTTP cookies for the purposes of +authentication or access control. + +Clients and servers MAY support other common forms of HTTP based +authentication, such as Digest authentication. + + +SSL +--- + +Clients and servers SHOULD support SSL, particularly to protect +passwords when relying on Basic HTTP authentication. + + +Session State +------------- + +The Git over HTTP protocol (much like HTTP itself) is stateless +from the perspective of the HTTP server side. All state MUST be +retained and managed by the client process. This permits simple +round-robin load-balancing on the server side, without needing to +worry about state management. + +Clients MUST NOT require state management on the server side in +order to function correctly. + +Servers MUST NOT require HTTP cookies in order to function correctly. +Clients MAY store and forward HTTP cookies during request processing +as described by RFC 2616 (HTTP/1.1). Servers SHOULD ignore any +cookies sent by a client. + + +General Request Processing +-------------------------- + +Except where noted, all standard HTTP behavior SHOULD be assumed +by both client and server. This includes (but is not necessarily +limited to): + +If there is no repository at `$GIT_URL`, or the resource pointed to by a +location matching `$GIT_URL` does not exist, the server MUST NOT respond +with `200 OK` response. A server SHOULD respond with +`404 Not Found`, `410 Gone`, or any other suitable HTTP status code +which does not imply the resource exists as requested. + +If there is a repository at `$GIT_URL`, but access is not currently +permitted, the server MUST respond with the `403 Forbidden` HTTP +status code. + +Servers SHOULD support both HTTP 1.0 and HTTP 1.1. +Servers SHOULD support chunked encoding for both request and response +bodies. + +Clients SHOULD support both HTTP 1.0 and HTTP 1.1. +Clients SHOULD support chunked encoding for both request and response +bodies. + +Servers MAY return ETag and/or Last-Modified headers. + +Clients MAY revalidate cached entities by including If-Modified-Since +and/or If-None-Match request headers. + +Servers MAY return `304 Not Modified` if the relevant headers appear +in the request and the entity has not changed. Clients MUST treat +`304 Not Modified` identical to `200 OK` by reusing the cached entity. + +Clients MAY reuse a cached entity without revalidation if the +Cache-Control and/or Expires header permits caching. Clients and +servers MUST follow RFC 2616 for cache controls. + + +Discovering References +---------------------- + +All HTTP clients MUST begin either a fetch or a push exchange by +discovering the references available on the remote repository. + +Dumb Clients +~~~~~~~~~~~~ + +HTTP clients that only support the "dumb" protocol MUST discover +references by making a request for the special info/refs file of +the repository. + +Dumb HTTP clients MUST make a `GET` request to `$GIT_URL/info/refs`, +without any search/query parameters. + + C: GET $GIT_URL/info/refs HTTP/1.0 + + S: 200 OK + S: + S: 95dcfa3633004da0049d3d0fa03f80589cbcaf31 refs/heads/maint + S: d049f6c27a2244e12041955e262a404c7faba355 refs/heads/master + S: 2cb58b79488a98d2721cea644875a8dd0026b115 refs/tags/v1.0 + S: a3c2e2402b99163d1d59756e5f207ae21cccba4c refs/tags/v1.0^{} + +The Content-Type of the returned info/refs entity SHOULD be +`text/plain; charset=utf-8`, but MAY be any content type. +Clients MUST NOT attempt to validate the returned Content-Type. +Dumb servers MUST NOT return a return type starting with +`application/x-git-`. + +Cache-Control headers MAY be returned to disable caching of the +returned entity. + +When examining the response clients SHOULD only examine the HTTP +status code. Valid responses are `200 OK`, or `304 Not Modified`. + +The returned content is a UNIX formatted text file describing +each ref and its known value. The file SHOULD be sorted by name +according to the C locale ordering. The file SHOULD NOT include +the default ref named `HEAD`. + + info_refs = *( ref_record ) + ref_record = any_ref / peeled_ref + + any_ref = obj-id HTAB refname LF + peeled_ref = obj-id HTAB refname LF + obj-id HTAB refname "^{}" LF + +Smart Clients +~~~~~~~~~~~~~ + +HTTP clients that support the "smart" protocol (or both the +"smart" and "dumb" protocols) MUST discover references by making +a parameterized request for the info/refs file of the repository. + +The request MUST contain exactly one query parameter, +`service=$servicename`, where `$servicename` MUST be the service +name the client wishes to contact to complete the operation. +The request MUST NOT contain additional query parameters. + + C: GET $GIT_URL/info/refs?service=git-upload-pack HTTP/1.0 + +dumb server reply: + + S: 200 OK + S: + S: 95dcfa3633004da0049d3d0fa03f80589cbcaf31 refs/heads/maint + S: d049f6c27a2244e12041955e262a404c7faba355 refs/heads/master + S: 2cb58b79488a98d2721cea644875a8dd0026b115 refs/tags/v1.0 + S: a3c2e2402b99163d1d59756e5f207ae21cccba4c refs/tags/v1.0^{} + +smart server reply: + + S: 200 OK + S: Content-Type: application/x-git-upload-pack-advertisement + S: Cache-Control: no-cache + S: + S: 001e# service=git-upload-pack\n + S: 0000 + S: 004895dcfa3633004da0049d3d0fa03f80589cbcaf31 refs/heads/maint\0multi_ack\n + S: 0042d049f6c27a2244e12041955e262a404c7faba355 refs/heads/master\n + S: 003c2cb58b79488a98d2721cea644875a8dd0026b115 refs/tags/v1.0\n + S: 003fa3c2e2402b99163d1d59756e5f207ae21cccba4c refs/tags/v1.0^{}\n + S: 0000 + +The client may send Extra Parameters (see +Documentation/technical/pack-protocol.txt) as a colon-separated string +in the Git-Protocol HTTP header. + +Dumb Server Response +^^^^^^^^^^^^^^^^^^^^ +Dumb servers MUST respond with the dumb server reply format. + +See the prior section under dumb clients for a more detailed +description of the dumb server response. + +Smart Server Response +^^^^^^^^^^^^^^^^^^^^^ +If the server does not recognize the requested service name, or the +requested service name has been disabled by the server administrator, +the server MUST respond with the `403 Forbidden` HTTP status code. + +Otherwise, smart servers MUST respond with the smart server reply +format for the requested service name. + +Cache-Control headers SHOULD be used to disable caching of the +returned entity. + +The Content-Type MUST be `application/x-$servicename-advertisement`. +Clients SHOULD fall back to the dumb protocol if another content +type is returned. When falling back to the dumb protocol clients +SHOULD NOT make an additional request to `$GIT_URL/info/refs`, but +instead SHOULD use the response already in hand. Clients MUST NOT +continue if they do not support the dumb protocol. + +Clients MUST validate the status code is either `200 OK` or +`304 Not Modified`. + +Clients MUST validate the first five bytes of the response entity +matches the regex `^[0-9a-f]{4}#`. If this test fails, clients +MUST NOT continue. + +Clients MUST parse the entire response as a sequence of pkt-line +records. + +Clients MUST verify the first pkt-line is `# service=$servicename`. +Servers MUST set $servicename to be the request parameter value. +Servers SHOULD include an LF at the end of this line. +Clients MUST ignore an LF at the end of the line. + +Servers MUST terminate the response with the magic `0000` end +pkt-line marker. + +The returned response is a pkt-line stream describing each ref and +its known value. The stream SHOULD be sorted by name according to +the C locale ordering. The stream SHOULD include the default ref +named `HEAD` as the first ref. The stream MUST include capability +declarations behind a NUL on the first ref. + +The returned response contains "version 1" if "version=1" was sent as an +Extra Parameter. + + smart_reply = PKT-LINE("# service=$servicename" LF) + "0000" + *1("version 1") + ref_list + "0000" + ref_list = empty_list / non_empty_list + + empty_list = PKT-LINE(zero-id SP "capabilities^{}" NUL cap-list LF) + + non_empty_list = PKT-LINE(obj-id SP name NUL cap_list LF) + *ref_record + + cap-list = capability *(SP capability) + capability = 1*(LC_ALPHA / DIGIT / "-" / "_") + LC_ALPHA = %x61-7A + + ref_record = any_ref / peeled_ref + any_ref = PKT-LINE(obj-id SP name LF) + peeled_ref = PKT-LINE(obj-id SP name LF) + PKT-LINE(obj-id SP name "^{}" LF + + +Smart Service git-upload-pack +------------------------------ +This service reads from the repository pointed to by `$GIT_URL`. + +Clients MUST first perform ref discovery with +`$GIT_URL/info/refs?service=git-upload-pack`. + + C: POST $GIT_URL/git-upload-pack HTTP/1.0 + C: Content-Type: application/x-git-upload-pack-request + C: + C: 0032want 0a53e9ddeaddad63ad106860237bbf53411d11a7\n + C: 0032have 441b40d833fdfa93eb2908e52742248faf0ee993\n + C: 0000 + + S: 200 OK + S: Content-Type: application/x-git-upload-pack-result + S: Cache-Control: no-cache + S: + S: ....ACK %s, continue + S: ....NAK + +Clients MUST NOT reuse or revalidate a cached response. +Servers MUST include sufficient Cache-Control headers +to prevent caching of the response. + +Servers SHOULD support all capabilities defined here. + +Clients MUST send at least one "want" command in the request body. +Clients MUST NOT reference an id in a "want" command which did not +appear in the response obtained through ref discovery unless the +server advertises capability `allow-tip-sha1-in-want` or +`allow-reachable-sha1-in-want`. + + compute_request = want_list + have_list + request_end + request_end = "0000" / "done" + + want_list = PKT-LINE(want SP cap_list LF) + *(want_pkt) + want_pkt = PKT-LINE(want LF) + want = "want" SP id + cap_list = capability *(SP capability) + + have_list = *PKT-LINE("have" SP id LF) + +TODO: Document this further. + +The Negotiation Algorithm +~~~~~~~~~~~~~~~~~~~~~~~~~ +The computation to select the minimal pack proceeds as follows +(C = client, S = server): + +'init step:' + +C: Use ref discovery to obtain the advertised refs. + +C: Place any object seen into set `advertised`. + +C: Build an empty set, `common`, to hold the objects that are later + determined to be on both ends. + +C: Build a set, `want`, of the objects from `advertised` the client + wants to fetch, based on what it saw during ref discovery. + +C: Start a queue, `c_pending`, ordered by commit time (popping newest + first). Add all client refs. When a commit is popped from + the queue its parents SHOULD be automatically inserted back. + Commits MUST only enter the queue once. + +'one compute step:' + +C: Send one `$GIT_URL/git-upload-pack` request: + + C: 0032want <want #1>............................... + C: 0032want <want #2>............................... + .... + C: 0032have <common #1>............................. + C: 0032have <common #2>............................. + .... + C: 0032have <have #1>............................... + C: 0032have <have #2>............................... + .... + C: 0000 + +The stream is organized into "commands", with each command +appearing by itself in a pkt-line. Within a command line, +the text leading up to the first space is the command name, +and the remainder of the line to the first LF is the value. +Command lines are terminated with an LF as the last byte of +the pkt-line value. + +Commands MUST appear in the following order, if they appear +at all in the request stream: + +* "want" +* "have" + +The stream is terminated by a pkt-line flush (`0000`). + +A single "want" or "have" command MUST have one hex formatted +SHA-1 as its value. Multiple SHA-1s MUST be sent by sending +multiple commands. + +The `have` list is created by popping the first 32 commits +from `c_pending`. Less can be supplied if `c_pending` empties. + +If the client has sent 256 "have" commits and has not yet +received one of those back from `s_common`, or the client has +emptied `c_pending` it SHOULD include a "done" command to let +the server know it won't proceed: + + C: 0009done + +S: Parse the git-upload-pack request: + +Verify all objects in `want` are directly reachable from refs. + +The server MAY walk backwards through history or through +the reflog to permit slightly stale requests. + +If no "want" objects are received, send an error: +TODO: Define error if no "want" lines are requested. + +If any "want" object is not reachable, send an error: +TODO: Define error if an invalid "want" is requested. + +Create an empty list, `s_common`. + +If "have" was sent: + +Loop through the objects in the order supplied by the client. + +For each object, if the server has the object reachable from +a ref, add it to `s_common`. If a commit is added to `s_common`, +do not add any ancestors, even if they also appear in `have`. + +S: Send the git-upload-pack response: + +If the server has found a closed set of objects to pack or the +request ends with "done", it replies with the pack. +TODO: Document the pack based response + + S: PACK... + +The returned stream is the side-band-64k protocol supported +by the git-upload-pack service, and the pack is embedded into +stream 1. Progress messages from the server side MAY appear +in stream 2. + +Here a "closed set of objects" is defined to have at least +one path from every "want" to at least one "common" object. + +If the server needs more information, it replies with a +status continue response: +TODO: Document the non-pack response + +C: Parse the upload-pack response: + TODO: Document parsing response + +'Do another compute step.' + + +Smart Service git-receive-pack +------------------------------ +This service reads from the repository pointed to by `$GIT_URL`. + +Clients MUST first perform ref discovery with +`$GIT_URL/info/refs?service=git-receive-pack`. + + C: POST $GIT_URL/git-receive-pack HTTP/1.0 + C: Content-Type: application/x-git-receive-pack-request + C: + C: ....0a53e9ddeaddad63ad106860237bbf53411d11a7 441b40d833fdfa93eb2908e52742248faf0ee993 refs/heads/maint\0 report-status + C: 0000 + C: PACK.... + + S: 200 OK + S: Content-Type: application/x-git-receive-pack-result + S: Cache-Control: no-cache + S: + S: .... + +Clients MUST NOT reuse or revalidate a cached response. +Servers MUST include sufficient Cache-Control headers +to prevent caching of the response. + +Servers SHOULD support all capabilities defined here. + +Clients MUST send at least one command in the request body. +Within the command portion of the request body clients SHOULD send +the id obtained through ref discovery as old_id. + + update_request = command_list + "PACK" <binary data> + + command_list = PKT-LINE(command NUL cap_list LF) + *(command_pkt) + command_pkt = PKT-LINE(command LF) + cap_list = *(SP capability) SP + + command = create / delete / update + create = zero-id SP new_id SP name + delete = old_id SP zero-id SP name + update = old_id SP new_id SP name + +TODO: Document this further. + + +References +---------- + +http://www.ietf.org/rfc/rfc1738.txt[RFC 1738: Uniform Resource Locators (URL)] +http://www.ietf.org/rfc/rfc2616.txt[RFC 2616: Hypertext Transfer Protocol -- HTTP/1.1] +link:technical/pack-protocol.html +link:technical/protocol-capabilities.html diff --git a/Documentation/technical/index-format.txt b/Documentation/technical/index-format.txt new file mode 100644 index 000000000000..7c4d67aa6a7f --- /dev/null +++ b/Documentation/technical/index-format.txt @@ -0,0 +1,357 @@ +Git index format +================ + +== The Git index file has the following format + + All binary numbers are in network byte order. Version 2 is described + here unless stated otherwise. + + - A 12-byte header consisting of + + 4-byte signature: + The signature is { 'D', 'I', 'R', 'C' } (stands for "dircache") + + 4-byte version number: + The current supported versions are 2, 3 and 4. + + 32-bit number of index entries. + + - A number of sorted index entries (see below). + + - Extensions + + Extensions are identified by signature. Optional extensions can + be ignored if Git does not understand them. + + Git currently supports cached tree and resolve undo extensions. + + 4-byte extension signature. If the first byte is 'A'..'Z' the + extension is optional and can be ignored. + + 32-bit size of the extension + + Extension data + + - 160-bit SHA-1 over the content of the index file before this + checksum. + +== Index entry + + Index entries are sorted in ascending order on the name field, + interpreted as a string of unsigned bytes (i.e. memcmp() order, no + localization, no special casing of directory separator '/'). Entries + with the same name are sorted by their stage field. + + 32-bit ctime seconds, the last time a file's metadata changed + this is stat(2) data + + 32-bit ctime nanosecond fractions + this is stat(2) data + + 32-bit mtime seconds, the last time a file's data changed + this is stat(2) data + + 32-bit mtime nanosecond fractions + this is stat(2) data + + 32-bit dev + this is stat(2) data + + 32-bit ino + this is stat(2) data + + 32-bit mode, split into (high to low bits) + + 4-bit object type + valid values in binary are 1000 (regular file), 1010 (symbolic link) + and 1110 (gitlink) + + 3-bit unused + + 9-bit unix permission. Only 0755 and 0644 are valid for regular files. + Symbolic links and gitlinks have value 0 in this field. + + 32-bit uid + this is stat(2) data + + 32-bit gid + this is stat(2) data + + 32-bit file size + This is the on-disk size from stat(2), truncated to 32-bit. + + 160-bit SHA-1 for the represented object + + A 16-bit 'flags' field split into (high to low bits) + + 1-bit assume-valid flag + + 1-bit extended flag (must be zero in version 2) + + 2-bit stage (during merge) + + 12-bit name length if the length is less than 0xFFF; otherwise 0xFFF + is stored in this field. + + (Version 3 or later) A 16-bit field, only applicable if the + "extended flag" above is 1, split into (high to low bits). + + 1-bit reserved for future + + 1-bit skip-worktree flag (used by sparse checkout) + + 1-bit intent-to-add flag (used by "git add -N") + + 13-bit unused, must be zero + + Entry path name (variable length) relative to top level directory + (without leading slash). '/' is used as path separator. The special + path components ".", ".." and ".git" (without quotes) are disallowed. + Trailing slash is also disallowed. + + The exact encoding is undefined, but the '.' and '/' characters + are encoded in 7-bit ASCII and the encoding cannot contain a NUL + byte (iow, this is a UNIX pathname). + + (Version 4) In version 4, the entry path name is prefix-compressed + relative to the path name for the previous entry (the very first + entry is encoded as if the path name for the previous entry is an + empty string). At the beginning of an entry, an integer N in the + variable width encoding (the same encoding as the offset is encoded + for OFS_DELTA pack entries; see pack-format.txt) is stored, followed + by a NUL-terminated string S. Removing N bytes from the end of the + path name for the previous entry, and replacing it with the string S + yields the path name for this entry. + + 1-8 nul bytes as necessary to pad the entry to a multiple of eight bytes + while keeping the name NUL-terminated. + + (Version 4) In version 4, the padding after the pathname does not + exist. + + Interpretation of index entries in split index mode is completely + different. See below for details. + +== Extensions + +=== Cached tree + + Cached tree extension contains pre-computed hashes for trees that can + be derived from the index. It helps speed up tree object generation + from index for a new commit. + + When a path is updated in index, the path must be invalidated and + removed from tree cache. + + The signature for this extension is { 'T', 'R', 'E', 'E' }. + + A series of entries fill the entire extension; each of which + consists of: + + - NUL-terminated path component (relative to its parent directory); + + - ASCII decimal number of entries in the index that is covered by the + tree this entry represents (entry_count); + + - A space (ASCII 32); + + - ASCII decimal number that represents the number of subtrees this + tree has; + + - A newline (ASCII 10); and + + - 160-bit object name for the object that would result from writing + this span of index as a tree. + + An entry can be in an invalidated state and is represented by having + a negative number in the entry_count field. In this case, there is no + object name and the next entry starts immediately after the newline. + When writing an invalid entry, -1 should always be used as entry_count. + + The entries are written out in the top-down, depth-first order. The + first entry represents the root level of the repository, followed by the + first subtree--let's call this A--of the root level (with its name + relative to the root level), followed by the first subtree of A (with + its name relative to A), ... + +=== Resolve undo + + A conflict is represented in the index as a set of higher stage entries. + When a conflict is resolved (e.g. with "git add path"), these higher + stage entries will be removed and a stage-0 entry with proper resolution + is added. + + When these higher stage entries are removed, they are saved in the + resolve undo extension, so that conflicts can be recreated (e.g. with + "git checkout -m"), in case users want to redo a conflict resolution + from scratch. + + The signature for this extension is { 'R', 'E', 'U', 'C' }. + + A series of entries fill the entire extension; each of which + consists of: + + - NUL-terminated pathname the entry describes (relative to the root of + the repository, i.e. full pathname); + + - Three NUL-terminated ASCII octal numbers, entry mode of entries in + stage 1 to 3 (a missing stage is represented by "0" in this field); + and + + - At most three 160-bit object names of the entry in stages from 1 to 3 + (nothing is written for a missing stage). + +=== Split index + + In split index mode, the majority of index entries could be stored + in a separate file. This extension records the changes to be made on + top of that to produce the final index. + + The signature for this extension is { 'l', 'i', 'n', 'k' }. + + The extension consists of: + + - 160-bit SHA-1 of the shared index file. The shared index file path + is $GIT_DIR/sharedindex.<SHA-1>. If all 160 bits are zero, the + index does not require a shared index file. + + - An ewah-encoded delete bitmap, each bit represents an entry in the + shared index. If a bit is set, its corresponding entry in the + shared index will be removed from the final index. Note, because + a delete operation changes index entry positions, but we do need + original positions in replace phase, it's best to just mark + entries for removal, then do a mass deletion after replacement. + + - An ewah-encoded replace bitmap, each bit represents an entry in + the shared index. If a bit is set, its corresponding entry in the + shared index will be replaced with an entry in this index + file. All replaced entries are stored in sorted order in this + index. The first "1" bit in the replace bitmap corresponds to the + first index entry, the second "1" bit to the second entry and so + on. Replaced entries may have empty path names to save space. + + The remaining index entries after replaced ones will be added to the + final index. These added entries are also sorted by entry name then + stage. + +== Untracked cache + + Untracked cache saves the untracked file list and necessary data to + verify the cache. The signature for this extension is { 'U', 'N', + 'T', 'R' }. + + The extension starts with + + - A sequence of NUL-terminated strings, preceded by the size of the + sequence in variable width encoding. Each string describes the + environment where the cache can be used. + + - Stat data of $GIT_DIR/info/exclude. See "Index entry" section from + ctime field until "file size". + + - Stat data of core.excludesfile + + - 32-bit dir_flags (see struct dir_struct) + + - 160-bit SHA-1 of $GIT_DIR/info/exclude. Null SHA-1 means the file + does not exist. + + - 160-bit SHA-1 of core.excludesfile. Null SHA-1 means the file does + not exist. + + - NUL-terminated string of per-dir exclude file name. This usually + is ".gitignore". + + - The number of following directory blocks, variable width + encoding. If this number is zero, the extension ends here with a + following NUL. + + - A number of directory blocks in depth-first-search order, each + consists of + + - The number of untracked entries, variable width encoding. + + - The number of sub-directory blocks, variable width encoding. + + - The directory name terminated by NUL. + + - A number of untracked file/dir names terminated by NUL. + +The remaining data of each directory block is grouped by type: + + - An ewah bitmap, the n-th bit marks whether the n-th directory has + valid untracked cache entries. + + - An ewah bitmap, the n-th bit records "check-only" bit of + read_directory_recursive() for the n-th directory. + + - An ewah bitmap, the n-th bit indicates whether SHA-1 and stat data + is valid for the n-th directory and exists in the next data. + + - An array of stat data. The n-th data corresponds with the n-th + "one" bit in the previous ewah bitmap. + + - An array of SHA-1. The n-th SHA-1 corresponds with the n-th "one" bit + in the previous ewah bitmap. + + - One NUL. + +== File System Monitor cache + + The file system monitor cache tracks files for which the core.fsmonitor + hook has told us about changes. The signature for this extension is + { 'F', 'S', 'M', 'N' }. + + The extension starts with + + - 32-bit version number: the current supported version is 1. + + - 64-bit time: the extension data reflects all changes through the given + time which is stored as the nanoseconds elapsed since midnight, + January 1, 1970. + + - 32-bit bitmap size: the size of the CE_FSMONITOR_VALID bitmap. + + - An ewah bitmap, the n-th bit indicates whether the n-th index entry + is not CE_FSMONITOR_VALID. + +== End of Index Entry + + The End of Index Entry (EOIE) is used to locate the end of the variable + length index entries and the begining of the extensions. Code can take + advantage of this to quickly locate the index extensions without having + to parse through all of the index entries. + + Because it must be able to be loaded before the variable length cache + entries and other index extensions, this extension must be written last. + The signature for this extension is { 'E', 'O', 'I', 'E' }. + + The extension consists of: + + - 32-bit offset to the end of the index entries + + - 160-bit SHA-1 over the extension types and their sizes (but not + their contents). E.g. if we have "TREE" extension that is N-bytes + long, "REUC" extension that is M-bytes long, followed by "EOIE", + then the hash would be: + + SHA-1("TREE" + <binary representation of N> + + "REUC" + <binary representation of M>) + +== Index Entry Offset Table + + The Index Entry Offset Table (IEOT) is used to help address the CPU + cost of loading the index by enabling multi-threading the process of + converting cache entries from the on-disk format to the in-memory format. + The signature for this extension is { 'I', 'E', 'O', 'T' }. + + The extension consists of: + + - 32-bit version (currently 1) + + - A number of index offset entries each consisting of: + + - 32-bit offset from the begining of the file to the first cache entry + in this block of entries. + + - 32-bit count of cache entries in this block diff --git a/Documentation/technical/long-running-process-protocol.txt b/Documentation/technical/long-running-process-protocol.txt new file mode 100644 index 000000000000..aa0aa9af1c2e --- /dev/null +++ b/Documentation/technical/long-running-process-protocol.txt @@ -0,0 +1,50 @@ +Long-running process protocol +============================= + +This protocol is used when Git needs to communicate with an external +process throughout the entire life of a single Git command. All +communication is in pkt-line format (see technical/protocol-common.txt) +over standard input and standard output. + +Handshake +--------- + +Git starts by sending a welcome message (for example, +"git-filter-client"), a list of supported protocol version numbers, and +a flush packet. Git expects to read the welcome message with "server" +instead of "client" (for example, "git-filter-server"), exactly one +protocol version number from the previously sent list, and a flush +packet. All further communication will be based on the selected version. +The remaining protocol description below documents "version=2". Please +note that "version=42" in the example below does not exist and is only +there to illustrate how the protocol would look like with more than one +version. + +After the version negotiation Git sends a list of all capabilities that +it supports and a flush packet. Git expects to read a list of desired +capabilities, which must be a subset of the supported capabilities list, +and a flush packet as response: +------------------------ +packet: git> git-filter-client +packet: git> version=2 +packet: git> version=42 +packet: git> 0000 +packet: git< git-filter-server +packet: git< version=2 +packet: git< 0000 +packet: git> capability=clean +packet: git> capability=smudge +packet: git> capability=not-yet-invented +packet: git> 0000 +packet: git< capability=clean +packet: git< capability=smudge +packet: git< 0000 +------------------------ + +Shutdown +-------- + +Git will close +the command pipe on exit. The filter is expected to detect EOF +and exit gracefully on its own. Git will wait until the filter +process has stopped. diff --git a/Documentation/technical/multi-pack-index.txt b/Documentation/technical/multi-pack-index.txt new file mode 100644 index 000000000000..d7e57639f70d --- /dev/null +++ b/Documentation/technical/multi-pack-index.txt @@ -0,0 +1,109 @@ +Multi-Pack-Index (MIDX) Design Notes +==================================== + +The Git object directory contains a 'pack' directory containing +packfiles (with suffix ".pack") and pack-indexes (with suffix +".idx"). The pack-indexes provide a way to lookup objects and +navigate to their offset within the pack, but these must come +in pairs with the packfiles. This pairing depends on the file +names, as the pack-index differs only in suffix with its pack- +file. While the pack-indexes provide fast lookup per packfile, +this performance degrades as the number of packfiles increases, +because abbreviations need to inspect every packfile and we are +more likely to have a miss on our most-recently-used packfile. +For some large repositories, repacking into a single packfile +is not feasible due to storage space or excessive repack times. + +The multi-pack-index (MIDX for short) stores a list of objects +and their offsets into multiple packfiles. It contains: + +- A list of packfile names. +- A sorted list of object IDs. +- A list of metadata for the ith object ID including: + - A value j referring to the jth packfile. + - An offset within the jth packfile for the object. +- If large offsets are required, we use another list of large + offsets similar to version 2 pack-indexes. + +Thus, we can provide O(log N) lookup time for any number +of packfiles. + +Design Details +-------------- + +- The MIDX is stored in a file named 'multi-pack-index' in the + .git/objects/pack directory. This could be stored in the pack + directory of an alternate. It refers only to packfiles in that + same directory. + +- The pack.multiIndex config setting must be on to consume MIDX files. + +- The file format includes parameters for the object ID hash + function, so a future change of hash algorithm does not require + a change in format. + +- The MIDX keeps only one record per object ID. If an object appears + in multiple packfiles, then the MIDX selects the copy in the most- + recently modified packfile. + +- If there exist packfiles in the pack directory not registered in + the MIDX, then those packfiles are loaded into the `packed_git` + list and `packed_git_mru` cache. + +- The pack-indexes (.idx files) remain in the pack directory so we + can delete the MIDX file, set core.midx to false, or downgrade + without any loss of information. + +- The MIDX file format uses a chunk-based approach (similar to the + commit-graph file) that allows optional data to be added. + +Future Work +----------- + +- Add a 'verify' subcommand to the 'git midx' builtin to verify the + contents of the multi-pack-index file match the offsets listed in + the corresponding pack-indexes. + +- The multi-pack-index allows many packfiles, especially in a context + where repacking is expensive (such as a very large repo), or + unexpected maintenance time is unacceptable (such as a high-demand + build machine). However, the multi-pack-index needs to be rewritten + in full every time. We can extend the format to be incremental, so + writes are fast. By storing a small "tip" multi-pack-index that + points to large "base" MIDX files, we can keep writes fast while + still reducing the number of binary searches required for object + lookups. + +- The reachability bitmap is currently paired directly with a single + packfile, using the pack-order as the object order to hopefully + compress the bitmaps well using run-length encoding. This could be + extended to pair a reachability bitmap with a multi-pack-index. If + the multi-pack-index is extended to store a "stable object order" + (a function Order(hash) = integer that is constant for a given hash, + even as the multi-pack-index is updated) then a reachability bitmap + could point to a multi-pack-index and be updated independently. + +- Packfiles can be marked as "special" using empty files that share + the initial name but replace ".pack" with ".keep" or ".promisor". + We can add an optional chunk of data to the multi-pack-index that + records flags of information about the packfiles. This allows new + states, such as 'repacked' or 'redeltified', that can help with + pack maintenance in a multi-pack environment. It may also be + helpful to organize packfiles by object type (commit, tree, blob, + etc.) and use this metadata to help that maintenance. + +- The partial clone feature records special "promisor" packs that + may point to objects that are not stored locally, but available + on request to a server. The multi-pack-index does not currently + track these promisor packs. + +Related Links +------------- +[0] https://bugs.chromium.org/p/git/issues/detail?id=6 + Chromium work item for: Multi-Pack Index (MIDX) + +[1] https://public-inbox.org/git/20180107181459.222909-1-dstolee@microsoft.com/ + An earlier RFC for the multi-pack-index feature + +[2] https://public-inbox.org/git/alpine.DEB.2.20.1803091557510.23109@alexmv-linux/ + Git Merge 2018 Contributor's summit notes (includes discussion of MIDX) diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt new file mode 100644 index 000000000000..cab5bdd2ff0f --- /dev/null +++ b/Documentation/technical/pack-format.txt @@ -0,0 +1,331 @@ +Git pack format +=============== + +== pack-*.pack files have the following format: + + - A header appears at the beginning and consists of the following: + + 4-byte signature: + The signature is: {'P', 'A', 'C', 'K'} + + 4-byte version number (network byte order): + Git currently accepts version number 2 or 3 but + generates version 2 only. + + 4-byte number of objects contained in the pack (network byte order) + + Observation: we cannot have more than 4G versions ;-) and + more than 4G objects in a pack. + + - The header is followed by number of object entries, each of + which looks like this: + + (undeltified representation) + n-byte type and length (3-bit type, (n-1)*7+4-bit length) + compressed data + + (deltified representation) + n-byte type and length (3-bit type, (n-1)*7+4-bit length) + 20-byte base object name if OBJ_REF_DELTA or a negative relative + offset from the delta object's position in the pack if this + is an OBJ_OFS_DELTA object + compressed delta data + + Observation: length of each object is encoded in a variable + length format and is not constrained to 32-bit or anything. + + - The trailer records 20-byte SHA-1 checksum of all of the above. + +=== Object types + +Valid object types are: + +- OBJ_COMMIT (1) +- OBJ_TREE (2) +- OBJ_BLOB (3) +- OBJ_TAG (4) +- OBJ_OFS_DELTA (6) +- OBJ_REF_DELTA (7) + +Type 5 is reserved for future expansion. Type 0 is invalid. + +=== Deltified representation + +Conceptually there are only four object types: commit, tree, tag and +blob. However to save space, an object could be stored as a "delta" of +another "base" object. These representations are assigned new types +ofs-delta and ref-delta, which is only valid in a pack file. + +Both ofs-delta and ref-delta store the "delta" to be applied to +another object (called 'base object') to reconstruct the object. The +difference between them is, ref-delta directly encodes 20-byte base +object name. If the base object is in the same pack, ofs-delta encodes +the offset of the base object in the pack instead. + +The base object could also be deltified if it's in the same pack. +Ref-delta can also refer to an object outside the pack (i.e. the +so-called "thin pack"). When stored on disk however, the pack should +be self contained to avoid cyclic dependency. + +The delta data is a sequence of instructions to reconstruct an object +from the base object. If the base object is deltified, it must be +converted to canonical form first. Each instruction appends more and +more data to the target object until it's complete. There are two +supported instructions so far: one for copy a byte range from the +source object and one for inserting new data embedded in the +instruction itself. + +Each instruction has variable length. Instruction type is determined +by the seventh bit of the first octet. The following diagrams follow +the convention in RFC 1951 (Deflate compressed data format). + +==== Instruction to copy from base object + + +----------+---------+---------+---------+---------+-------+-------+-------+ + | 1xxxxxxx | offset1 | offset2 | offset3 | offset4 | size1 | size2 | size3 | + +----------+---------+---------+---------+---------+-------+-------+-------+ + +This is the instruction format to copy a byte range from the source +object. It encodes the offset to copy from and the number of bytes to +copy. Offset and size are in little-endian order. + +All offset and size bytes are optional. This is to reduce the +instruction size when encoding small offsets or sizes. The first seven +bits in the first octet determines which of the next seven octets is +present. If bit zero is set, offset1 is present. If bit one is set +offset2 is present and so on. + +Note that a more compact instruction does not change offset and size +encoding. For example, if only offset2 is omitted like below, offset3 +still contains bits 16-23. It does not become offset2 and contains +bits 8-15 even if it's right next to offset1. + + +----------+---------+---------+ + | 10000101 | offset1 | offset3 | + +----------+---------+---------+ + +In its most compact form, this instruction only takes up one byte +(0x80) with both offset and size omitted, which will have default +values zero. There is another exception: size zero is automatically +converted to 0x10000. + +==== Instruction to add new data + + +----------+============+ + | 0xxxxxxx | data | + +----------+============+ + +This is the instruction to construct target object without the base +object. The following data is appended to the target object. The first +seven bits of the first octet determines the size of data in +bytes. The size must be non-zero. + +==== Reserved instruction + + +----------+============ + | 00000000 | + +----------+============ + +This is the instruction reserved for future expansion. + +== Original (version 1) pack-*.idx files have the following format: + + - The header consists of 256 4-byte network byte order + integers. N-th entry of this table records the number of + objects in the corresponding pack, the first byte of whose + object name is less than or equal to N. This is called the + 'first-level fan-out' table. + + - The header is followed by sorted 24-byte entries, one entry + per object in the pack. Each entry is: + + 4-byte network byte order integer, recording where the + object is stored in the packfile as the offset from the + beginning. + + 20-byte object name. + + - The file is concluded with a trailer: + + A copy of the 20-byte SHA-1 checksum at the end of + corresponding packfile. + + 20-byte SHA-1-checksum of all of the above. + +Pack Idx file: + + -- +--------------------------------+ +fanout | fanout[0] = 2 (for example) |-. +table +--------------------------------+ | + | fanout[1] | | + +--------------------------------+ | + | fanout[2] | | + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | + | fanout[255] = total objects |---. + -- +--------------------------------+ | | +main | offset | | | +index | object name 00XXXXXXXXXXXXXXXX | | | +table +--------------------------------+ | | + | offset | | | + | object name 00XXXXXXXXXXXXXXXX | | | + +--------------------------------+<+ | + .-| offset | | + | | object name 01XXXXXXXXXXXXXXXX | | + | +--------------------------------+ | + | | offset | | + | | object name 01XXXXXXXXXXXXXXXX | | + | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | + | | offset | | + | | object name FFXXXXXXXXXXXXXXXX | | + --| +--------------------------------+<--+ +trailer | | packfile checksum | + | +--------------------------------+ + | | idxfile checksum | + | +--------------------------------+ + .-------. + | +Pack file entry: <+ + + packed object header: + 1-byte size extension bit (MSB) + type (next 3 bit) + size0 (lower 4-bit) + n-byte sizeN (as long as MSB is set, each 7-bit) + size0..sizeN form 4+7+7+..+7 bit integer, size0 + is the least significant part, and sizeN is the + most significant part. + packed object data: + If it is not DELTA, then deflated bytes (the size above + is the size before compression). + If it is REF_DELTA, then + 20-byte base object name SHA-1 (the size above is the + size of the delta data that follows). + delta data, deflated. + If it is OFS_DELTA, then + n-byte offset (see below) interpreted as a negative + offset from the type-byte of the header of the + ofs-delta entry (the size above is the size of + the delta data that follows). + delta data, deflated. + + offset encoding: + n bytes with MSB set in all but the last one. + The offset is then the number constructed by + concatenating the lower 7 bit of each byte, and + for n >= 2 adding 2^7 + 2^14 + ... + 2^(7*(n-1)) + to the result. + + + +== Version 2 pack-*.idx files support packs larger than 4 GiB, and + have some other reorganizations. They have the format: + + - A 4-byte magic number '\377tOc' which is an unreasonable + fanout[0] value. + + - A 4-byte version number (= 2) + + - A 256-entry fan-out table just like v1. + + - A table of sorted 20-byte SHA-1 object names. These are + packed together without offset values to reduce the cache + footprint of the binary search for a specific object name. + + - A table of 4-byte CRC32 values of the packed object data. + This is new in v2 so compressed data can be copied directly + from pack to pack during repacking without undetected + data corruption. + + - A table of 4-byte offset values (in network byte order). + These are usually 31-bit pack file offsets, but large + offsets are encoded as an index into the next table with + the msbit set. + + - A table of 8-byte offset entries (empty for pack files less + than 2 GiB). Pack files are organized with heavily used + objects toward the front, so most object references should + not need to refer to this table. + + - The same trailer as a v1 pack file: + + A copy of the 20-byte SHA-1 checksum at the end of + corresponding packfile. + + 20-byte SHA-1-checksum of all of the above. + +== multi-pack-index (MIDX) files have the following format: + +The multi-pack-index files refer to multiple pack-files and loose objects. + +In order to allow extensions that add extra data to the MIDX, we organize +the body into "chunks" and provide a lookup table at the beginning of the +body. The header includes certain length values, such as the number of packs, +the number of base MIDX files, hash lengths and types. + +All 4-byte numbers are in network order. + +HEADER: + + 4-byte signature: + The signature is: {'M', 'I', 'D', 'X'} + + 1-byte version number: + Git only writes or recognizes version 1. + + 1-byte Object Id Version + Git only writes or recognizes version 1 (SHA1). + + 1-byte number of "chunks" + + 1-byte number of base multi-pack-index files: + This value is currently always zero. + + 4-byte number of pack files + +CHUNK LOOKUP: + + (C + 1) * 12 bytes providing the chunk offsets: + First 4 bytes describe chunk id. Value 0 is a terminating label. + Other 8 bytes provide offset in current file for chunk to start. + (Chunks are provided in file-order, so you can infer the length + using the next chunk position if necessary.) + + The remaining data in the body is described one chunk at a time, and + these chunks may be given in any order. Chunks are required unless + otherwise specified. + +CHUNK DATA: + + Packfile Names (ID: {'P', 'N', 'A', 'M'}) + Stores the packfile names as concatenated, null-terminated strings. + Packfiles must be listed in lexicographic order for fast lookups by + name. This is the only chunk not guaranteed to be a multiple of four + bytes in length, so should be the last chunk for alignment reasons. + + OID Fanout (ID: {'O', 'I', 'D', 'F'}) + The ith entry, F[i], stores the number of OIDs with first + byte at most i. Thus F[255] stores the total + number of objects. + + OID Lookup (ID: {'O', 'I', 'D', 'L'}) + The OIDs for all objects in the MIDX are stored in lexicographic + order in this chunk. + + Object Offsets (ID: {'O', 'O', 'F', 'F'}) + Stores two 4-byte values for every object. + 1: The pack-int-id for the pack storing this object. + 2: The offset within the pack. + If all offsets are less than 2^31, then the large offset chunk + will not exist and offsets are stored as in IDX v1. + If there is at least one offset value larger than 2^32-1, then + the large offset chunk must exist. If the large offset chunk + exists and the 31st bit is on, then removing that bit reveals + the row in the large offsets containing the 8-byte offset of + this object. + + [Optional] Object Large Offsets (ID: {'L', 'O', 'F', 'F'}) + 8-byte offsets into large packfiles. + +TRAILER: + + 20-byte SHA1-checksum of the above contents. diff --git a/Documentation/technical/pack-heuristics.txt b/Documentation/technical/pack-heuristics.txt new file mode 100644 index 000000000000..95a07db6e82b --- /dev/null +++ b/Documentation/technical/pack-heuristics.txt @@ -0,0 +1,460 @@ +Concerning Git's Packing Heuristics +=================================== + + Oh, here's a really stupid question: + + Where do I go + to learn the details + of Git's packing heuristics? + +Be careful what you ask! + +Followers of the Git, please open the Git IRC Log and turn to +February 10, 2006. + +It's a rare occasion, and we are joined by the King Git Himself, +Linus Torvalds (linus). Nathaniel Smith, (njs`), has the floor +and seeks enlightenment. Others are present, but silent. + +Let's listen in! + + <njs`> Oh, here's a really stupid question -- where do I go to + learn the details of Git's packing heuristics? google avails + me not, reading the source didn't help a lot, and wading + through the whole mailing list seems less efficient than any + of that. + +It is a bold start! A plea for help combined with a simultaneous +tri-part attack on some of the tried and true mainstays in the quest +for enlightenment. Brash accusations of google being useless. Hubris! +Maligning the source. Heresy! Disdain for the mailing list archives. +Woe. + + <pasky> yes, the packing-related delta stuff is somewhat + mysterious even for me ;) + +Ah! Modesty after all. + + <linus> njs, I don't think the docs exist. That's something where + I don't think anybody else than me even really got involved. + Most of the rest of Git others have been busy with (especially + Junio), but packing nobody touched after I did it. + +It's cryptic, yet vague. Linus in style for sure. Wise men +interpret this as an apology. A few argue it is merely a +statement of fact. + + <njs`> I guess the next step is "read the source again", but I + have to build up a certain level of gumption first :-) + +Indeed! On both points. + + <linus> The packing heuristic is actually really really simple. + +Bait... + + <linus> But strange. + +And switch. That ought to do it! + + <linus> Remember: Git really doesn't follow files. So what it does is + - generate a list of all objects + - sort the list according to magic heuristics + - walk the list, using a sliding window, seeing if an object + can be diffed against another object in the window + - write out the list in recency order + +The traditional understatement: + + <njs`> I suspect that what I'm missing is the precise definition of + the word "magic" + +The traditional insight: + + <pasky> yes + +And Babel-like confusion flowed. + + <njs`> oh, hmm, and I'm not sure what this sliding window means either + + <pasky> iirc, it appeared to me to be just the sha1 of the object + when reading the code casually ... + + ... which simply doesn't sound as a very good heuristics, though ;) + + <njs`> .....and recency order. okay, I think it's clear I didn't + even realize how much I wasn't realizing :-) + +Ah, grasshopper! And thus the enlightenment begins anew. + + <linus> The "magic" is actually in theory totally arbitrary. + ANY order will give you a working pack, but no, it's not + ordered by SHA-1. + + Before talking about the ordering for the sliding delta + window, let's talk about the recency order. That's more + important in one way. + + <njs`> Right, but if all you want is a working way to pack things + together, you could just use cat and save yourself some + trouble... + +Waaait for it.... + + <linus> The recency ordering (which is basically: put objects + _physically_ into the pack in the order that they are + "reachable" from the head) is important. + + <njs`> okay + + <linus> It's important because that's the thing that gives packs + good locality. It keeps the objects close to the head (whether + they are old or new, but they are _reachable_ from the head) + at the head of the pack. So packs actually have absolutely + _wonderful_ IO patterns. + +Read that again, because it is important. + + <linus> But recency ordering is totally useless for deciding how + to actually generate the deltas, so the delta ordering is + something else. + + The delta ordering is (wait for it): + - first sort by the "basename" of the object, as defined by + the name the object was _first_ reached through when + generating the object list + - within the same basename, sort by size of the object + - but always sort different types separately (commits first). + + That's not exactly it, but it's very close. + + <njs`> The "_first_ reached" thing is not too important, just you + need some way to break ties since the same objects may be + reachable many ways, yes? + +And as if to clarify: + + <linus> The point is that it's all really just any random + heuristic, and the ordering is totally unimportant for + correctness, but it helps a lot if the heuristic gives + "clumping" for things that are likely to delta well against + each other. + +It is an important point, so secretly, I did my own research and have +included my results below. To be fair, it has changed some over time. +And through the magic of Revisionistic History, I draw upon this entry +from The Git IRC Logs on my father's birthday, March 1: + + <gitster> The quote from the above linus should be rewritten a + bit (wait for it): + - first sort by type. Different objects never delta with + each other. + - then sort by filename/dirname. hash of the basename + occupies the top BITS_PER_INT-DIR_BITS bits, and bottom + DIR_BITS are for the hash of leading path elements. + - then if we are doing "thin" pack, the objects we are _not_ + going to pack but we know about are sorted earlier than + other objects. + - and finally sort by size, larger to smaller. + +In one swell-foop, clarification and obscurification! Nonetheless, +authoritative. Cryptic, yet concise. It even solicits notions of +quotes from The Source Code. Clearly, more study is needed. + + <gitster> That's the sort order. What this means is: + - we do not delta different object types. + - we prefer to delta the objects with the same full path, but + allow files with the same name from different directories. + - we always prefer to delta against objects we are not going + to send, if there are some. + - we prefer to delta against larger objects, so that we have + lots of removals. + + The penultimate rule is for "thin" packs. It is used when + the other side is known to have such objects. + +There it is again. "Thin" packs. I'm thinking to myself, "What +is a 'thin' pack?" So I ask: + + <jdl> What is a "thin" pack? + + <gitster> Use of --objects-edge to rev-list as the upstream of + pack-objects. The pack transfer protocol negotiates that. + +Woo hoo! Cleared that _right_ up! + + <gitster> There are two directions - push and fetch. + +There! Did you see it? It is not '"push" and "pull"'! How often the +confusion has started here. So casually mentioned, too! + + <gitster> For push, git-send-pack invokes git-receive-pack on the + other end. The receive-pack says "I have up to these commits". + send-pack looks at them, and computes what are missing from + the other end. So "thin" could be the default there. + + In the other direction, fetch, git-fetch-pack and + git-clone-pack invokes git-upload-pack on the other end + (via ssh or by talking to the daemon). + + There are two cases: fetch-pack with -k and clone-pack is one, + fetch-pack without -k is the other. clone-pack and fetch-pack + with -k will keep the downloaded packfile without expanded, so + we do not use thin pack transfer. Otherwise, the generated + pack will have delta without base object in the same pack. + + But fetch-pack without -k will explode the received pack into + individual objects, so we automatically ask upload-pack to + give us a thin pack if upload-pack supports it. + +OK then. + +Uh. + +Let's return to the previous conversation still in progress. + + <njs`> and "basename" means something like "the tail of end of + path of file objects and dir objects, as per basename(3), and + we just declare all commit and tag objects to have the same + basename" or something? + +Luckily, that too is a point that gitster clarified for us! + +If I might add, the trick is to make files that _might_ be similar be +located close to each other in the hash buckets based on their file +names. It used to be that "foo/Makefile", "bar/baz/quux/Makefile" and +"Makefile" all landed in the same bucket due to their common basename, +"Makefile". However, now they land in "close" buckets. + +The algorithm allows not just for the _same_ bucket, but for _close_ +buckets to be considered delta candidates. The rationale is +essentially that files, like Makefiles, often have very similar +content no matter what directory they live in. + + <linus> I played around with different delta algorithms, and with + making the "delta window" bigger, but having too big of a + sliding window makes it very expensive to generate the pack: + you need to compare every object with a _ton_ of other objects. + + There are a number of other trivial heuristics too, which + basically boil down to "don't bother even trying to delta this + pair" if we can tell before-hand that the delta isn't worth it + (due to size differences, where we can take a previous delta + result into account to decide that "ok, no point in trying + that one, it will be worse"). + + End result: packing is actually very size efficient. It's + somewhat CPU-wasteful, but on the other hand, since you're + really only supposed to do it maybe once a month (and you can + do it during the night), nobody really seems to care. + +Nice Engineering Touch, there. Find when it doesn't matter, and +proclaim it a non-issue. Good style too! + + <njs`> So, just to repeat to see if I'm following, we start by + getting a list of the objects we want to pack, we sort it by + this heuristic (basically lexicographically on the tuple + (type, basename, size)). + + Then we walk through this list, and calculate a delta of + each object against the last n (tunable parameter) objects, + and pick the smallest of these deltas. + +Vastly simplified, but the essence is there! + + <linus> Correct. + + <njs`> And then once we have picked a delta or fulltext to + represent each object, we re-sort by recency, and write them + out in that order. + + <linus> Yup. Some other small details: + +And of course there is the "Other Shoe" Factor too. + + <linus> - We limit the delta depth to another magic value (right + now both the window and delta depth magic values are just "10") + + <njs`> Hrm, my intuition is that you'd end up with really _bad_ IO + patterns, because the things you want are near by, but to + actually reconstruct them you may have to jump all over in + random ways. + + <linus> - When we write out a delta, and we haven't yet written + out the object it is a delta against, we write out the base + object first. And no, when we reconstruct them, we actually + get nice IO patterns, because: + - larger objects tend to be "more recent" (Linus' law: files grow) + - we actively try to generate deltas from a larger object to a + smaller one + - this means that the top-of-tree very seldom has deltas + (i.e. deltas in _practice_ are "backwards deltas") + +Again, we should reread that whole paragraph. Not just because +Linus has slipped Linus's Law in there on us, but because it is +important. Let's make sure we clarify some of the points here: + + <njs`> So the point is just that in practice, delta order and + recency order match each other quite well. + + <linus> Yes. There's another nice side to this (and yes, it was + designed that way ;): + - the reason we generate deltas against the larger object is + actually a big space saver too! + + <njs`> Hmm, but your last comment (if "we haven't yet written out + the object it is a delta against, we write out the base object + first"), seems like it would make these facts mostly + irrelevant because even if in practice you would not have to + wander around much, in fact you just brute-force say that in + the cases where you might have to wander, don't do that :-) + + <linus> Yes and no. Notice the rule: we only write out the base + object first if the delta against it was more recent. That + means that you can actually have deltas that refer to a base + object that is _not_ close to the delta object, but that only + happens when the delta is needed to generate an _old_ object. + + <linus> See? + +Yeah, no. I missed that on the first two or three readings myself. + + <linus> This keeps the front of the pack dense. The front of the + pack never contains data that isn't relevant to a "recent" + object. The size optimization comes from our use of xdelta + (but is true for many other delta algorithms): removing data + is cheaper (in size) than adding data. + + When you remove data, you only need to say "copy bytes n--m". + In contrast, in a delta that _adds_ data, you have to say "add + these bytes: 'actual data goes here'" + + *** njs` has quit: Read error: 104 (Connection reset by peer) + + <linus> Uhhuh. I hope I didn't blow njs` mind. + + *** njs` has joined channel #git + + <pasky> :) + +The silent observers are amused. Of course. + +And as if njs` was expected to be omniscient: + + <linus> njs - did you miss anything? + +OK, I'll spell it out. That's Geek Humor. If njs` was not actually +connected for a little bit there, how would he know if missed anything +while he was disconnected? He's a benevolent dictator with a sense of +humor! Well noted! + + <njs`> Stupid router. Or gremlins, or whatever. + +It's a cheap shot at Cisco. Take 'em when you can. + + <njs`> Yes and no. Notice the rule: we only write out the base + object first if the delta against it was more recent. + + I'm getting lost in all these orders, let me re-read :-) + So the write-out order is from most recent to least recent? + (Conceivably it could be the opposite way too, I'm not sure if + we've said) though my connection back at home is logging, so I + can just read what you said there :-) + +And for those of you paying attention, the Omniscient Trick has just +been detailed! + + <linus> Yes, we always write out most recent first + + <njs`> And, yeah, I got the part about deeper-in-history stuff + having worse IO characteristics, one sort of doesn't care. + + <linus> With the caveat that if the "most recent" needs an older + object to delta against (hey, shrinking sometimes does + happen), we write out the old object with the delta. + + <njs`> (if only it happened more...) + + <linus> Anyway, the pack-file could easily be denser still, but + because it's used both for streaming (the Git protocol) and + for on-disk, it has a few pessimizations. + +Actually, it is a made-up word. But it is a made-up word being +used as setup for a later optimization, which is a real word: + + <linus> In particular, while the pack-file is then compressed, + it's compressed just one object at a time, so the actual + compression factor is less than it could be in theory. But it + means that it's all nice random-access with a simple index to + do "object name->location in packfile" translation. + + <njs`> I'm assuming the real win for delta-ing large->small is + more homogeneous statistics for gzip to run over? + + (You have to put the bytes in one place or another, but + putting them in a larger blob wins on compression) + + Actually, what is the compression strategy -- each delta + individually gzipped, the whole file gzipped, somewhere in + between, no compression at all, ....? + + Right. + +Reality IRC sets in. For example: + + <pasky> I'll read the rest in the morning, I really have to go + sleep or there's no hope whatsoever for me at the today's + exam... g'nite all. + +Heh. + + <linus> pasky: g'nite + + <njs`> pasky: 'luck + + <linus> Right: large->small matters exactly because of compression + behaviour. If it was non-compressed, it probably wouldn't make + any difference. + + <njs`> yeah + + <linus> Anyway: I'm not even trying to claim that the pack-files + are perfect, but they do tend to have a nice balance of + density vs ease-of use. + +Gasp! OK, saved. That's a fair Engineering trade off. Close call! +In fact, Linus reflects on some Basic Engineering Fundamentals, +design options, etc. + + <linus> More importantly, they allow Git to still _conceptually_ + never deal with deltas at all, and be a "whole object" store. + + Which has some problems (we discussed bad huge-file + behaviour on the Git lists the other day), but it does mean + that the basic Git concepts are really really simple and + straightforward. + + It's all been quite stable. + + Which I think is very much a result of having very simple + basic ideas, so that there's never any confusion about what's + going on. + + Bugs happen, but they are "simple" bugs. And bugs that + actually get some object store detail wrong are almost always + so obvious that they never go anywhere. + + <njs`> Yeah. + +Nuff said. + + <linus> Anyway. I'm off for bed. It's not 6AM here, but I've got + three kids, and have to get up early in the morning to send + them off. I need my beauty sleep. + + <njs`> :-) + + <njs`> appreciate the infodump, I really was failing to find the + details on Git packs :-) + +And now you know the rest of the story. diff --git a/Documentation/technical/pack-protocol.txt b/Documentation/technical/pack-protocol.txt new file mode 100644 index 000000000000..c73e72de0e9c --- /dev/null +++ b/Documentation/technical/pack-protocol.txt @@ -0,0 +1,674 @@ +Packfile transfer protocols +=========================== + +Git supports transferring data in packfiles over the ssh://, git://, http:// and +file:// transports. There exist two sets of protocols, one for pushing +data from a client to a server and another for fetching data from a +server to a client. The three transports (ssh, git, file) use the same +protocol to transfer data. http is documented in http-protocol.txt. + +The processes invoked in the canonical Git implementation are 'upload-pack' +on the server side and 'fetch-pack' on the client side for fetching data; +then 'receive-pack' on the server and 'send-pack' on the client for pushing +data. The protocol functions to have a server tell a client what is +currently on the server, then for the two to negotiate the smallest amount +of data to send in order to fully update one or the other. + +pkt-line Format +--------------- + +The descriptions below build on the pkt-line format described in +protocol-common.txt. When the grammar indicate `PKT-LINE(...)`, unless +otherwise noted the usual pkt-line LF rules apply: the sender SHOULD +include a LF, but the receiver MUST NOT complain if it is not present. + +An error packet is a special pkt-line that contains an error string. + +---- + error-line = PKT-LINE("ERR" SP explanation-text) +---- + +Throughout the protocol, where `PKT-LINE(...)` is expected, an error packet MAY +be sent. Once this packet is sent by a client or a server, the data transfer +process defined in this protocol is terminated. + +Transports +---------- +There are three transports over which the packfile protocol is +initiated. The Git transport is a simple, unauthenticated server that +takes the command (almost always 'upload-pack', though Git +servers can be configured to be globally writable, in which 'receive- +pack' initiation is also allowed) with which the client wishes to +communicate and executes it and connects it to the requesting +process. + +In the SSH transport, the client just runs the 'upload-pack' +or 'receive-pack' process on the server over the SSH protocol and then +communicates with that invoked process over the SSH connection. + +The file:// transport runs the 'upload-pack' or 'receive-pack' +process locally and communicates with it over a pipe. + +Extra Parameters +---------------- + +The protocol provides a mechanism in which clients can send additional +information in its first message to the server. These are called "Extra +Parameters", and are supported by the Git, SSH, and HTTP protocols. + +Each Extra Parameter takes the form of `<key>=<value>` or `<key>`. + +Servers that receive any such Extra Parameters MUST ignore all +unrecognized keys. Currently, the only Extra Parameter recognized is +"version" with a value of '1' or '2'. See protocol-v2.txt for more +information on protocol version 2. + +Git Transport +------------- + +The Git transport starts off by sending the command and repository +on the wire using the pkt-line format, followed by a NUL byte and a +hostname parameter, terminated by a NUL byte. + + 0033git-upload-pack /project.git\0host=myserver.com\0 + +The transport may send Extra Parameters by adding an additional NUL +byte, and then adding one or more NUL-terminated strings: + + 003egit-upload-pack /project.git\0host=myserver.com\0\0version=1\0 + +-- + git-proto-request = request-command SP pathname NUL + [ host-parameter NUL ] [ NUL extra-parameters ] + request-command = "git-upload-pack" / "git-receive-pack" / + "git-upload-archive" ; case sensitive + pathname = *( %x01-ff ) ; exclude NUL + host-parameter = "host=" hostname [ ":" port ] + extra-parameters = 1*extra-parameter + extra-parameter = 1*( %x01-ff ) NUL +-- + +host-parameter is used for the +git-daemon name based virtual hosting. See --interpolated-path +option to git daemon, with the %H/%CH format characters. + +Basically what the Git client is doing to connect to an 'upload-pack' +process on the server side over the Git protocol is this: + + $ echo -e -n \ + "0039git-upload-pack /schacon/gitbook.git\0host=example.com\0" | + nc -v example.com 9418 + + +SSH Transport +------------- + +Initiating the upload-pack or receive-pack processes over SSH is +executing the binary on the server via SSH remote execution. +It is basically equivalent to running this: + + $ ssh git.example.com "git-upload-pack '/project.git'" + +For a server to support Git pushing and pulling for a given user over +SSH, that user needs to be able to execute one or both of those +commands via the SSH shell that they are provided on login. On some +systems, that shell access is limited to only being able to run those +two commands, or even just one of them. + +In an ssh:// format URI, it's absolute in the URI, so the '/' after +the host name (or port number) is sent as an argument, which is then +read by the remote git-upload-pack exactly as is, so it's effectively +an absolute path in the remote filesystem. + + git clone ssh://user@example.com/project.git + | + v + ssh user@example.com "git-upload-pack '/project.git'" + +In a "user@host:path" format URI, its relative to the user's home +directory, because the Git client will run: + + git clone user@example.com:project.git + | + v + ssh user@example.com "git-upload-pack 'project.git'" + +The exception is if a '~' is used, in which case +we execute it without the leading '/'. + + ssh://user@example.com/~alice/project.git, + | + v + ssh user@example.com "git-upload-pack '~alice/project.git'" + +Depending on the value of the `protocol.version` configuration variable, +Git may attempt to send Extra Parameters as a colon-separated string in +the GIT_PROTOCOL environment variable. This is done only if +the `ssh.variant` configuration variable indicates that the ssh command +supports passing environment variables as an argument. + +A few things to remember here: + +- The "command name" is spelled with dash (e.g. git-upload-pack), but + this can be overridden by the client; + +- The repository path is always quoted with single quotes. + +Fetching Data From a Server +--------------------------- + +When one Git repository wants to get data that a second repository +has, the first can 'fetch' from the second. This operation determines +what data the server has that the client does not then streams that +data down to the client in packfile format. + + +Reference Discovery +------------------- + +When the client initially connects the server will immediately respond +with a version number (if "version=1" is sent as an Extra Parameter), +and a listing of each reference it has (all branches and tags) along +with the object name that each reference currently points to. + + $ echo -e -n "0044git-upload-pack /schacon/gitbook.git\0host=example.com\0\0version=1\0" | + nc -v example.com 9418 + 000aversion 1 + 00887217a7c7e582c46cec22a130adf4b9d7d950fba0 HEAD\0multi_ack thin-pack + side-band side-band-64k ofs-delta shallow no-progress include-tag + 00441d3fcd5ced445d1abc402225c0b8a1299641f497 refs/heads/integration + 003f7217a7c7e582c46cec22a130adf4b9d7d950fba0 refs/heads/master + 003cb88d2441cac0977faf98efc80305012112238d9d refs/tags/v0.9 + 003c525128480b96c89e6418b1e40909bf6c5b2d580f refs/tags/v1.0 + 003fe92df48743b7bc7d26bcaabfddde0a1e20cae47c refs/tags/v1.0^{} + 0000 + +The returned response is a pkt-line stream describing each ref and +its current value. The stream MUST be sorted by name according to +the C locale ordering. + +If HEAD is a valid ref, HEAD MUST appear as the first advertised +ref. If HEAD is not a valid ref, HEAD MUST NOT appear in the +advertisement list at all, but other refs may still appear. + +The stream MUST include capability declarations behind a NUL on the +first ref. The peeled value of a ref (that is "ref^{}") MUST be +immediately after the ref itself, if presented. A conforming server +MUST peel the ref if it's an annotated tag. + +---- + advertised-refs = *1("version 1") + (no-refs / list-of-refs) + *shallow + flush-pkt + + no-refs = PKT-LINE(zero-id SP "capabilities^{}" + NUL capability-list) + + list-of-refs = first-ref *other-ref + first-ref = PKT-LINE(obj-id SP refname + NUL capability-list) + + other-ref = PKT-LINE(other-tip / other-peeled) + other-tip = obj-id SP refname + other-peeled = obj-id SP refname "^{}" + + shallow = PKT-LINE("shallow" SP obj-id) + + capability-list = capability *(SP capability) + capability = 1*(LC_ALPHA / DIGIT / "-" / "_") + LC_ALPHA = %x61-7A +---- + +Server and client MUST use lowercase for obj-id, both MUST treat obj-id +as case-insensitive. + +See protocol-capabilities.txt for a list of allowed server capabilities +and descriptions. + +Packfile Negotiation +-------------------- +After reference and capabilities discovery, the client can decide to +terminate the connection by sending a flush-pkt, telling the server it can +now gracefully terminate, and disconnect, when it does not need any pack +data. This can happen with the ls-remote command, and also can happen when +the client already is up to date. + +Otherwise, it enters the negotiation phase, where the client and +server determine what the minimal packfile necessary for transport is, +by telling the server what objects it wants, its shallow objects +(if any), and the maximum commit depth it wants (if any). The client +will also send a list of the capabilities it wants to be in effect, +out of what the server said it could do with the first 'want' line. + +---- + upload-request = want-list + *shallow-line + *1depth-request + [filter-request] + flush-pkt + + want-list = first-want + *additional-want + + shallow-line = PKT-LINE("shallow" SP obj-id) + + depth-request = PKT-LINE("deepen" SP depth) / + PKT-LINE("deepen-since" SP timestamp) / + PKT-LINE("deepen-not" SP ref) + + first-want = PKT-LINE("want" SP obj-id SP capability-list) + additional-want = PKT-LINE("want" SP obj-id) + + depth = 1*DIGIT + + filter-request = PKT-LINE("filter" SP filter-spec) +---- + +Clients MUST send all the obj-ids it wants from the reference +discovery phase as 'want' lines. Clients MUST send at least one +'want' command in the request body. Clients MUST NOT mention an +obj-id in a 'want' command which did not appear in the response +obtained through ref discovery. + +The client MUST write all obj-ids which it only has shallow copies +of (meaning that it does not have the parents of a commit) as +'shallow' lines so that the server is aware of the limitations of +the client's history. + +The client now sends the maximum commit history depth it wants for +this transaction, which is the number of commits it wants from the +tip of the history, if any, as a 'deepen' line. A depth of 0 is the +same as not making a depth request. The client does not want to receive +any commits beyond this depth, nor does it want objects needed only to +complete those commits. Commits whose parents are not received as a +result are defined as shallow and marked as such in the server. This +information is sent back to the client in the next step. + +The client can optionally request that pack-objects omit various +objects from the packfile using one of several filtering techniques. +These are intended for use with partial clone and partial fetch +operations. An object that does not meet a filter-spec value is +omitted unless explicitly requested in a 'want' line. See `rev-list` +for possible filter-spec values. + +Once all the 'want's and 'shallow's (and optional 'deepen') are +transferred, clients MUST send a flush-pkt, to tell the server side +that it is done sending the list. + +Otherwise, if the client sent a positive depth request, the server +will determine which commits will and will not be shallow and +send this information to the client. If the client did not request +a positive depth, this step is skipped. + +---- + shallow-update = *shallow-line + *unshallow-line + flush-pkt + + shallow-line = PKT-LINE("shallow" SP obj-id) + + unshallow-line = PKT-LINE("unshallow" SP obj-id) +---- + +If the client has requested a positive depth, the server will compute +the set of commits which are no deeper than the desired depth. The set +of commits start at the client's wants. + +The server writes 'shallow' lines for each +commit whose parents will not be sent as a result. The server writes +an 'unshallow' line for each commit which the client has indicated is +shallow, but is no longer shallow at the currently requested depth +(that is, its parents will now be sent). The server MUST NOT mark +as unshallow anything which the client has not indicated was shallow. + +Now the client will send a list of the obj-ids it has using 'have' +lines, so the server can make a packfile that only contains the objects +that the client needs. In multi_ack mode, the canonical implementation +will send up to 32 of these at a time, then will send a flush-pkt. The +canonical implementation will skip ahead and send the next 32 immediately, +so that there is always a block of 32 "in-flight on the wire" at a time. + +---- + upload-haves = have-list + compute-end + + have-list = *have-line + have-line = PKT-LINE("have" SP obj-id) + compute-end = flush-pkt / PKT-LINE("done") +---- + +If the server reads 'have' lines, it then will respond by ACKing any +of the obj-ids the client said it had that the server also has. The +server will ACK obj-ids differently depending on which ack mode is +chosen by the client. + +In multi_ack mode: + + * the server will respond with 'ACK obj-id continue' for any common + commits. + + * once the server has found an acceptable common base commit and is + ready to make a packfile, it will blindly ACK all 'have' obj-ids + back to the client. + + * the server will then send a 'NAK' and then wait for another response + from the client - either a 'done' or another list of 'have' lines. + +In multi_ack_detailed mode: + + * the server will differentiate the ACKs where it is signaling + that it is ready to send data with 'ACK obj-id ready' lines, and + signals the identified common commits with 'ACK obj-id common' lines. + +Without either multi_ack or multi_ack_detailed: + + * upload-pack sends "ACK obj-id" on the first common object it finds. + After that it says nothing until the client gives it a "done". + + * upload-pack sends "NAK" on a flush-pkt if no common object + has been found yet. If one has been found, and thus an ACK + was already sent, it's silent on the flush-pkt. + +After the client has gotten enough ACK responses that it can determine +that the server has enough information to send an efficient packfile +(in the canonical implementation, this is determined when it has received +enough ACKs that it can color everything left in the --date-order queue +as common with the server, or the --date-order queue is empty), or the +client determines that it wants to give up (in the canonical implementation, +this is determined when the client sends 256 'have' lines without getting +any of them ACKed by the server - meaning there is nothing in common and +the server should just send all of its objects), then the client will send +a 'done' command. The 'done' command signals to the server that the client +is ready to receive its packfile data. + +However, the 256 limit *only* turns on in the canonical client +implementation if we have received at least one "ACK %s continue" +during a prior round. This helps to ensure that at least one common +ancestor is found before we give up entirely. + +Once the 'done' line is read from the client, the server will either +send a final 'ACK obj-id' or it will send a 'NAK'. 'obj-id' is the object +name of the last commit determined to be common. The server only sends +ACK after 'done' if there is at least one common base and multi_ack or +multi_ack_detailed is enabled. The server always sends NAK after 'done' +if there is no common base found. + +Instead of 'ACK' or 'NAK', the server may send an error message (for +example, if it does not recognize an object in a 'want' line received +from the client). + +Then the server will start sending its packfile data. + +---- + server-response = *ack_multi ack / nak + ack_multi = PKT-LINE("ACK" SP obj-id ack_status) + ack_status = "continue" / "common" / "ready" + ack = PKT-LINE("ACK" SP obj-id) + nak = PKT-LINE("NAK") +---- + +A simple clone may look like this (with no 'have' lines): + +---- + C: 0054want 74730d410fcb6603ace96f1dc55ea6196122532d multi_ack \ + side-band-64k ofs-delta\n + C: 0032want 7d1665144a3a975c05f1f43902ddaf084e784dbe\n + C: 0032want 5a3f6be755bbb7deae50065988cbfa1ffa9ab68a\n + C: 0032want 7e47fe2bd8d01d481f44d7af0531bd93d3b21c01\n + C: 0032want 74730d410fcb6603ace96f1dc55ea6196122532d\n + C: 0000 + C: 0009done\n + + S: 0008NAK\n + S: [PACKFILE] +---- + +An incremental update (fetch) response might look like this: + +---- + C: 0054want 74730d410fcb6603ace96f1dc55ea6196122532d multi_ack \ + side-band-64k ofs-delta\n + C: 0032want 7d1665144a3a975c05f1f43902ddaf084e784dbe\n + C: 0032want 5a3f6be755bbb7deae50065988cbfa1ffa9ab68a\n + C: 0000 + C: 0032have 7e47fe2bd8d01d481f44d7af0531bd93d3b21c01\n + C: [30 more have lines] + C: 0032have 74730d410fcb6603ace96f1dc55ea6196122532d\n + C: 0000 + + S: 003aACK 7e47fe2bd8d01d481f44d7af0531bd93d3b21c01 continue\n + S: 003aACK 74730d410fcb6603ace96f1dc55ea6196122532d continue\n + S: 0008NAK\n + + C: 0009done\n + + S: 0031ACK 74730d410fcb6603ace96f1dc55ea6196122532d\n + S: [PACKFILE] +---- + + +Packfile Data +------------- + +Now that the client and server have finished negotiation about what +the minimal amount of data that needs to be sent to the client is, the server +will construct and send the required data in packfile format. + +See pack-format.txt for what the packfile itself actually looks like. + +If 'side-band' or 'side-band-64k' capabilities have been specified by +the client, the server will send the packfile data multiplexed. + +Each packet starting with the packet-line length of the amount of data +that follows, followed by a single byte specifying the sideband the +following data is coming in on. + +In 'side-band' mode, it will send up to 999 data bytes plus 1 control +code, for a total of up to 1000 bytes in a pkt-line. In 'side-band-64k' +mode it will send up to 65519 data bytes plus 1 control code, for a +total of up to 65520 bytes in a pkt-line. + +The sideband byte will be a '1', '2' or a '3'. Sideband '1' will contain +packfile data, sideband '2' will be used for progress information that the +client will generally print to stderr and sideband '3' is used for error +information. + +If no 'side-band' capability was specified, the server will stream the +entire packfile without multiplexing. + + +Pushing Data To a Server +------------------------ + +Pushing data to a server will invoke the 'receive-pack' process on the +server, which will allow the client to tell it which references it should +update and then send all the data the server will need for those new +references to be complete. Once all the data is received and validated, +the server will then update its references to what the client specified. + +Authentication +-------------- + +The protocol itself contains no authentication mechanisms. That is to be +handled by the transport, such as SSH, before the 'receive-pack' process is +invoked. If 'receive-pack' is configured over the Git transport, those +repositories will be writable by anyone who can access that port (9418) as +that transport is unauthenticated. + +Reference Discovery +------------------- + +The reference discovery phase is done nearly the same way as it is in the +fetching protocol. Each reference obj-id and name on the server is sent +in packet-line format to the client, followed by a flush-pkt. The only +real difference is that the capability listing is different - the only +possible values are 'report-status', 'delete-refs', 'ofs-delta' and +'push-options'. + +Reference Update Request and Packfile Transfer +---------------------------------------------- + +Once the client knows what references the server is at, it can send a +list of reference update requests. For each reference on the server +that it wants to update, it sends a line listing the obj-id currently on +the server, the obj-id the client would like to update it to and the name +of the reference. + +This list is followed by a flush-pkt. + +---- + update-requests = *shallow ( command-list | push-cert ) + + shallow = PKT-LINE("shallow" SP obj-id) + + command-list = PKT-LINE(command NUL capability-list) + *PKT-LINE(command) + flush-pkt + + command = create / delete / update + create = zero-id SP new-id SP name + delete = old-id SP zero-id SP name + update = old-id SP new-id SP name + + old-id = obj-id + new-id = obj-id + + push-cert = PKT-LINE("push-cert" NUL capability-list LF) + PKT-LINE("certificate version 0.1" LF) + PKT-LINE("pusher" SP ident LF) + PKT-LINE("pushee" SP url LF) + PKT-LINE("nonce" SP nonce LF) + *PKT-LINE("push-option" SP push-option LF) + PKT-LINE(LF) + *PKT-LINE(command LF) + *PKT-LINE(gpg-signature-lines LF) + PKT-LINE("push-cert-end" LF) + + push-option = 1*( VCHAR | SP ) +---- + +If the server has advertised the 'push-options' capability and the client has +specified 'push-options' as part of the capability list above, the client then +sends its push options followed by a flush-pkt. + +---- + push-options = *PKT-LINE(push-option) flush-pkt +---- + +For backwards compatibility with older Git servers, if the client sends a push +cert and push options, it MUST send its push options both embedded within the +push cert and after the push cert. (Note that the push options within the cert +are prefixed, but the push options after the cert are not.) Both these lists +MUST be the same, modulo the prefix. + +After that the packfile that +should contain all the objects that the server will need to complete the new +references will be sent. + +---- + packfile = "PACK" 28*(OCTET) +---- + +If the receiving end does not support delete-refs, the sending end MUST +NOT ask for delete command. + +If the receiving end does not support push-cert, the sending end +MUST NOT send a push-cert command. When a push-cert command is +sent, command-list MUST NOT be sent; the commands recorded in the +push certificate is used instead. + +The packfile MUST NOT be sent if the only command used is 'delete'. + +A packfile MUST be sent if either create or update command is used, +even if the server already has all the necessary objects. In this +case the client MUST send an empty packfile. The only time this +is likely to happen is if the client is creating +a new branch or a tag that points to an existing obj-id. + +The server will receive the packfile, unpack it, then validate each +reference that is being updated that it hasn't changed while the request +was being processed (the obj-id is still the same as the old-id), and +it will run any update hooks to make sure that the update is acceptable. +If all of that is fine, the server will then update the references. + +Push Certificate +---------------- + +A push certificate begins with a set of header lines. After the +header and an empty line, the protocol commands follow, one per +line. Note that the trailing LF in push-cert PKT-LINEs is _not_ +optional; it must be present. + +Currently, the following header fields are defined: + +`pusher` ident:: + Identify the GPG key in "Human Readable Name <email@address>" + format. + +`pushee` url:: + The repository URL (anonymized, if the URL contains + authentication material) the user who ran `git push` + intended to push into. + +`nonce` nonce:: + The 'nonce' string the receiving repository asked the + pushing user to include in the certificate, to prevent + replay attacks. + +The GPG signature lines are a detached signature for the contents +recorded in the push certificate before the signature block begins. +The detached signature is used to certify that the commands were +given by the pusher, who must be the signer. + +Report Status +------------- + +After receiving the pack data from the sender, the receiver sends a +report if 'report-status' capability is in effect. +It is a short listing of what happened in that update. It will first +list the status of the packfile unpacking as either 'unpack ok' or +'unpack [error]'. Then it will list the status for each of the references +that it tried to update. Each line is either 'ok [refname]' if the +update was successful, or 'ng [refname] [error]' if the update was not. + +---- + report-status = unpack-status + 1*(command-status) + flush-pkt + + unpack-status = PKT-LINE("unpack" SP unpack-result) + unpack-result = "ok" / error-msg + + command-status = command-ok / command-fail + command-ok = PKT-LINE("ok" SP refname) + command-fail = PKT-LINE("ng" SP refname SP error-msg) + + error-msg = 1*(OCTECT) ; where not "ok" +---- + +Updates can be unsuccessful for a number of reasons. The reference can have +changed since the reference discovery phase was originally sent, meaning +someone pushed in the meantime. The reference being pushed could be a +non-fast-forward reference and the update hooks or configuration could be +set to not allow that, etc. Also, some references can be updated while others +can be rejected. + +An example client/server communication might look like this: + +---- + S: 006274730d410fcb6603ace96f1dc55ea6196122532d refs/heads/local\0report-status delete-refs ofs-delta\n + S: 003e7d1665144a3a975c05f1f43902ddaf084e784dbe refs/heads/debug\n + S: 003f74730d410fcb6603ace96f1dc55ea6196122532d refs/heads/master\n + S: 003d74730d410fcb6603ace96f1dc55ea6196122532d refs/heads/team\n + S: 0000 + + C: 00677d1665144a3a975c05f1f43902ddaf084e784dbe 74730d410fcb6603ace96f1dc55ea6196122532d refs/heads/debug\n + C: 006874730d410fcb6603ace96f1dc55ea6196122532d 5a3f6be755bbb7deae50065988cbfa1ffa9ab68a refs/heads/master\n + C: 0000 + C: [PACKDATA] + + S: 000eunpack ok\n + S: 0018ok refs/heads/debug\n + S: 002ang refs/heads/master non-fast-forward\n +---- diff --git a/Documentation/technical/partial-clone.txt b/Documentation/technical/partial-clone.txt new file mode 100644 index 000000000000..896c7b387886 --- /dev/null +++ b/Documentation/technical/partial-clone.txt @@ -0,0 +1,324 @@ +Partial Clone Design Notes +========================== + +The "Partial Clone" feature is a performance optimization for Git that +allows Git to function without having a complete copy of the repository. +The goal of this work is to allow Git better handle extremely large +repositories. + +During clone and fetch operations, Git downloads the complete contents +and history of the repository. This includes all commits, trees, and +blobs for the complete life of the repository. For extremely large +repositories, clones can take hours (or days) and consume 100+GiB of disk +space. + +Often in these repositories there are many blobs and trees that the user +does not need such as: + + 1. files outside of the user's work area in the tree. For example, in + a repository with 500K directories and 3.5M files in every commit, + we can avoid downloading many objects if the user only needs a + narrow "cone" of the source tree. + + 2. large binary assets. For example, in a repository where large build + artifacts are checked into the tree, we can avoid downloading all + previous versions of these non-mergeable binary assets and only + download versions that are actually referenced. + +Partial clone allows us to avoid downloading such unneeded objects *in +advance* during clone and fetch operations and thereby reduce download +times and disk usage. Missing objects can later be "demand fetched" +if/when needed. + +Use of partial clone requires that the user be online and the origin +remote be available for on-demand fetching of missing objects. This may +or may not be problematic for the user. For example, if the user can +stay within the pre-selected subset of the source tree, they may not +encounter any missing objects. Alternatively, the user could try to +pre-fetch various objects if they know that they are going offline. + + +Non-Goals +--------- + +Partial clone is a mechanism to limit the number of blobs and trees downloaded +*within* a given range of commits -- and is therefore independent of and not +intended to conflict with existing DAG-level mechanisms to limit the set of +requested commits (i.e. shallow clone, single branch, or fetch '<refspec>'). + + +Design Overview +--------------- + +Partial clone logically consists of the following parts: + +- A mechanism for the client to describe unneeded or unwanted objects to + the server. + +- A mechanism for the server to omit such unwanted objects from packfiles + sent to the client. + +- A mechanism for the client to gracefully handle missing objects (that + were previously omitted by the server). + +- A mechanism for the client to backfill missing objects as needed. + + +Design Details +-------------- + +- A new pack-protocol capability "filter" is added to the fetch-pack and + upload-pack negotiation. ++ +This uses the existing capability discovery mechanism. +See "filter" in Documentation/technical/pack-protocol.txt. + +- Clients pass a "filter-spec" to clone and fetch which is passed to the + server to request filtering during packfile construction. ++ +There are various filters available to accommodate different situations. +See "--filter=<filter-spec>" in Documentation/rev-list-options.txt. + +- On the server pack-objects applies the requested filter-spec as it + creates "filtered" packfiles for the client. ++ +These filtered packfiles are *incomplete* in the traditional sense because +they may contain objects that reference objects not contained in the +packfile and that the client doesn't already have. For example, the +filtered packfile may contain trees or tags that reference missing blobs +or commits that reference missing trees. + +- On the client these incomplete packfiles are marked as "promisor packfiles" + and treated differently by various commands. + +- On the client a repository extension is added to the local config to + prevent older versions of git from failing mid-operation because of + missing objects that they cannot handle. + See "extensions.partialClone" in Documentation/technical/repository-version.txt" + + +Handling Missing Objects +------------------------ + +- An object may be missing due to a partial clone or fetch, or missing due + to repository corruption. To differentiate these cases, the local + repository specially indicates such filtered packfiles obtained from the + promisor remote as "promisor packfiles". ++ +These promisor packfiles consist of a "<name>.promisor" file with +arbitrary contents (like the "<name>.keep" files), in addition to +their "<name>.pack" and "<name>.idx" files. + +- The local repository considers a "promisor object" to be an object that + it knows (to the best of its ability) that the promisor remote has promised + that it has, either because the local repository has that object in one of + its promisor packfiles, or because another promisor object refers to it. ++ +When Git encounters a missing object, Git can see if it is a promisor object +and handle it appropriately. If not, Git can report a corruption. ++ +This means that there is no need for the client to explicitly maintain an +expensive-to-modify list of missing objects.[a] + +- Since almost all Git code currently expects any referenced object to be + present locally and because we do not want to force every command to do + a dry-run first, a fallback mechanism is added to allow Git to attempt + to dynamically fetch missing objects from the promisor remote. ++ +When the normal object lookup fails to find an object, Git invokes +fetch-object to try to get the object from the server and then retry +the object lookup. This allows objects to be "faulted in" without +complicated prediction algorithms. ++ +For efficiency reasons, no check as to whether the missing object is +actually a promisor object is performed. ++ +Dynamic object fetching tends to be slow as objects are fetched one at +a time. + +- `checkout` (and any other command using `unpack-trees`) has been taught + to bulk pre-fetch all required missing blobs in a single batch. + +- `rev-list` has been taught to print missing objects. ++ +This can be used by other commands to bulk prefetch objects. +For example, a "git log -p A..B" may internally want to first do +something like "git rev-list --objects --quiet --missing=print A..B" +and prefetch those objects in bulk. + +- `fsck` has been updated to be fully aware of promisor objects. + +- `repack` in GC has been updated to not touch promisor packfiles at all, + and to only repack other objects. + +- The global variable "fetch_if_missing" is used to control whether an + object lookup will attempt to dynamically fetch a missing object or + report an error. ++ +We are not happy with this global variable and would like to remove it, +but that requires significant refactoring of the object code to pass an +additional flag. We hope that concurrent efforts to add an ODB API can +encompass this. + + +Fetching Missing Objects +------------------------ + +- Fetching of objects is done using the existing transport mechanism using + transport_fetch_refs(), setting a new transport option + TRANS_OPT_NO_DEPENDENTS to indicate that only the objects themselves are + desired, not any object that they refer to. ++ +Because some transports invoke fetch_pack() in the same process, fetch_pack() +has been updated to not use any object flags when the corresponding argument +(no_dependents) is set. + +- The local repository sends a request with the hashes of all requested + objects as "want" lines, and does not perform any packfile negotiation. + It then receives a packfile. + +- Because we are reusing the existing fetch-pack mechanism, fetching + currently fetches all objects referred to by the requested objects, even + though they are not necessary. + + +Current Limitations +------------------- + +- The remote used for a partial clone (or the first partial fetch + following a regular clone) is marked as the "promisor remote". ++ +We are currently limited to a single promisor remote and only that +remote may be used for subsequent partial fetches. ++ +We accept this limitation because we believe initial users of this +feature will be using it on repositories with a strong single central +server. + +- Dynamic object fetching will only ask the promisor remote for missing + objects. We assume that the promisor remote has a complete view of the + repository and can satisfy all such requests. + +- Repack essentially treats promisor and non-promisor packfiles as 2 + distinct partitions and does not mix them. Repack currently only works + on non-promisor packfiles and loose objects. + +- Dynamic object fetching invokes fetch-pack once *for each item* + because most algorithms stumble upon a missing object and need to have + it resolved before continuing their work. This may incur significant + overhead -- and multiple authentication requests -- if many objects are + needed. + +- Dynamic object fetching currently uses the existing pack protocol V0 + which means that each object is requested via fetch-pack. The server + will send a full set of info/refs when the connection is established. + If there are large number of refs, this may incur significant overhead. + + +Future Work +----------- + +- Allow more than one promisor remote and define a strategy for fetching + missing objects from specific promisor remotes or of iterating over the + set of promisor remotes until a missing object is found. ++ +A user might want to have multiple geographically-close cache servers +for fetching missing blobs while continuing to do filtered `git-fetch` +commands from the central server, for example. ++ +Or the user might want to work in a triangular work flow with multiple +promisor remotes that each have an incomplete view of the repository. + +- Allow repack to work on promisor packfiles (while keeping them distinct + from non-promisor packfiles). + +- Allow non-pathname-based filters to make use of packfile bitmaps (when + present). This was just an omission during the initial implementation. + +- Investigate use of a long-running process to dynamically fetch a series + of objects, such as proposed in [5,6] to reduce process startup and + overhead costs. ++ +It would be nice if pack protocol V2 could allow that long-running +process to make a series of requests over a single long-running +connection. + +- Investigate pack protocol V2 to avoid the info/refs broadcast on + each connection with the server to dynamically fetch missing objects. + +- Investigate the need to handle loose promisor objects. ++ +Objects in promisor packfiles are allowed to reference missing objects +that can be dynamically fetched from the server. An assumption was +made that loose objects are only created locally and therefore should +not reference a missing object. We may need to revisit that assumption +if, for example, we dynamically fetch a missing tree and store it as a +loose object rather than a single object packfile. ++ +This does not necessarily mean we need to mark loose objects as promisor; +it may be sufficient to relax the object lookup or is-promisor functions. + + +Non-Tasks +--------- + +- Every time the subject of "demand loading blobs" comes up it seems + that someone suggests that the server be allowed to "guess" and send + additional objects that may be related to the requested objects. ++ +No work has gone into actually doing that; we're just documenting that +it is a common suggestion. We're not sure how it would work and have +no plans to work on it. ++ +It is valid for the server to send more objects than requested (even +for a dynamic object fetch), but we are not building on that. + + +Footnotes +--------- + +[a] expensive-to-modify list of missing objects: Earlier in the design of + partial clone we discussed the need for a single list of missing objects. + This would essentially be a sorted linear list of OIDs that the were + omitted by the server during a clone or subsequent fetches. + +This file would need to be loaded into memory on every object lookup. +It would need to be read, updated, and re-written (like the .git/index) +on every explicit "git fetch" command *and* on any dynamic object fetch. + +The cost to read, update, and write this file could add significant +overhead to every command if there are many missing objects. For example, +if there are 100M missing blobs, this file would be at least 2GiB on disk. + +With the "promisor" concept, we *infer* a missing object based upon the +type of packfile that references it. + + +Related Links +------------- +[0] https://crbug.com/git/2 + Bug#2: Partial Clone + +[1] https://public-inbox.org/git/20170113155253.1644-1-benpeart@microsoft.com/ + + Subject: [RFC] Add support for downloading blobs on demand + + Date: Fri, 13 Jan 2017 10:52:53 -0500 + +[2] https://public-inbox.org/git/cover.1506714999.git.jonathantanmy@google.com/ + + Subject: [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches) + + Date: Fri, 29 Sep 2017 13:11:36 -0700 + +[3] https://public-inbox.org/git/20170426221346.25337-1-jonathantanmy@google.com/ + + Subject: Proposal for missing blob support in Git repos + + Date: Wed, 26 Apr 2017 15:13:46 -0700 + +[4] https://public-inbox.org/git/1488999039-37631-1-git-send-email-git@jeffhostetler.com/ + + Subject: [PATCH 00/10] RFC Partial Clone and Fetch + + Date: Wed, 8 Mar 2017 18:50:29 +0000 + +[5] https://public-inbox.org/git/20170505152802.6724-1-benpeart@microsoft.com/ + + Subject: [PATCH v7 00/10] refactor the filter process code into a reusable module + + Date: Fri, 5 May 2017 11:27:52 -0400 + +[6] https://public-inbox.org/git/20170714132651.170708-1-benpeart@microsoft.com/ + + Subject: [RFC/PATCH v2 0/1] Add support for downloading blobs on demand + + Date: Fri, 14 Jul 2017 09:26:50 -0400 diff --git a/Documentation/technical/protocol-capabilities.txt b/Documentation/technical/protocol-capabilities.txt new file mode 100644 index 000000000000..2b267c0da6b2 --- /dev/null +++ b/Documentation/technical/protocol-capabilities.txt @@ -0,0 +1,337 @@ +Git Protocol Capabilities +========================= + +NOTE: this document describes capabilities for versions 0 and 1 of the pack +protocol. For version 2, please refer to the link:protocol-v2.html[protocol-v2] +doc. + +Servers SHOULD support all capabilities defined in this document. + +On the very first line of the initial server response of either +receive-pack and upload-pack the first reference is followed by +a NUL byte and then a list of space delimited server capabilities. +These allow the server to declare what it can and cannot support +to the client. + +Client will then send a space separated list of capabilities it wants +to be in effect. The client MUST NOT ask for capabilities the server +did not say it supports. + +Server MUST diagnose and abort if capabilities it does not understand +was sent. Server MUST NOT ignore capabilities that client requested +and server advertised. As a consequence of these rules, server MUST +NOT advertise capabilities it does not understand. + +The 'atomic', 'report-status', 'delete-refs', 'quiet', and 'push-cert' +capabilities are sent and recognized by the receive-pack (push to server) +process. + +The 'ofs-delta' and 'side-band-64k' capabilities are sent and recognized +by both upload-pack and receive-pack protocols. The 'agent' capability +may optionally be sent in both protocols. + +All other capabilities are only recognized by the upload-pack (fetch +from server) process. + +multi_ack +--------- + +The 'multi_ack' capability allows the server to return "ACK obj-id +continue" as soon as it finds a commit that it can use as a common +base, between the client's wants and the client's have set. + +By sending this early, the server can potentially head off the client +from walking any further down that particular branch of the client's +repository history. The client may still need to walk down other +branches, sending have lines for those, until the server has a +complete cut across the DAG, or the client has said "done". + +Without multi_ack, a client sends have lines in --date-order until +the server has found a common base. That means the client will send +have lines that are already known by the server to be common, because +they overlap in time with another branch that the server hasn't found +a common base on yet. + +For example suppose the client has commits in caps that the server +doesn't and the server has commits in lower case that the client +doesn't, as in the following diagram: + + +---- u ---------------------- x + / +----- y + / / + a -- b -- c -- d -- E -- F + \ + +--- Q -- R -- S + +If the client wants x,y and starts out by saying have F,S, the server +doesn't know what F,S is. Eventually the client says "have d" and +the server sends "ACK d continue" to let the client know to stop +walking down that line (so don't send c-b-a), but it's not done yet, +it needs a base for x. The client keeps going with S-R-Q, until a +gets reached, at which point the server has a clear base and it all +ends. + +Without multi_ack the client would have sent that c-b-a chain anyway, +interleaved with S-R-Q. + +multi_ack_detailed +------------------ +This is an extension of multi_ack that permits client to better +understand the server's in-memory state. See pack-protocol.txt, +section "Packfile Negotiation" for more information. + +no-done +------- +This capability should only be used with the smart HTTP protocol. If +multi_ack_detailed and no-done are both present, then the sender is +free to immediately send a pack following its first "ACK obj-id ready" +message. + +Without no-done in the smart HTTP protocol, the server session would +end and the client has to make another trip to send "done" before +the server can send the pack. no-done removes the last round and +thus slightly reduces latency. + +thin-pack +--------- + +A thin pack is one with deltas which reference base objects not +contained within the pack (but are known to exist at the receiving +end). This can reduce the network traffic significantly, but it +requires the receiving end to know how to "thicken" these packs by +adding the missing bases to the pack. + +The upload-pack server advertises 'thin-pack' when it can generate +and send a thin pack. A client requests the 'thin-pack' capability +when it understands how to "thicken" it, notifying the server that +it can receive such a pack. A client MUST NOT request the +'thin-pack' capability if it cannot turn a thin pack into a +self-contained pack. + +Receive-pack, on the other hand, is assumed by default to be able to +handle thin packs, but can ask the client not to use the feature by +advertising the 'no-thin' capability. A client MUST NOT send a thin +pack if the server advertises the 'no-thin' capability. + +The reasons for this asymmetry are historical. The receive-pack +program did not exist until after the invention of thin packs, so +historically the reference implementation of receive-pack always +understood thin packs. Adding 'no-thin' later allowed receive-pack +to disable the feature in a backwards-compatible manner. + + +side-band, side-band-64k +------------------------ + +This capability means that server can send, and client understand multiplexed +progress reports and error info interleaved with the packfile itself. + +These two options are mutually exclusive. A modern client always +favors 'side-band-64k'. + +Either mode indicates that the packfile data will be streamed broken +up into packets of up to either 1000 bytes in the case of 'side_band', +or 65520 bytes in the case of 'side_band_64k'. Each packet is made up +of a leading 4-byte pkt-line length of how much data is in the packet, +followed by a 1-byte stream code, followed by the actual data. + +The stream code can be one of: + + 1 - pack data + 2 - progress messages + 3 - fatal error message just before stream aborts + +The "side-band-64k" capability came about as a way for newer clients +that can handle much larger packets to request packets that are +actually crammed nearly full, while maintaining backward compatibility +for the older clients. + +Further, with side-band and its up to 1000-byte messages, it's actually +999 bytes of payload and 1 byte for the stream code. With side-band-64k, +same deal, you have up to 65519 bytes of data and 1 byte for the stream +code. + +The client MUST send only maximum of one of "side-band" and "side- +band-64k". Server MUST diagnose it as an error if client requests +both. + +ofs-delta +--------- + +Server can send, and client understand PACKv2 with delta referring to +its base by position in pack rather than by an obj-id. That is, they can +send/read OBJ_OFS_DELTA (aka type 6) in a packfile. + +agent +----- + +The server may optionally send a capability of the form `agent=X` to +notify the client that the server is running version `X`. The client may +optionally return its own agent string by responding with an `agent=Y` +capability (but it MUST NOT do so if the server did not mention the +agent capability). The `X` and `Y` strings may contain any printable +ASCII characters except space (i.e., the byte range 32 < x < 127), and +are typically of the form "package/version" (e.g., "git/1.8.3.1"). The +agent strings are purely informative for statistics and debugging +purposes, and MUST NOT be used to programmatically assume the presence +or absence of particular features. + +symref +------ + +This parameterized capability is used to inform the receiver which symbolic ref +points to which ref; for example, "symref=HEAD:refs/heads/master" tells the +receiver that HEAD points to master. This capability can be repeated to +represent multiple symrefs. + +Servers SHOULD include this capability for the HEAD symref if it is one of the +refs being sent. + +Clients MAY use the parameters from this capability to select the proper initial +branch when cloning a repository. + +shallow +------- + +This capability adds "deepen", "shallow" and "unshallow" commands to +the fetch-pack/upload-pack protocol so clients can request shallow +clones. + +deepen-since +------------ + +This capability adds "deepen-since" command to fetch-pack/upload-pack +protocol so the client can request shallow clones that are cut at a +specific time, instead of depth. Internally it's equivalent of doing +"rev-list --max-age=<timestamp>" on the server side. "deepen-since" +cannot be used with "deepen". + +deepen-not +---------- + +This capability adds "deepen-not" command to fetch-pack/upload-pack +protocol so the client can request shallow clones that are cut at a +specific revision, instead of depth. Internally it's equivalent of +doing "rev-list --not <rev>" on the server side. "deepen-not" +cannot be used with "deepen", but can be used with "deepen-since". + +deepen-relative +--------------- + +If this capability is requested by the client, the semantics of +"deepen" command is changed. The "depth" argument is the depth from +the current shallow boundary, instead of the depth from remote refs. + +no-progress +----------- + +The client was started with "git clone -q" or something, and doesn't +want that side band 2. Basically the client just says "I do not +wish to receive stream 2 on sideband, so do not send it to me, and if +you did, I will drop it on the floor anyway". However, the sideband +channel 3 is still used for error responses. + +include-tag +----------- + +The 'include-tag' capability is about sending annotated tags if we are +sending objects they point to. If we pack an object to the client, and +a tag object points exactly at that object, we pack the tag object too. +In general this allows a client to get all new annotated tags when it +fetches a branch, in a single network connection. + +Clients MAY always send include-tag, hardcoding it into a request when +the server advertises this capability. The decision for a client to +request include-tag only has to do with the client's desires for tag +data, whether or not a server had advertised objects in the +refs/tags/* namespace. + +Servers MUST pack the tags if their referrant is packed and the client +has requested include-tags. + +Clients MUST be prepared for the case where a server has ignored +include-tag and has not actually sent tags in the pack. In such +cases the client SHOULD issue a subsequent fetch to acquire the tags +that include-tag would have otherwise given the client. + +The server SHOULD send include-tag, if it supports it, regardless +of whether or not there are tags available. + +report-status +------------- + +The receive-pack process can receive a 'report-status' capability, +which tells it that the client wants a report of what happened after +a packfile upload and reference update. If the pushing client requests +this capability, after unpacking and updating references the server +will respond with whether the packfile unpacked successfully and if +each reference was updated successfully. If any of those were not +successful, it will send back an error message. See pack-protocol.txt +for example messages. + +delete-refs +----------- + +If the server sends back the 'delete-refs' capability, it means that +it is capable of accepting a zero-id value as the target +value of a reference update. It is not sent back by the client, it +simply informs the client that it can be sent zero-id values +to delete references. + +quiet +----- + +If the receive-pack server advertises the 'quiet' capability, it is +capable of silencing human-readable progress output which otherwise may +be shown when processing the received pack. A send-pack client should +respond with the 'quiet' capability to suppress server-side progress +reporting if the local progress reporting is also being suppressed +(e.g., via `push -q`, or if stderr does not go to a tty). + +atomic +------ + +If the server sends the 'atomic' capability it is capable of accepting +atomic pushes. If the pushing client requests this capability, the server +will update the refs in one atomic transaction. Either all refs are +updated or none. + +push-options +------------ + +If the server sends the 'push-options' capability it is able to accept +push options after the update commands have been sent, but before the +packfile is streamed. If the pushing client requests this capability, +the server will pass the options to the pre- and post- receive hooks +that process this push request. + +allow-tip-sha1-in-want +---------------------- + +If the upload-pack server advertises this capability, fetch-pack may +send "want" lines with SHA-1s that exist at the server but are not +advertised by upload-pack. + +allow-reachable-sha1-in-want +---------------------------- + +If the upload-pack server advertises this capability, fetch-pack may +send "want" lines with SHA-1s that exist at the server but are not +advertised by upload-pack. + +push-cert=<nonce> +----------------- + +The receive-pack server that advertises this capability is willing +to accept a signed push certificate, and asks the <nonce> to be +included in the push certificate. A send-pack client MUST NOT +send a push-cert packet unless the receive-pack server advertises +this capability. + +filter +------ + +If the upload-pack server advertises the 'filter' capability, +fetch-pack may send "filter" commands to request a partial clone +or partial fetch and request that the server omit various objects +from the packfile. diff --git a/Documentation/technical/protocol-common.txt b/Documentation/technical/protocol-common.txt new file mode 100644 index 000000000000..ecedb34bba54 --- /dev/null +++ b/Documentation/technical/protocol-common.txt @@ -0,0 +1,99 @@ +Documentation Common to Pack and Http Protocols +=============================================== + +ABNF Notation +------------- + +ABNF notation as described by RFC 5234 is used within the protocol documents, +except the following replacement core rules are used: +---- + HEXDIG = DIGIT / "a" / "b" / "c" / "d" / "e" / "f" +---- + +We also define the following common rules: +---- + NUL = %x00 + zero-id = 40*"0" + obj-id = 40*(HEXDIGIT) + + refname = "HEAD" + refname /= "refs/" <see discussion below> +---- + +A refname is a hierarchical octet string beginning with "refs/" and +not violating the 'git-check-ref-format' command's validation rules. +More specifically, they: + +. They can include slash `/` for hierarchical (directory) + grouping, but no slash-separated component can begin with a + dot `.`. + +. They must contain at least one `/`. This enforces the presence of a + category like `heads/`, `tags/` etc. but the actual names are not + restricted. + +. They cannot have two consecutive dots `..` anywhere. + +. They cannot have ASCII control characters (i.e. bytes whose + values are lower than \040, or \177 `DEL`), space, tilde `~`, + caret `^`, colon `:`, question-mark `?`, asterisk `*`, + or open bracket `[` anywhere. + +. They cannot end with a slash `/` or a dot `.`. + +. They cannot end with the sequence `.lock`. + +. They cannot contain a sequence `@{`. + +. They cannot contain a `\\`. + + +pkt-line Format +--------------- + +Much (but not all) of the payload is described around pkt-lines. + +A pkt-line is a variable length binary string. The first four bytes +of the line, the pkt-len, indicates the total length of the line, +in hexadecimal. The pkt-len includes the 4 bytes used to contain +the length's hexadecimal representation. + +A pkt-line MAY contain binary data, so implementors MUST ensure +pkt-line parsing/formatting routines are 8-bit clean. + +A non-binary line SHOULD BE terminated by an LF, which if present +MUST be included in the total length. Receivers MUST treat pkt-lines +with non-binary data the same whether or not they contain the trailing +LF (stripping the LF if present, and not complaining when it is +missing). + +The maximum length of a pkt-line's data component is 65516 bytes. +Implementations MUST NOT send pkt-line whose length exceeds 65520 +(65516 bytes of payload + 4 bytes of length data). + +Implementations SHOULD NOT send an empty pkt-line ("0004"). + +A pkt-line with a length field of 0 ("0000"), called a flush-pkt, +is a special case and MUST be handled differently than an empty +pkt-line ("0004"). + +---- + pkt-line = data-pkt / flush-pkt + + data-pkt = pkt-len pkt-payload + pkt-len = 4*(HEXDIG) + pkt-payload = (pkt-len - 4)*(OCTET) + + flush-pkt = "0000" +---- + +Examples (as C-style strings): + +---- + pkt-line actual value + --------------------------------- + "0006a\n" "a\n" + "0005a" "a" + "000bfoobar\n" "foobar\n" + "0004" "" +---- diff --git a/Documentation/technical/protocol-v2.txt b/Documentation/technical/protocol-v2.txt new file mode 100644 index 000000000000..40f91f6b1ee1 --- /dev/null +++ b/Documentation/technical/protocol-v2.txt @@ -0,0 +1,455 @@ +Git Wire Protocol, Version 2 +============================ + +This document presents a specification for a version 2 of Git's wire +protocol. Protocol v2 will improve upon v1 in the following ways: + + * Instead of multiple service names, multiple commands will be + supported by a single service + * Easily extendable as capabilities are moved into their own section + of the protocol, no longer being hidden behind a NUL byte and + limited by the size of a pkt-line + * Separate out other information hidden behind NUL bytes (e.g. agent + string as a capability and symrefs can be requested using 'ls-refs') + * Reference advertisement will be omitted unless explicitly requested + * ls-refs command to explicitly request some refs + * Designed with http and stateless-rpc in mind. With clear flush + semantics the http remote helper can simply act as a proxy + +In protocol v2 communication is command oriented. When first contacting a +server a list of capabilities will advertised. Some of these capabilities +will be commands which a client can request be executed. Once a command +has completed, a client can reuse the connection and request that other +commands be executed. + +Packet-Line Framing +------------------- + +All communication is done using packet-line framing, just as in v1. See +`Documentation/technical/pack-protocol.txt` and +`Documentation/technical/protocol-common.txt` for more information. + +In protocol v2 these special packets will have the following semantics: + + * '0000' Flush Packet (flush-pkt) - indicates the end of a message + * '0001' Delimiter Packet (delim-pkt) - separates sections of a message + +Initial Client Request +---------------------- + +In general a client can request to speak protocol v2 by sending +`version=2` through the respective side-channel for the transport being +used which inevitably sets `GIT_PROTOCOL`. More information can be +found in `pack-protocol.txt` and `http-protocol.txt`. In all cases the +response from the server is the capability advertisement. + +Git Transport +~~~~~~~~~~~~~ + +When using the git:// transport, you can request to use protocol v2 by +sending "version=2" as an extra parameter: + + 003egit-upload-pack /project.git\0host=myserver.com\0\0version=2\0 + +SSH and File Transport +~~~~~~~~~~~~~~~~~~~~~~ + +When using either the ssh:// or file:// transport, the GIT_PROTOCOL +environment variable must be set explicitly to include "version=2". + +HTTP Transport +~~~~~~~~~~~~~~ + +When using the http:// or https:// transport a client makes a "smart" +info/refs request as described in `http-protocol.txt` and requests that +v2 be used by supplying "version=2" in the `Git-Protocol` header. + + C: GET $GIT_URL/info/refs?service=git-upload-pack HTTP/1.0 + C: Git-Protocol: version=2 + +A v2 server would reply: + + S: 200 OK + S: <Some headers> + S: ... + S: + S: 000eversion 2\n + S: <capability-advertisement> + +Subsequent requests are then made directly to the service +`$GIT_URL/git-upload-pack`. (This works the same for git-receive-pack). + +Capability Advertisement +------------------------ + +A server which decides to communicate (based on a request from a client) +using protocol version 2, notifies the client by sending a version string +in its initial response followed by an advertisement of its capabilities. +Each capability is a key with an optional value. Clients must ignore all +unknown keys. Semantics of unknown values are left to the definition of +each key. Some capabilities will describe commands which can be requested +to be executed by the client. + + capability-advertisement = protocol-version + capability-list + flush-pkt + + protocol-version = PKT-LINE("version 2" LF) + capability-list = *capability + capability = PKT-LINE(key[=value] LF) + + key = 1*(ALPHA | DIGIT | "-_") + value = 1*(ALPHA | DIGIT | " -_.,?\/{}[]()<>!@#$%^&*+=:;") + +Command Request +--------------- + +After receiving the capability advertisement, a client can then issue a +request to select the command it wants with any particular capabilities +or arguments. There is then an optional section where the client can +provide any command specific parameters or queries. Only a single +command can be requested at a time. + + request = empty-request | command-request + empty-request = flush-pkt + command-request = command + capability-list + [command-args] + flush-pkt + command = PKT-LINE("command=" key LF) + command-args = delim-pkt + *command-specific-arg + + command-specific-args are packet line framed arguments defined by + each individual command. + +The server will then check to ensure that the client's request is +comprised of a valid command as well as valid capabilities which were +advertised. If the request is valid the server will then execute the +command. A server MUST wait till it has received the client's entire +request before issuing a response. The format of the response is +determined by the command being executed, but in all cases a flush-pkt +indicates the end of the response. + +When a command has finished, and the client has received the entire +response from the server, a client can either request that another +command be executed or can terminate the connection. A client may +optionally send an empty request consisting of just a flush-pkt to +indicate that no more requests will be made. + +Capabilities +------------ + +There are two different types of capabilities: normal capabilities, +which can be used to convey information or alter the behavior of a +request, and commands, which are the core actions that a client wants to +perform (fetch, push, etc). + +Protocol version 2 is stateless by default. This means that all commands +must only last a single round and be stateless from the perspective of the +server side, unless the client has requested a capability indicating that +state should be maintained by the server. Clients MUST NOT require state +management on the server side in order to function correctly. This +permits simple round-robin load-balancing on the server side, without +needing to worry about state management. + +agent +~~~~~ + +The server can advertise the `agent` capability with a value `X` (in the +form `agent=X`) to notify the client that the server is running version +`X`. The client may optionally send its own agent string by including +the `agent` capability with a value `Y` (in the form `agent=Y`) in its +request to the server (but it MUST NOT do so if the server did not +advertise the agent capability). The `X` and `Y` strings may contain any +printable ASCII characters except space (i.e., the byte range 32 < x < +127), and are typically of the form "package/version" (e.g., +"git/1.8.3.1"). The agent strings are purely informative for statistics +and debugging purposes, and MUST NOT be used to programmatically assume +the presence or absence of particular features. + +ls-refs +~~~~~~~ + +`ls-refs` is the command used to request a reference advertisement in v2. +Unlike the current reference advertisement, ls-refs takes in arguments +which can be used to limit the refs sent from the server. + +Additional features not supported in the base command will be advertised +as the value of the command in the capability advertisement in the form +of a space separated list of features: "<command>=<feature 1> <feature 2>" + +ls-refs takes in the following arguments: + + symrefs + In addition to the object pointed by it, show the underlying ref + pointed by it when showing a symbolic ref. + peel + Show peeled tags. + ref-prefix <prefix> + When specified, only references having a prefix matching one of + the provided prefixes are displayed. + +The output of ls-refs is as follows: + + output = *ref + flush-pkt + ref = PKT-LINE(obj-id SP refname *(SP ref-attribute) LF) + ref-attribute = (symref | peeled) + symref = "symref-target:" symref-target + peeled = "peeled:" obj-id + +fetch +~~~~~ + +`fetch` is the command used to fetch a packfile in v2. It can be looked +at as a modified version of the v1 fetch where the ref-advertisement is +stripped out (since the `ls-refs` command fills that role) and the +message format is tweaked to eliminate redundancies and permit easy +addition of future extensions. + +Additional features not supported in the base command will be advertised +as the value of the command in the capability advertisement in the form +of a space separated list of features: "<command>=<feature 1> <feature 2>" + +A `fetch` request can take the following arguments: + + want <oid> + Indicates to the server an object which the client wants to + retrieve. Wants can be anything and are not limited to + advertised objects. + + have <oid> + Indicates to the server an object which the client has locally. + This allows the server to make a packfile which only contains + the objects that the client needs. Multiple 'have' lines can be + supplied. + + done + Indicates to the server that negotiation should terminate (or + not even begin if performing a clone) and that the server should + use the information supplied in the request to construct the + packfile. + + thin-pack + Request that a thin pack be sent, which is a pack with deltas + which reference base objects not contained within the pack (but + are known to exist at the receiving end). This can reduce the + network traffic significantly, but it requires the receiving end + to know how to "thicken" these packs by adding the missing bases + to the pack. + + no-progress + Request that progress information that would normally be sent on + side-band channel 2, during the packfile transfer, should not be + sent. However, the side-band channel 3 is still used for error + responses. + + include-tag + Request that annotated tags should be sent if the objects they + point to are being sent. + + ofs-delta + Indicate that the client understands PACKv2 with delta referring + to its base by position in pack rather than by an oid. That is, + they can read OBJ_OFS_DELTA (ake type 6) in a packfile. + +If the 'shallow' feature is advertised the following arguments can be +included in the clients request as well as the potential addition of the +'shallow-info' section in the server's response as explained below. + + shallow <oid> + A client must notify the server of all commits for which it only + has shallow copies (meaning that it doesn't have the parents of + a commit) by supplying a 'shallow <oid>' line for each such + object so that the server is aware of the limitations of the + client's history. This is so that the server is aware that the + client may not have all objects reachable from such commits. + + deepen <depth> + Requests that the fetch/clone should be shallow having a commit + depth of <depth> relative to the remote side. + + deepen-relative + Requests that the semantics of the "deepen" command be changed + to indicate that the depth requested is relative to the client's + current shallow boundary, instead of relative to the requested + commits. + + deepen-since <timestamp> + Requests that the shallow clone/fetch should be cut at a + specific time, instead of depth. Internally it's equivalent to + doing "git rev-list --max-age=<timestamp>". Cannot be used with + "deepen". + + deepen-not <rev> + Requests that the shallow clone/fetch should be cut at a + specific revision specified by '<rev>', instead of a depth. + Internally it's equivalent of doing "git rev-list --not <rev>". + Cannot be used with "deepen", but can be used with + "deepen-since". + +If the 'filter' feature is advertised, the following argument can be +included in the client's request: + + filter <filter-spec> + Request that various objects from the packfile be omitted + using one of several filtering techniques. These are intended + for use with partial clone and partial fetch operations. See + `rev-list` for possible "filter-spec" values. When communicating + with other processes, senders SHOULD translate scaled integers + (e.g. "1k") into a fully-expanded form (e.g. "1024") to aid + interoperability with older receivers that may not understand + newly-invented scaling suffixes. However, receivers SHOULD + accept the following suffixes: 'k', 'm', and 'g' for 1024, + 1048576, and 1073741824, respectively. + +If the 'ref-in-want' feature is advertised, the following argument can +be included in the client's request as well as the potential addition of +the 'wanted-refs' section in the server's response as explained below. + + want-ref <ref> + Indicates to the server that the client wants to retrieve a + particular ref, where <ref> is the full name of a ref on the + server. + +If the 'sideband-all' feature is advertised, the following argument can be +included in the client's request: + + sideband-all + Instruct the server to send the whole response multiplexed, not just + the packfile section. All non-flush and non-delim PKT-LINE in the + response (not only in the packfile section) will then start with a byte + indicating its sideband (1, 2, or 3), and the server may send "0005\2" + (a PKT-LINE of sideband 2 with no payload) as a keepalive packet. + +The response of `fetch` is broken into a number of sections separated by +delimiter packets (0001), with each section beginning with its section +header. + + output = *section + section = (acknowledgments | shallow-info | wanted-refs | packfile) + (flush-pkt | delim-pkt) + + acknowledgments = PKT-LINE("acknowledgments" LF) + (nak | *ack) + (ready) + ready = PKT-LINE("ready" LF) + nak = PKT-LINE("NAK" LF) + ack = PKT-LINE("ACK" SP obj-id LF) + + shallow-info = PKT-LINE("shallow-info" LF) + *PKT-LINE((shallow | unshallow) LF) + shallow = "shallow" SP obj-id + unshallow = "unshallow" SP obj-id + + wanted-refs = PKT-LINE("wanted-refs" LF) + *PKT-LINE(wanted-ref LF) + wanted-ref = obj-id SP refname + + packfile = PKT-LINE("packfile" LF) + *PKT-LINE(%x01-03 *%x00-ff) + + acknowledgments section + * If the client determines that it is finished with negotiations + by sending a "done" line, the acknowledgments sections MUST be + omitted from the server's response. + + * Always begins with the section header "acknowledgments" + + * The server will respond with "NAK" if none of the object ids sent + as have lines were common. + + * The server will respond with "ACK obj-id" for all of the + object ids sent as have lines which are common. + + * A response cannot have both "ACK" lines as well as a "NAK" + line. + + * The server will respond with a "ready" line indicating that + the server has found an acceptable common base and is ready to + make and send a packfile (which will be found in the packfile + section of the same response) + + * If the server has found a suitable cut point and has decided + to send a "ready" line, then the server can decide to (as an + optimization) omit any "ACK" lines it would have sent during + its response. This is because the server will have already + determined the objects it plans to send to the client and no + further negotiation is needed. + + shallow-info section + * If the client has requested a shallow fetch/clone, a shallow + client requests a fetch or the server is shallow then the + server's response may include a shallow-info section. The + shallow-info section will be included if (due to one of the + above conditions) the server needs to inform the client of any + shallow boundaries or adjustments to the clients already + existing shallow boundaries. + + * Always begins with the section header "shallow-info" + + * If a positive depth is requested, the server will compute the + set of commits which are no deeper than the desired depth. + + * The server sends a "shallow obj-id" line for each commit whose + parents will not be sent in the following packfile. + + * The server sends an "unshallow obj-id" line for each commit + which the client has indicated is shallow, but is no longer + shallow as a result of the fetch (due to its parents being + sent in the following packfile). + + * The server MUST NOT send any "unshallow" lines for anything + which the client has not indicated was shallow as a part of + its request. + + * This section is only included if a packfile section is also + included in the response. + + wanted-refs section + * This section is only included if the client has requested a + ref using a 'want-ref' line and if a packfile section is also + included in the response. + + * Always begins with the section header "wanted-refs". + + * The server will send a ref listing ("<oid> <refname>") for + each reference requested using 'want-ref' lines. + + * The server MUST NOT send any refs which were not requested + using 'want-ref' lines. + + packfile section + * This section is only included if the client has sent 'want' + lines in its request and either requested that no more + negotiation be done by sending 'done' or if the server has + decided it has found a sufficient cut point to produce a + packfile. + + * Always begins with the section header "packfile" + + * The transmission of the packfile begins immediately after the + section header + + * The data transfer of the packfile is always multiplexed, using + the same semantics of the 'side-band-64k' capability from + protocol version 1. This means that each packet, during the + packfile data stream, is made up of a leading 4-byte pkt-line + length (typical of the pkt-line format), followed by a 1-byte + stream code, followed by the actual data. + + The stream code can be one of: + 1 - pack data + 2 - progress messages + 3 - fatal error message just before stream aborts + +server-option +~~~~~~~~~~~~~ + +If advertised, indicates that any number of server specific options can be +included in a request. This is done by sending each option as a +"server-option=<option>" capability line in the capability-list section of +a request. + +The provided options must not contain a NUL or LF character. diff --git a/Documentation/technical/racy-git.txt b/Documentation/technical/racy-git.txt new file mode 100644 index 000000000000..4a8be4d144cf --- /dev/null +++ b/Documentation/technical/racy-git.txt @@ -0,0 +1,201 @@ +Use of index and Racy Git problem +================================= + +Background +---------- + +The index is one of the most important data structures in Git. +It represents a virtual working tree state by recording list of +paths and their object names and serves as a staging area to +write out the next tree object to be committed. The state is +"virtual" in the sense that it does not necessarily have to, and +often does not, match the files in the working tree. + +There are cases Git needs to examine the differences between the +virtual working tree state in the index and the files in the +working tree. The most obvious case is when the user asks `git +diff` (or its low level implementation, `git diff-files`) or +`git-ls-files --modified`. In addition, Git internally checks +if the files in the working tree are different from what are +recorded in the index to avoid stomping on local changes in them +during patch application, switching branches, and merging. + +In order to speed up this comparison between the files in the +working tree and the index entries, the index entries record the +information obtained from the filesystem via `lstat(2)` system +call when they were last updated. When checking if they differ, +Git first runs `lstat(2)` on the files and compares the result +with this information (this is what was originally done by the +`ce_match_stat()` function, but the current code does it in +`ce_match_stat_basic()` function). If some of these "cached +stat information" fields do not match, Git can tell that the +files are modified without even looking at their contents. + +Note: not all members in `struct stat` obtained via `lstat(2)` +are used for this comparison. For example, `st_atime` obviously +is not useful. Currently, Git compares the file type (regular +files vs symbolic links) and executable bits (only for regular +files) from `st_mode` member, `st_mtime` and `st_ctime` +timestamps, `st_uid`, `st_gid`, `st_ino`, and `st_size` members. +With a `USE_STDEV` compile-time option, `st_dev` is also +compared, but this is not enabled by default because this member +is not stable on network filesystems. With `USE_NSEC` +compile-time option, `st_mtim.tv_nsec` and `st_ctim.tv_nsec` +members are also compared. On Linux, this is not enabled by default +because in-core timestamps can have finer granularity than +on-disk timestamps, resulting in meaningless changes when an +inode is evicted from the inode cache. See commit 8ce13b0 +of git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git +([PATCH] Sync in core time granularity with filesystems, +2005-01-04). This patch is included in kernel 2.6.11 and newer, but +only fixes the issue for file systems with exactly 1 ns or 1 s +resolution. Other file systems are still broken in current Linux +kernels (e.g. CEPH, CIFS, NTFS, UDF), see +https://lkml.org/lkml/2015/6/9/714 + +Racy Git +-------- + +There is one slight problem with the optimization based on the +cached stat information. Consider this sequence: + + : modify 'foo' + $ git update-index 'foo' + : modify 'foo' again, in-place, without changing its size + +The first `update-index` computes the object name of the +contents of file `foo` and updates the index entry for `foo` +along with the `struct stat` information. If the modification +that follows it happens very fast so that the file's `st_mtime` +timestamp does not change, after this sequence, the cached stat +information the index entry records still exactly match what you +would see in the filesystem, even though the file `foo` is now +different. +This way, Git can incorrectly think files in the working tree +are unmodified even though they actually are. This is called +the "racy Git" problem (discovered by Pasky), and the entries +that appear clean when they may not be because of this problem +are called "racily clean". + +To avoid this problem, Git does two things: + +. When the cached stat information says the file has not been + modified, and the `st_mtime` is the same as (or newer than) + the timestamp of the index file itself (which is the time `git + update-index foo` finished running in the above example), it + also compares the contents with the object registered in the + index entry to make sure they match. + +. When the index file is updated that contains racily clean + entries, cached `st_size` information is truncated to zero + before writing a new version of the index file. + +Because the index file itself is written after collecting all +the stat information from updated paths, `st_mtime` timestamp of +it is usually the same as or newer than any of the paths the +index contains. And no matter how quick the modification that +follows `git update-index foo` finishes, the resulting +`st_mtime` timestamp on `foo` cannot get a value earlier +than the index file. Therefore, index entries that can be +racily clean are limited to the ones that have the same +timestamp as the index file itself. + +The callers that want to check if an index entry matches the +corresponding file in the working tree continue to call +`ce_match_stat()`, but with this change, `ce_match_stat()` uses +`ce_modified_check_fs()` to see if racily clean ones are +actually clean after comparing the cached stat information using +`ce_match_stat_basic()`. + +The problem the latter solves is this sequence: + + $ git update-index 'foo' + : modify 'foo' in-place without changing its size + : wait for enough time + $ git update-index 'bar' + +Without the latter, the timestamp of the index file gets a newer +value, and falsely clean entry `foo` would not be caught by the +timestamp comparison check done with the former logic anymore. +The latter makes sure that the cached stat information for `foo` +would never match with the file in the working tree, so later +checks by `ce_match_stat_basic()` would report that the index entry +does not match the file and Git does not have to fall back on more +expensive `ce_modified_check_fs()`. + + +Runtime penalty +--------------- + +The runtime penalty of falling back to `ce_modified_check_fs()` +from `ce_match_stat()` can be very expensive when there are many +racily clean entries. An obvious way to artificially create +this situation is to give the same timestamp to all the files in +the working tree in a large project, run `git update-index` on +them, and give the same timestamp to the index file: + + $ date >.datestamp + $ git ls-files | xargs touch -r .datestamp + $ git ls-files | git update-index --stdin + $ touch -r .datestamp .git/index + +This will make all index entries racily clean. The linux project, for +example, there are over 20,000 files in the working tree. On my +Athlon 64 X2 3800+, after the above: + + $ /usr/bin/time git diff-files + 1.68user 0.54system 0:02.22elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k + 0inputs+0outputs (0major+67111minor)pagefaults 0swaps + $ git update-index MAINTAINERS + $ /usr/bin/time git diff-files + 0.02user 0.12system 0:00.14elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k + 0inputs+0outputs (0major+935minor)pagefaults 0swaps + +Running `git update-index` in the middle checked the racily +clean entries, and left the cached `st_mtime` for all the paths +intact because they were actually clean (so this step took about +the same amount of time as the first `git diff-files`). After +that, they are not racily clean anymore but are truly clean, so +the second invocation of `git diff-files` fully took advantage +of the cached stat information. + + +Avoiding runtime penalty +------------------------ + +In order to avoid the above runtime penalty, post 1.4.2 Git used +to have a code that made sure the index file +got timestamp newer than the youngest files in the index when +there are many young files with the same timestamp as the +resulting index file would otherwise would have by waiting +before finishing writing the index file out. + +I suspected that in practice the situation where many paths in the +index are all racily clean was quite rare. The only code paths +that can record recent timestamp for large number of paths are: + +. Initial `git add .` of a large project. + +. `git checkout` of a large project from an empty index into an + unpopulated working tree. + +Note: switching branches with `git checkout` keeps the cached +stat information of existing working tree files that are the +same between the current branch and the new branch, which are +all older than the resulting index file, and they will not +become racily clean. Only the files that are actually checked +out can become racily clean. + +In a large project where raciness avoidance cost really matters, +however, the initial computation of all object names in the +index takes more than one second, and the index file is written +out after all that happens. Therefore the timestamp of the +index file will be more than one seconds later than the +youngest file in the working tree. This means that in these +cases there actually will not be any racily clean entry in +the resulting index. + +Based on this discussion, the current code does not use the +"workaround" to avoid the runtime penalty that does not exist in +practice anymore. This was done with commit 0fc82cff on Aug 15, +2006. diff --git a/Documentation/technical/repository-version.txt b/Documentation/technical/repository-version.txt new file mode 100644 index 000000000000..7844ef30ffde --- /dev/null +++ b/Documentation/technical/repository-version.txt @@ -0,0 +1,102 @@ +== Git Repository Format Versions + +Every git repository is marked with a numeric version in the +`core.repositoryformatversion` key of its `config` file. This version +specifies the rules for operating on the on-disk repository data. An +implementation of git which does not understand a particular version +advertised by an on-disk repository MUST NOT operate on that repository; +doing so risks not only producing wrong results, but actually losing +data. + +Because of this rule, version bumps should be kept to an absolute +minimum. Instead, we generally prefer these strategies: + + - bumping format version numbers of individual data files (e.g., + index, packfiles, etc). This restricts the incompatibilities only to + those files. + + - introducing new data that gracefully degrades when used by older + clients (e.g., pack bitmap files are ignored by older clients, which + simply do not take advantage of the optimization they provide). + +A whole-repository format version bump should only be part of a change +that cannot be independently versioned. For instance, if one were to +change the reachability rules for objects, or the rules for locking +refs, that would require a bump of the repository format version. + +Note that this applies only to accessing the repository's disk contents +directly. An older client which understands only format `0` may still +connect via `git://` to a repository using format `1`, as long as the +server process understands format `1`. + +The preferred strategy for rolling out a version bump (whether whole +repository or for a single file) is to teach git to read the new format, +and allow writing the new format with a config switch or command line +option (for experimentation or for those who do not care about backwards +compatibility with older gits). Then after a long period to allow the +reading capability to become common, we may switch to writing the new +format by default. + +The currently defined format versions are: + +=== Version `0` + +This is the format defined by the initial version of git, including but +not limited to the format of the repository directory, the repository +configuration file, and the object and ref storage. Specifying the +complete behavior of git is beyond the scope of this document. + +=== Version `1` + +This format is identical to version `0`, with the following exceptions: + + 1. When reading the `core.repositoryformatversion` variable, a git + implementation which supports version 1 MUST also read any + configuration keys found in the `extensions` section of the + configuration file. + + 2. If a version-1 repository specifies any `extensions.*` keys that + the running git has not implemented, the operation MUST NOT + proceed. Similarly, if the value of any known key is not understood + by the implementation, the operation MUST NOT proceed. + +Note that if no extensions are specified in the config file, then +`core.repositoryformatversion` SHOULD be set to `0` (setting it to `1` +provides no benefit, and makes the repository incompatible with older +implementations of git). + +This document will serve as the master list for extensions. Any +implementation wishing to define a new extension should make a note of +it here, in order to claim the name. + +The defined extensions are: + +==== `noop` + +This extension does not change git's behavior at all. It is useful only +for testing format-1 compatibility. + +==== `preciousObjects` + +When the config key `extensions.preciousObjects` is set to `true`, +objects in the repository MUST NOT be deleted (e.g., by `git-prune` or +`git repack -d`). + +==== `partialclone` + +When the config key `extensions.partialclone` is set, it indicates +that the repo was created with a partial clone (or later performed +a partial fetch) and that the remote may have omitted sending +certain unwanted objects. Such a remote is called a "promisor remote" +and it promises that all such omitted objects can be fetched from it +in the future. + +The value of this key is the name of the promisor remote. + +==== `worktreeConfig` + +If set, by default "git config" reads from both "config" and +"config.worktree" file from GIT_DIR in that order. In +multiple working directory mode, "config" file is shared while +"config.worktree" is per-working directory (i.e., it's in +GIT_COMMON_DIR/worktrees/<id>/config.worktree) diff --git a/Documentation/technical/rerere.txt b/Documentation/technical/rerere.txt new file mode 100644 index 000000000000..aa22d7ace893 --- /dev/null +++ b/Documentation/technical/rerere.txt @@ -0,0 +1,186 @@ +Rerere +====== + +This document describes the rerere logic. + +Conflict normalization +---------------------- + +To ensure recorded conflict resolutions can be looked up in the rerere +database, even when branches are merged in a different order, +different branches are merged that result in the same conflict, or +when different conflict style settings are used, rerere normalizes the +conflicts before writing them to the rerere database. + +Different conflict styles and branch names are normalized by stripping +the labels from the conflict markers, and removing the common ancestor +version from the `diff3` conflict style. Branches that are merged +in different order are normalized by sorting the conflict hunks. More +on each of those steps in the following sections. + +Once these two normalization operations are applied, a conflict ID is +calculated based on the normalized conflict, which is later used by +rerere to look up the conflict in the rerere database. + +Removing the common ancestor version +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Say we have three branches AB, AC and AC2. The common ancestor of +these branches has a file with a line containing the string "A" (for +brevity this is called "line A" in the rest of the document). In +branch AB this line is changed to "B", in AC, this line is changed to +"C", and branch AC2 is forked off of AC, after the line was changed to +"C". + +Forking a branch ABAC off of branch AB and then merging AC into it, we +get a conflict like the following: + + <<<<<<< HEAD + B + ======= + C + >>>>>>> AC + +Doing the analogous with AC2 (forking a branch ABAC2 off of branch AB +and then merging branch AC2 into it), using the diff3 conflict style, +we get a conflict like the following: + + <<<<<<< HEAD + B + ||||||| merged common ancestors + A + ======= + C + >>>>>>> AC2 + +By resolving this conflict, to leave line D, the user declares: + + After examining what branches AB and AC did, I believe that making + line A into line D is the best thing to do that is compatible with + what AB and AC wanted to do. + +As branch AC2 refers to the same commit as AC, the above implies that +this is also compatible what AB and AC2 wanted to do. + +By extension, this means that rerere should recognize that the above +conflicts are the same. To do this, the labels on the conflict +markers are stripped, and the common ancestor version is removed. The above +examples would both result in the following normalized conflict: + + <<<<<<< + B + ======= + C + >>>>>>> + +Sorting hunks +~~~~~~~~~~~~~ + +As before, lets imagine that a common ancestor had a file with line A +its early part, and line X in its late part. And then four branches +are forked that do these things: + + - AB: changes A to B + - AC: changes A to C + - XY: changes X to Y + - XZ: changes X to Z + +Now, forking a branch ABAC off of branch AB and then merging AC into +it, and forking a branch ACAB off of branch AC and then merging AB +into it, would yield the conflict in a different order. The former +would say "A became B or C, what now?" while the latter would say "A +became C or B, what now?" + +As a reminder, the act of merging AC into ABAC and resolving the +conflict to leave line D means that the user declares: + + After examining what branches AB and AC did, I believe that + making line A into line D is the best thing to do that is + compatible with what AB and AC wanted to do. + +So the conflict we would see when merging AB into ACAB should be +resolved the same way---it is the resolution that is in line with that +declaration. + +Imagine that similarly previously a branch XYXZ was forked from XY, +and XZ was merged into it, and resolved "X became Y or Z" into "X +became W". + +Now, if a branch ABXY was forked from AB and then merged XY, then ABXY +would have line B in its early part and line Y in its later part. +Such a merge would be quite clean. We can construct 4 combinations +using these four branches ((AB, AC) x (XY, XZ)). + +Merging ABXY and ACXZ would make "an early A became B or C, a late X +became Y or Z" conflict, while merging ACXY and ABXZ would make "an +early A became C or B, a late X became Y or Z". We can see there are +4 combinations of ("B or C", "C or B") x ("X or Y", "Y or X"). + +By sorting, the conflict is given its canonical name, namely, "an +early part became B or C, a late part becames X or Y", and whenever +any of these four patterns appear, and we can get to the same conflict +and resolution that we saw earlier. + +Without the sorting, we'd have to somehow find a previous resolution +from combinatorial explosion. + +Conflict ID calculation +~~~~~~~~~~~~~~~~~~~~~~~ + +Once the conflict normalization is done, the conflict ID is calculated +as the sha1 hash of the conflict hunks appended to each other, +separated by <NUL> characters. The conflict markers are stripped out +before the sha1 is calculated. So in the example above, where we +merge branch AC which changes line A to line C, into branch AB, which +changes line A to line C, the conflict ID would be +SHA1('B<NUL>C<NUL>'). + +If there are multiple conflicts in one file, the sha1 is calculated +the same way with all hunks appended to each other, in the order in +which they appear in the file, separated by a <NUL> character. + +Nested conflicts +~~~~~~~~~~~~~~~~ + +Nested conflicts are handled very similarly to "simple" conflicts. +Similar to simple conflicts, the conflict is first normalized by +stripping the labels from conflict markers, stripping the common ancestor +version, and the sorting the conflict hunks, both for the outer and the +inner conflict. This is done recursively, so any number of nested +conflicts can be handled. + +Note that this only works for conflict markers that "cleanly nest". If +there are any unmatched conflict markers, rerere will fail to handle +the conflict and record a conflict resolution. + +The only difference is in how the conflict ID is calculated. For the +inner conflict, the conflict markers themselves are not stripped out +before calculating the sha1. + +Say we have the following conflict for example: + + <<<<<<< HEAD + 1 + ======= + <<<<<<< HEAD + 3 + ======= + 2 + >>>>>>> branch-2 + >>>>>>> branch-3~ + +After stripping out the labels of the conflict markers, and sorting +the hunks, the conflict would look as follows: + + <<<<<<< + 1 + ======= + <<<<<<< + 2 + ======= + 3 + >>>>>>> + >>>>>>> + +and finally the conflict ID would be calculated as: +`sha1('1<NUL><<<<<<<\n3\n=======\n2\n>>>>>>><NUL>')` diff --git a/Documentation/technical/send-pack-pipeline.txt b/Documentation/technical/send-pack-pipeline.txt new file mode 100644 index 000000000000..9b5a0bc18667 --- /dev/null +++ b/Documentation/technical/send-pack-pipeline.txt @@ -0,0 +1,63 @@ +Git-send-pack internals +======================= + +Overall operation +----------------- + +. Connects to the remote side and invokes git-receive-pack. + +. Learns what refs the remote has and what commit they point at. + Matches them to the refspecs we are pushing. + +. Checks if there are non-fast-forwards. Unlike fetch-pack, + the repository send-pack runs in is supposed to be a superset + of the recipient in fast-forward cases, so there is no need + for want/have exchanges, and fast-forward check can be done + locally. Tell the result to the other end. + +. Calls pack_objects() which generates a packfile and sends it + over to the other end. + +. If the remote side is new enough (v1.1.0 or later), wait for + the unpack and hook status from the other end. + +. Exit with appropriate error codes. + + +Pack_objects pipeline +--------------------- + +This function gets one file descriptor (`fd`) which is either a +socket (over the network) or a pipe (local). What's written to +this fd goes to git-receive-pack to be unpacked. + + send-pack ---> fd ---> receive-pack + +The function pack_objects creates a pipe and then forks. The +forked child execs pack-objects with --revs to receive revision +parameters from its standard input. This process will write the +packfile to the other end. + + send-pack + | + pack_objects() ---> fd ---> receive-pack + | ^ (pipe) + v | + (child) + +The child dup2's to arrange its standard output to go back to +the other end, and read its standard input to come from the +pipe. After that it exec's pack-objects. On the other hand, +the parent process, before starting to feed the child pipeline, +closes the reading side of the pipe and fd to receive-pack. + + send-pack + | + pack_objects(parent) + | + v [0] + pack-objects [0] ---> receive-pack + + +[jc: the pipeline was much more complex and needed documentation before + I understood an earlier bug, but now it is trivial and straightforward.] diff --git a/Documentation/technical/shallow.txt b/Documentation/technical/shallow.txt new file mode 100644 index 000000000000..01dedfe9ffed --- /dev/null +++ b/Documentation/technical/shallow.txt @@ -0,0 +1,60 @@ +Shallow commits +=============== + +.Definition +********************************************************* +Shallow commits do have parents, but not in the shallow +repo, and therefore grafts are introduced pretending that +these commits have no parents. +********************************************************* + +$GIT_DIR/shallow lists commit object names and tells Git to +pretend as if they are root commits (e.g. "git log" traversal +stops after showing them; "git fsck" does not complain saying +the commits listed on their "parent" lines do not exist). + +Each line contains exactly one SHA-1. When read, a commit_graft +will be constructed, which has nr_parent < 0 to make it easier +to discern from user provided grafts. + +Note that the shallow feature could not be changed easily to +use replace refs: a commit containing a `mergetag` is not allowed +to be replaced, not even by a root commit. Such a commit can be +made shallow, though. Also, having a `shallow` file explicitly +listing all the commits made shallow makes it a *lot* easier to +do shallow-specific things such as to deepen the history. + +Since fsck-objects relies on the library to read the objects, +it honours shallow commits automatically. + +There are some unfinished ends of the whole shallow business: + +- maybe we have to force non-thin packs when fetching into a + shallow repo (ATM they are forced non-thin). + +- A special handling of a shallow upstream is needed. At some + stage, upload-pack has to check if it sends a shallow commit, + and it should send that information early (or fail, if the + client does not support shallow repositories). There is no + support at all for this in this patch series. + +- Instead of locking $GIT_DIR/shallow at the start, just + the timestamp of it is noted, and when it comes to writing it, + a check is performed if the mtime is still the same, dying if + it is not. + +- It is unclear how "push into/from a shallow repo" should behave. + +- If you deepen a history, you'd want to get the tags of the + newly stored (but older!) commits. This does not work right now. + +To make a shallow clone, you can call "git-clone --depth 20 repo". +The result contains only commit chains with a length of at most 20. +It also writes an appropriate $GIT_DIR/shallow. + +You can deepen a shallow repository with "git-fetch --depth 20 +repo branch", which will fetch branch from repo, but stop at depth +20, updating $GIT_DIR/shallow. + +The special depth 2147483647 (or 0x7fffffff, the largest positive +number a signed 32-bit integer can contain) means infinite depth. diff --git a/Documentation/technical/signature-format.txt b/Documentation/technical/signature-format.txt new file mode 100644 index 000000000000..2c9406a56a88 --- /dev/null +++ b/Documentation/technical/signature-format.txt @@ -0,0 +1,186 @@ +Git signature format +==================== + +== Overview + +Git uses cryptographic signatures in various places, currently objects (tags, +commits, mergetags) and transactions (pushes). In every case, the command which +is about to create an object or transaction determines a payload from that, +calls gpg to obtain a detached signature for the payload (`gpg -bsa`) and +embeds the signature into the object or transaction. + +Signatures always begin with `-----BEGIN PGP SIGNATURE-----` +and end with `-----END PGP SIGNATURE-----`, unless gpg is told to +produce RFC1991 signatures which use `MESSAGE` instead of `SIGNATURE`. + +The signed payload and the way the signature is embedded depends +on the type of the object resp. transaction. + +== Tag signatures + +- created by: `git tag -s` +- payload: annotated tag object +- embedding: append the signature to the unsigned tag object +- example: tag `signedtag` with subject `signed tag` + +---- +object 04b871796dc0420f8e7561a895b52484b701d51a +type commit +tag signedtag +tagger C O Mitter <committer@example.com> 1465981006 +0000 + +signed tag + +signed tag message body +-----BEGIN PGP SIGNATURE----- +Version: GnuPG v1 + +iQEcBAABAgAGBQJXYRhOAAoJEGEJLoW3InGJklkIAIcnhL7RwEb/+QeX9enkXhxn +rxfdqrvWd1K80sl2TOt8Bg/NYwrUBw/RWJ+sg/hhHp4WtvE1HDGHlkEz3y11Lkuh +8tSxS3qKTxXUGozyPGuE90sJfExhZlW4knIQ1wt/yWqM+33E9pN4hzPqLwyrdods +q8FWEqPPUbSJXoMbRPw04S5jrLtZSsUWbRYjmJCHzlhSfFWW4eFd37uquIaLUBS0 +rkC3Jrx7420jkIpgFcTI2s60uhSQLzgcCwdA2ukSYIRnjg/zDkj8+3h/GaROJ72x +lZyI6HWixKJkWw8lE9aAOD9TmTW9sFJwcVAzmAuFX2kUreDUKMZduGcoRYGpD7E= +=jpXa +-----END PGP SIGNATURE----- +---- + +- verify with: `git verify-tag [-v]` or `git tag -v` + +---- +gpg: Signature made Wed Jun 15 10:56:46 2016 CEST using RSA key ID B7227189 +gpg: Good signature from "Eris Discordia <discord@example.net>" +gpg: WARNING: This key is not certified with a trusted signature! +gpg: There is no indication that the signature belongs to the owner. +Primary key fingerprint: D4BE 2231 1AD3 131E 5EDA 29A4 6109 2E85 B722 7189 +object 04b871796dc0420f8e7561a895b52484b701d51a +type commit +tag signedtag +tagger C O Mitter <committer@example.com> 1465981006 +0000 + +signed tag + +signed tag message body +---- + +== Commit signatures + +- created by: `git commit -S` +- payload: commit object +- embedding: header entry `gpgsig` + (content is preceded by a space) +- example: commit with subject `signed commit` + +---- +tree eebfed94e75e7760540d1485c740902590a00332 +parent 04b871796dc0420f8e7561a895b52484b701d51a +author A U Thor <author@example.com> 1465981137 +0000 +committer C O Mitter <committer@example.com> 1465981137 +0000 +gpgsig -----BEGIN PGP SIGNATURE----- + Version: GnuPG v1 + + iQEcBAABAgAGBQJXYRjRAAoJEGEJLoW3InGJ3IwIAIY4SA6GxY3BjL60YyvsJPh/ + HRCJwH+w7wt3Yc/9/bW2F+gF72kdHOOs2jfv+OZhq0q4OAN6fvVSczISY/82LpS7 + DVdMQj2/YcHDT4xrDNBnXnviDO9G7am/9OE77kEbXrp7QPxvhjkicHNwy2rEflAA + zn075rtEERDHr8nRYiDh8eVrefSO7D+bdQ7gv+7GsYMsd2auJWi1dHOSfTr9HIF4 + HJhWXT9d2f8W+diRYXGh4X0wYiGg6na/soXc+vdtDYBzIxanRqjg8jCAeo1eOTk1 + EdTwhcTZlI0x5pvJ3H0+4hA2jtldVtmPM4OTB0cTrEWBad7XV6YgiyuII73Ve3I= + =jKHM + -----END PGP SIGNATURE----- + +signed commit + +signed commit message body +---- + +- verify with: `git verify-commit [-v]` (or `git show --show-signature`) + +---- +gpg: Signature made Wed Jun 15 10:58:57 2016 CEST using RSA key ID B7227189 +gpg: Good signature from "Eris Discordia <discord@example.net>" +gpg: WARNING: This key is not certified with a trusted signature! +gpg: There is no indication that the signature belongs to the owner. +Primary key fingerprint: D4BE 2231 1AD3 131E 5EDA 29A4 6109 2E85 B722 7189 +tree eebfed94e75e7760540d1485c740902590a00332 +parent 04b871796dc0420f8e7561a895b52484b701d51a +author A U Thor <author@example.com> 1465981137 +0000 +committer C O Mitter <committer@example.com> 1465981137 +0000 + +signed commit + +signed commit message body +---- + +== Mergetag signatures + +- created by: `git merge` on signed tag +- payload/embedding: the whole signed tag object is embedded into + the (merge) commit object as header entry `mergetag` +- example: merge of the signed tag `signedtag` as above + +---- +tree c7b1cff039a93f3600a1d18b82d26688668c7dea +parent c33429be94b5f2d3ee9b0adad223f877f174b05d +parent 04b871796dc0420f8e7561a895b52484b701d51a +author A U Thor <author@example.com> 1465982009 +0000 +committer C O Mitter <committer@example.com> 1465982009 +0000 +mergetag object 04b871796dc0420f8e7561a895b52484b701d51a + type commit + tag signedtag + tagger C O Mitter <committer@example.com> 1465981006 +0000 + + signed tag + + signed tag message body + -----BEGIN PGP SIGNATURE----- + Version: GnuPG v1 + + iQEcBAABAgAGBQJXYRhOAAoJEGEJLoW3InGJklkIAIcnhL7RwEb/+QeX9enkXhxn + rxfdqrvWd1K80sl2TOt8Bg/NYwrUBw/RWJ+sg/hhHp4WtvE1HDGHlkEz3y11Lkuh + 8tSxS3qKTxXUGozyPGuE90sJfExhZlW4knIQ1wt/yWqM+33E9pN4hzPqLwyrdods + q8FWEqPPUbSJXoMbRPw04S5jrLtZSsUWbRYjmJCHzlhSfFWW4eFd37uquIaLUBS0 + rkC3Jrx7420jkIpgFcTI2s60uhSQLzgcCwdA2ukSYIRnjg/zDkj8+3h/GaROJ72x + lZyI6HWixKJkWw8lE9aAOD9TmTW9sFJwcVAzmAuFX2kUreDUKMZduGcoRYGpD7E= + =jpXa + -----END PGP SIGNATURE----- + +Merge tag 'signedtag' into downstream + +signed tag + +signed tag message body + +# gpg: Signature made Wed Jun 15 08:56:46 2016 UTC using RSA key ID B7227189 +# gpg: Good signature from "Eris Discordia <discord@example.net>" +# gpg: WARNING: This key is not certified with a trusted signature! +# gpg: There is no indication that the signature belongs to the owner. +# Primary key fingerprint: D4BE 2231 1AD3 131E 5EDA 29A4 6109 2E85 B722 7189 +---- + +- verify with: verification is embedded in merge commit message by default, + alternatively with `git show --show-signature`: + +---- +commit 9863f0c76ff78712b6800e199a46aa56afbcbd49 +merged tag 'signedtag' +gpg: Signature made Wed Jun 15 10:56:46 2016 CEST using RSA key ID B7227189 +gpg: Good signature from "Eris Discordia <discord@example.net>" +gpg: WARNING: This key is not certified with a trusted signature! +gpg: There is no indication that the signature belongs to the owner. +Primary key fingerprint: D4BE 2231 1AD3 131E 5EDA 29A4 6109 2E85 B722 7189 +Merge: c33429b 04b8717 +Author: A U Thor <author@example.com> +Date: Wed Jun 15 09:13:29 2016 +0000 + + Merge tag 'signedtag' into downstream + + signed tag + + signed tag message body + + # gpg: Signature made Wed Jun 15 08:56:46 2016 UTC using RSA key ID B7227189 + # gpg: Good signature from "Eris Discordia <discord@example.net>" + # gpg: WARNING: This key is not certified with a trusted signature! + # gpg: There is no indication that the signature belongs to the owner. + # Primary key fingerprint: D4BE 2231 1AD3 131E 5EDA 29A4 6109 2E85 B722 7189 +---- diff --git a/Documentation/technical/trivial-merge.txt b/Documentation/technical/trivial-merge.txt new file mode 100644 index 000000000000..1f1c33d0da30 --- /dev/null +++ b/Documentation/technical/trivial-merge.txt @@ -0,0 +1,121 @@ +Trivial merge rules +=================== + +This document describes the outcomes of the trivial merge logic in read-tree. + +One-way merge +------------- + +This replaces the index with a different tree, keeping the stat info +for entries that don't change, and allowing -u to make the minimum +required changes to the working tree to have it match. + +Entries marked '+' have stat information. Spaces marked '*' don't +affect the result. + + index tree result + ----------------------- + * (empty) (empty) + (empty) tree tree + index+ tree tree + index+ index index+ + +Two-way merge +------------- + +It is permitted for the index to lack an entry; this does not prevent +any case from applying. + +If the index exists, it is an error for it not to match either the old +or the result. + +If multiple cases apply, the one used is listed first. + +A result which changes the index is an error if the index is not empty +and not up to date. + +Entries marked '+' have stat information. Spaces marked '*' don't +affect the result. + + case index old new result + ------------------------------------- + 0/2 (empty) * (empty) (empty) + 1/3 (empty) * new new + 4/5 index+ (empty) (empty) index+ + 6/7 index+ (empty) index index+ + 10 index+ index (empty) (empty) + 14/15 index+ old old index+ + 18/19 index+ old index index+ + 20 index+ index new new + +Three-way merge +--------------- + +It is permitted for the index to lack an entry; this does not prevent +any case from applying. + +If the index exists, it is an error for it not to match either the +head or (if the merge is trivial) the result. + +If multiple cases apply, the one used is listed first. + +A result of "no merge" means that index is left in stage 0, ancest in +stage 1, head in stage 2, and remote in stage 3 (if any of these are +empty, no entry is left for that stage). Otherwise, the given entry is +left in stage 0, and there are no other entries. + +A result of "no merge" is an error if the index is not empty and not +up to date. + +*empty* means that the tree must not have a directory-file conflict + with the entry. + +For multiple ancestors, a '+' means that this case applies even if +only one ancestor or remote fits; a '^' means all of the ancestors +must be the same. + + case ancest head remote result + ---------------------------------------- + 1 (empty)+ (empty) (empty) (empty) + 2ALT (empty)+ *empty* remote remote + 2 (empty)^ (empty) remote no merge + 3ALT (empty)+ head *empty* head + 3 (empty)^ head (empty) no merge + 4 (empty)^ head remote no merge + 5ALT * head head head + 6 ancest+ (empty) (empty) no merge + 8 ancest^ (empty) ancest no merge + 7 ancest+ (empty) remote no merge + 10 ancest^ ancest (empty) no merge + 9 ancest+ head (empty) no merge + 16 anc1/anc2 anc1 anc2 no merge + 13 ancest+ head ancest head + 14 ancest+ ancest remote remote + 11 ancest+ head remote no merge + +Only #2ALT and #3ALT use *empty*, because these are the only cases +where there can be conflicts that didn't exist before. Note that we +allow directory-file conflicts between things in different stages +after the trivial merge. + +A possible alternative for #6 is (empty), which would make it like +#1. This is not used, due to the likelihood that it arises due to +moving the file to multiple different locations or moving and deleting +it in different branches. + +Case #1 is included for completeness, and also in case we decide to +put on '+' markings; any path that is never mentioned at all isn't +handled. + +Note that #16 is when both #13 and #14 apply; in this case, we refuse +the trivial merge, because we can't tell from this data which is +right. This is a case of a reverted patch (in some direction, maybe +multiple times), and the right answer depends on looking at crossings +of history or common ancestors of the ancestors. + +Note that, between #6, #7, #9, and #11, all cases not otherwise +covered are handled in this table. + +For #8 and #10, there is alternative behavior, not currently +implemented, where the result is (empty). As currently implemented, +the automatic merge will generally give this effect. |