Diffstat (limited to 'third_party/git/Documentation/technical/hash-function-transition.txt')
-rw-r--r-- | third_party/git/Documentation/technical/hash-function-transition.txt | 827 |
1 files changed, 0 insertions, 827 deletions
diff --git a/third_party/git/Documentation/technical/hash-function-transition.txt b/third_party/git/Documentation/technical/hash-function-transition.txt deleted file mode 100644 index 6fd20ebbc254..000000000000 --- a/third_party/git/Documentation/technical/hash-function-transition.txt +++ /dev/null @@ -1,827 +0,0 @@ -Git hash function transition -============================ - -Objective ---------- -Migrate Git from SHA-1 to a stronger hash function. - -Background ----------- -At its core, the Git version control system is a content addressable -filesystem. It uses the SHA-1 hash function to name content. For -example, files, directories, and revisions are referred to by hash -values unlike in other traditional version control systems where files -or versions are referred to via sequential numbers. The use of a hash -function to address its content delivers a few advantages: - -* Integrity checking is easy. Bit flips, for example, are easily - detected, as the hash of corrupted content does not match its name. -* Lookup of objects is fast. - -Using a cryptographically secure hash function brings additional -advantages: - -* Object names can be signed and third parties can trust the hash to - address the signed object and all objects it references. -* Communication using Git protocol and out of band communication - methods have a short reliable string that can be used to reliably - address stored content. - -Over time some flaws in SHA-1 have been discovered by security -researchers. On 23 February 2017 the SHAttered attack -(https://shattered.io) demonstrated a practical SHA-1 hash collision. - -Git v2.13.0 and later subsequently moved to a hardened SHA-1 -implementation by default, which isn't vulnerable to the SHAttered -attack. 
- -Thus Git has in effect already migrated to a new hash that isn't SHA-1 -and doesn't share its vulnerabilities, its new hash function just -happens to produce exactly the same output for all known inputs, -except two PDFs published by the SHAttered researchers, and the new -implementation (written by those researchers) claims to detect future -cryptanalytic collision attacks. - -Regardless, it's considered prudent to move past any variant of SHA-1 -to a new hash. There's no guarantee that future attacks on SHA-1 won't -be published in the future, and those attacks may not have viable -mitigations. - -If SHA-1 and its variants were to be truly broken, Git's hash function -could not be considered cryptographically secure any more. This would -impact the communication of hash values because we could not trust -that a given hash value represented the known good version of content -that the speaker intended. - -SHA-1 still possesses the other properties such as fast object lookup -and safe error checking, but other hash functions are equally suitable -that are believed to be cryptographically secure. - -Goals ------ -1. The transition to SHA-256 can be done one local repository at a time. - a. Requiring no action by any other party. - b. A SHA-256 repository can communicate with SHA-1 Git servers - (push/fetch). - c. Users can use SHA-1 and SHA-256 identifiers for objects - interchangeably (see "Object names on the command line", below). - d. New signed objects make use of a stronger hash function than - SHA-1 for their security guarantees. -2. Allow a complete transition away from SHA-1. - a. Local metadata for SHA-1 compatibility can be removed from a - repository if compatibility with SHA-1 is no longer needed. -3. Maintainability throughout the process. - a. The object format is kept simple and consistent. - b. Creation of a generalized repository conversion tool. - -Non-Goals ---------- -1. Add SHA-256 support to Git protocol. 
This is valuable and the - logical next step but it is out of scope for this initial design. -2. Transparently improving the security of existing SHA-1 signed - objects. -3. Intermixing objects using multiple hash functions in a single - repository. -4. Taking the opportunity to fix other bugs in Git's formats and - protocols. -5. Shallow clones and fetches into a SHA-256 repository. (This will - change when we add SHA-256 support to Git protocol.) -6. Skip fetching some submodules of a project into a SHA-256 - repository. (This also depends on SHA-256 support in Git - protocol.) - -Overview --------- -We introduce a new repository format extension. Repositories with this -extension enabled use SHA-256 instead of SHA-1 to name their objects. -This affects both object names and object content --- both the names -of objects and all references to other objects within an object are -switched to the new hash function. - -SHA-256 repositories cannot be read by older versions of Git. - -Alongside the packfile, a SHA-256 repository stores a bidirectional -mapping between SHA-256 and SHA-1 object names. The mapping is generated -locally and can be verified using "git fsck". Object lookups use this -mapping to allow naming objects using either their SHA-1 or SHA-256 names -interchangeably. - -"git cat-file" and "git hash-object" gain options to display an object -in its sha1 form and write an object given its sha1 form. This -requires all objects referenced by that object to be present in the -object database so that they can be named using the appropriate name -(using the bidirectional hash mapping). - -Fetches from a SHA-1 based server convert the fetched objects into -SHA-256 form and record the mapping in the bidirectional mapping table -(see below for details). Pushes to a SHA-1 based server convert the -objects being pushed into sha1 form so the server does not have to be -aware of the hash function the client is using. 
- -Detailed Design ---------------- -Repository format extension -~~~~~~~~~~~~~~~~~~~~~~~~~~~ -A SHA-256 repository uses repository format version `1` (see -Documentation/technical/repository-version.txt) with extensions -`objectFormat` and `compatObjectFormat`: - - [core] - repositoryFormatVersion = 1 - [extensions] - objectFormat = sha256 - compatObjectFormat = sha1 - -The combination of setting `core.repositoryFormatVersion=1` and -populating `extensions.*` ensures that all versions of Git later than -`v0.99.9l` will die instead of trying to operate on the SHA-256 -repository, instead producing an error message. - - # Between v0.99.9l and v2.7.0 - $ git status - fatal: Expected git repo version <= 0, found 1 - # After v2.7.0 - $ git status - fatal: unknown repository extensions found: - objectformat - compatobjectformat - -See the "Transition plan" section below for more details on these -repository extensions. - -Object names -~~~~~~~~~~~~ -Objects can be named by their 40 hexadecimal digit sha1-name or 64 -hexadecimal digit sha256-name, plus names derived from those (see -gitrevisions(7)). - -The sha1-name of an object is the SHA-1 of the concatenation of its -type, length, a nul byte, and the object's sha1-content. This is the -traditional <sha1> used in Git to name objects. - -The sha256-name of an object is the SHA-256 of the concatenation of its -type, length, a nul byte, and the object's sha256-content. - -Object format -~~~~~~~~~~~~~ -The content as a byte sequence of a tag, commit, or tree object named -by sha1 and sha256 differ because an object named by sha256-name refers to -other objects by their sha256-names and an object named by sha1-name -refers to other objects by their sha1-names. - -The sha256-content of an object is the same as its sha1-content, except -that objects referenced by the object are named using their sha256-names -instead of sha1-names. 
Because a blob object does not refer to any -other object, its sha1-content and sha256-content are the same. - -The format allows round-trip conversion between sha256-content and -sha1-content. - -Object storage -~~~~~~~~~~~~~~ -Loose objects use zlib compression and packed objects use the packed -format described in Documentation/technical/pack-format.txt, just like -today. The content that is compressed and stored uses sha256-content -instead of sha1-content. - -Pack index -~~~~~~~~~~ -Pack index (.idx) files use a new v3 format that supports multiple -hash functions. They have the following format (all integers are in -network byte order): - -- A header appears at the beginning and consists of the following: - - The 4-byte pack index signature: '\377t0c' - - 4-byte version number: 3 - - 4-byte length of the header section, including the signature and - version number - - 4-byte number of objects contained in the pack - - 4-byte number of object formats in this pack index: 2 - - For each object format: - - 4-byte format identifier (e.g., 'sha1' for SHA-1) - - 4-byte length in bytes of shortened object names. This is the - shortest possible length needed to make names in the shortened - object name table unambiguous. - - 4-byte integer, recording where tables relating to this format - are stored in this index file, as an offset from the beginning. - - 4-byte offset to the trailer from the beginning of this file. - - Zero or more additional key/value pairs (4-byte key, 4-byte - value). Only one key is supported: 'PSRC'. See the "Loose objects - and unreachable objects" section for supported values and how this - is used. All other keys are reserved. Readers must ignore - unrecognized keys. -- Zero or more NUL bytes. This can optionally be used to improve the - alignment of the full object name table below. -- Tables for the first object format: - - A sorted table of shortened object names. 
These are prefixes of - the names of all objects in this pack file, packed together - without offset values to reduce the cache footprint of the binary - search for a specific object name. - - - A table of full object names in pack order. This allows resolving - a reference to "the nth object in the pack file" (from a - reachability bitmap or from the next table of another object - format) to its object name. - - - A table of 4-byte values mapping object name order to pack order. - For an object in the table of sorted shortened object names, the - value at the corresponding index in this table is the index in the - previous table for that same object. - - This can be used to look up the object in reachability bitmaps or - to look up its name in another object format. - - - A table of 4-byte CRC32 values of the packed object data, in the - order that the objects appear in the pack file. This is to allow - compressed data to be copied directly from pack to pack during - repacking without undetected data corruption. - - - A table of 4-byte offset values. For an object in the table of - sorted shortened object names, the value at the corresponding - index in this table indicates where that object can be found in - the pack file. These are usually 31-bit pack file offsets, but - large offsets are encoded as an index into the next table with the - most significant bit set. - - - A table of 8-byte offset entries (empty for pack files less than - 2 GiB). Pack files are organized with heavily used objects toward - the front, so most object references should not need to refer to - this table. -- Zero or more NUL bytes. -- Tables for the second object format, with the same layout as above, - up to and not including the table of CRC32 values. -- Zero or more NUL bytes. -- The trailer consists of the following: - - A copy of the 32-byte SHA-256 checksum at the end of the - corresponding packfile. - - - 32-byte SHA-256 checksum of all of the above. 
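The offset encoding used by the two offset tables above can be sketched as follows. This is a minimal illustration, not Git's implementation: the function name `resolve_offset` and the idea of passing each table as a separate byte string are assumptions made for the example. Integers are read in network byte order, as the format requires.

```python
import struct

LARGE_OFFSET_FLAG = 0x80000000  # most significant bit of a 4-byte entry


def resolve_offset(index, offset_table, large_offset_table):
    """Return the pack file offset for the object at `index`.

    `offset_table` holds 4-byte big-endian entries and
    `large_offset_table` holds 8-byte big-endian entries.
    Both parameter names are hypothetical.
    """
    (entry,) = struct.unpack_from(">I", offset_table, 4 * index)
    if entry & LARGE_OFFSET_FLAG:
        # The low 31 bits are an index into the 8-byte table.
        large_index = entry & 0x7FFFFFFF
        (offset,) = struct.unpack_from(">Q", large_offset_table, 8 * large_index)
        return offset
    # Otherwise the entry is itself a 31-bit pack file offset.
    return entry


# Usage: one small offset and one offset beyond 2 GiB.
offsets = struct.pack(">II", 12, LARGE_OFFSET_FLAG | 0)
large = struct.pack(">Q", 5 * 2**30)
small_result = resolve_offset(0, offsets, large)
large_result = resolve_offset(1, offsets, large)
```

Keeping the 8-byte entries in a separate table means that packs smaller than 2 GiB pay only 4 bytes per object for offsets.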
- -Loose object index -~~~~~~~~~~~~~~~~~~ -A new file $GIT_OBJECT_DIR/loose-object-idx contains information about -all loose objects. Its format is - - # loose-object-idx - (sha256-name SP sha1-name LF)* - -where the object names are in hexadecimal format. The file is not -sorted. - -The loose object index is protected against concurrent writes by a -lock file $GIT_OBJECT_DIR/loose-object-idx.lock. To add a new loose -object: - -1. Write the loose object to a temporary file, like today. -2. Open loose-object-idx.lock with O_CREAT | O_EXCL to acquire the lock. -3. Rename the loose object into place. -4. Open loose-object-idx with O_APPEND and write the new object -5. Unlink loose-object-idx.lock to release the lock. - -To remove entries (e.g. in "git pack-refs" or "git-prune"): - -1. Open loose-object-idx.lock with O_CREAT | O_EXCL to acquire the - lock. -2. Write the new content to loose-object-idx.lock. -3. Unlink any loose objects being removed. -4. Rename to replace loose-object-idx, releasing the lock. - -Translation table -~~~~~~~~~~~~~~~~~ -The index files support a bidirectional mapping between sha1-names -and sha256-names. The lookup proceeds similarly to ordinary object -lookups. For example, to convert a sha1-name to a sha256-name: - - 1. Look for the object in idx files. If a match is present in the - idx's sorted list of truncated sha1-names, then: - a. Read the corresponding entry in the sha1-name order to pack - name order mapping. - b. Read the corresponding entry in the full sha1-name table to - verify we found the right object. If it is, then - c. Read the corresponding entry in the full sha256-name table. - That is the object's sha256-name. - 2. Check for a loose object. Read lines from loose-object-idx until - we find a match. - -Step (1) takes the same amount of time as an ordinary object lookup: -O(number of packs * log(objects per pack)). Step (2) takes O(number of -loose objects) time. 
To maintain good performance it will be necessary -to keep the number of loose objects low. See the "Loose objects and -unreachable objects" section below for more details. - -Since all operations that make new objects (e.g., "git commit") add -the new objects to the corresponding index, this mapping is possible -for all objects in the object store. - -Reading an object's sha1-content -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -The sha1-content of an object can be read by converting all sha256-names -its sha256-content references to sha1-names using the translation table. - -Fetch -~~~~~ -Fetching from a SHA-1 based server requires translating between SHA-1 -and SHA-256 based representations on the fly. - -SHA-1s named in the ref advertisement that are present on the client -can be translated to SHA-256 and looked up as local objects using the -translation table. - -Negotiation proceeds as today. Any "have"s generated locally are -converted to SHA-1 before being sent to the server, and SHA-1s -mentioned by the server are converted to SHA-256 when looking them up -locally. - -After negotiation, the server sends a packfile containing the -requested objects. We convert the packfile to SHA-256 format using -the following steps: - -1. index-pack: inflate each object in the packfile and compute its - SHA-1. Objects can contain deltas in OBJ_REF_DELTA format against - objects the client has locally. These objects can be looked up - using the translation table and their sha1-content read as - described above to resolve the deltas. -2. topological sort: starting at the "want"s from the negotiation - phase, walk through objects in the pack and emit a list of them, - excluding blobs, in reverse topologically sorted order, with each - object coming later in the list than all objects it references. - (This list only contains objects reachable from the "wants". If the - pack from the server contained additional extraneous objects, then - they will be discarded.) -3. 
convert to sha256: open a new (sha256) packfile. Read the topologically - sorted list just generated. For each object, inflate its - sha1-content, convert to sha256-content, and write it to the sha256 - pack. Record the new sha1<->sha256 mapping entry for use in the idx. -4. sort: reorder entries in the new pack to match the order of objects - in the pack the server generated and include blobs. Write a sha256 idx - file -5. clean up: remove the SHA-1 based pack file, index, and - topologically sorted list obtained from the server in steps 1 - and 2. - -Step 3 requires every object referenced by the new object to be in the -translation table. This is why the topological sort step is necessary. - -As an optimization, step 1 could write a file describing what non-blob -objects each object it has inflated from the packfile references. This -makes the topological sort in step 2 possible without inflating the -objects in the packfile for a second time. The objects need to be -inflated again in step 3, for a total of two inflations. - -Step 4 is probably necessary for good read-time performance. "git -pack-objects" on the server optimizes the pack file for good data -locality (see Documentation/technical/pack-heuristics.txt). - -Details of this process are likely to change. It will take some -experimenting to get this to perform well. - -Push -~~~~ -Push is simpler than fetch because the objects referenced by the -pushed objects are already in the translation table. The sha1-content -of each object being pushed can be read as described in the "Reading -an object's sha1-content" section to generate the pack written by git -send-pack. - -Signed Commits -~~~~~~~~~~~~~~ -We add a new field "gpgsig-sha256" to the commit object format to allow -signing commits without relying on SHA-1. It is similar to the -existing "gpgsig" field. Its signed payload is the sha256-content of the -commit object with any "gpgsig" and "gpgsig-sha256" fields removed. 
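As a rough sketch of how that signed payload could be derived, the function below drops the "gpgsig" and "gpgsig-sha256" header fields from a commit's content. The function name `signed_payload` is hypothetical, and the sketch assumes the usual commit layout in which header lines are separated from the message by an empty line and multi-line header values continue on lines that begin with a space.

```python
SIGNATURE_FIELDS = (b"gpgsig", b"gpgsig-sha256")


def signed_payload(commit_content: bytes) -> bytes:
    """Return the commit content with signature header fields removed."""
    header, separator, message = commit_content.partition(b"\n\n")
    kept = []
    skipping = False
    for line in header.split(b"\n"):
        if line.startswith(b" "):
            # Continuation line: it shares the fate of the field it extends.
            if not skipping:
                kept.append(line)
            continue
        field = line.split(b" ", 1)[0]
        skipping = field in SIGNATURE_FIELDS
        if not skipping:
            kept.append(line)
    return b"\n".join(kept) + separator + message


# Usage with a made-up commit carrying both signature fields.
example = (
    b"tree " + b"a" * 64 + b"\n"
    b"author A <a@example.com> 0 +0000\n"
    b"committer A <a@example.com> 0 +0000\n"
    b"gpgsig -----BEGIN PGP SIGNATURE-----\n \n abc\n -----END PGP SIGNATURE-----\n"
    b"gpgsig-sha256 -----BEGIN PGP SIGNATURE-----\n \n def\n -----END PGP SIGNATURE-----\n"
    b"\nmessage\n"
)
payload = signed_payload(example)
```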
- -This means commits can be signed -1. using SHA-1 only, as in existing signed commit objects -2. using both SHA-1 and SHA-256, by using both gpgsig-sha256 and gpgsig - fields. -3. using only SHA-256, by only using the gpgsig-sha256 field. - -Old versions of "git verify-commit" can verify the gpgsig signature in -cases (1) and (2) without modifications and view case (3) as an -ordinary unsigned commit. - -Signed Tags -~~~~~~~~~~~ -We add a new field "gpgsig-sha256" to the tag object format to allow -signing tags without relying on SHA-1. Its signed payload is the -sha256-content of the tag with its gpgsig-sha256 field and "-----BEGIN PGP -SIGNATURE-----" delimited in-body signature removed. - -This means tags can be signed -1. using SHA-1 only, as in existing signed tag objects -2. using both SHA-1 and SHA-256, by using gpgsig-sha256 and an in-body - signature. -3. using only SHA-256, by only using the gpgsig-sha256 field. - -Mergetag embedding -~~~~~~~~~~~~~~~~~~ -The mergetag field in the sha1-content of a commit contains the -sha1-content of a tag that was merged by that commit. - -The mergetag field in the sha256-content of the same commit contains the -sha256-content of the same tag. - -Submodules -~~~~~~~~~~ -To convert recorded submodule pointers, you need to have the converted -submodule repository in place. The translation table of the submodule -can be used to look up the new hash. - -Loose objects and unreachable objects -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Fast lookups in the loose-object-idx require that the number of loose -objects not grow too high. - -"git gc --auto" currently waits for there to be 6700 loose objects -present before consolidating them into a packfile. We will need to -measure to find a more appropriate threshold for it to use. - -"git gc --auto" currently waits for there to be 50 packs present -before combining packfiles. Packing loose objects more aggressively -may cause the number of pack files to grow too quickly. 
This can be -mitigated by using a strategy similar to Martin Fick's exponential -rolling garbage collection script: -https://gerrit-review.googlesource.com/c/gerrit/+/35215 - -"git gc" currently expels any unreachable objects it encounters in -pack files to loose objects in an attempt to prevent a race when -pruning them (in case another process is simultaneously writing a new -object that refers to the about-to-be-deleted object). This leads to -an explosion in the number of loose objects present and disk space -usage due to the objects in delta form being replaced with independent -loose objects. Worse, the race is still present for loose objects. - -Instead, "git gc" will need to move unreachable objects to a new -packfile marked as UNREACHABLE_GARBAGE (using the PSRC field; see -below). To avoid the race when writing new objects referring to an -about-to-be-deleted object, code paths that write new objects will -need to copy any objects from UNREACHABLE_GARBAGE packs that they -refer to into new, non-UNREACHABLE_GARBAGE packs (or loose objects). -UNREACHABLE_GARBAGE packs are then safe to delete if their creation time (as -indicated by the file's mtime) is long enough ago. - -To avoid a proliferation of UNREACHABLE_GARBAGE packs, they can be -combined under certain circumstances. If "gc.garbageTtl" is set to -greater than one day, then packs created within a single calendar day, -UTC, can be coalesced together. The resulting packfile would have an -mtime before midnight on that day, so this makes the effective maximum -ttl the garbageTtl + 1 day. If "gc.garbageTtl" is less than one day, -then we divide the calendar day into intervals one-third of that ttl -in duration. Packs created within the same interval can be coalesced -together. The resulting packfile would have an mtime before the end of -the interval, so this makes the effective maximum ttl equal to the -garbageTtl * 4/3. 
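The coalescing rule can be worked through with a small sketch. The names `coalescing_interval` and `effective_max_ttl` are hypothetical, times are Unix timestamps in seconds (so day boundaries fall on UTC midnight), and treating a ttl of exactly one day like the "greater than one day" case is an assumption, since the text does not cover it.

```python
from typing import Tuple

DAY = 24 * 60 * 60  # seconds in one calendar day


def interval_length(garbage_ttl: int) -> int:
    """Length of a coalescing interval for the given ttl, in seconds."""
    if garbage_ttl >= DAY:  # assumption: exactly one day counts as "long"
        return DAY
    return max(1, garbage_ttl // 3)


def coalescing_interval(created_at: int, garbage_ttl: int) -> Tuple[int, int]:
    """Return (start, end) of the interval containing `created_at`.

    Packs whose creation times fall in the same interval may be combined.
    """
    length = interval_length(garbage_ttl)
    day_start = created_at - created_at % DAY
    start = day_start + ((created_at - day_start) // length) * length
    # Intervals restart at midnight, so the last one may be cut short.
    end = min(start + length, day_start + DAY)
    return start, end


def effective_max_ttl(garbage_ttl: int) -> int:
    """Worst case lifetime of an object placed in a coalesced pack."""
    return garbage_ttl + interval_length(garbage_ttl)


# Usage: a 6 hour ttl yields 2 hour intervals and an 8 hour maximum.
six_hours = 6 * 60 * 60
short_ttl_interval = coalescing_interval(100000, six_hours)
short_ttl_maximum = effective_max_ttl(six_hours)
```

With a 6 hour ttl, the maximum of 8 hours matches the stated garbageTtl * 4/3, and with a 2 day ttl the maximum is 3 days, matching garbageTtl + 1 day.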
- -This rule comes from Thirumala Reddy Mutchukota's JGit change -https://git.eclipse.org/r/90465. - -The UNREACHABLE_GARBAGE setting goes in the PSRC field of the pack -index. More generally, that field indicates where a pack came from: - - - 1 (PACK_SOURCE_RECEIVE) for a pack received over the network - - 2 (PACK_SOURCE_AUTO) for a pack created by a lightweight - "gc --auto" operation - - 3 (PACK_SOURCE_GC) for a pack created by a full gc - - 4 (PACK_SOURCE_UNREACHABLE_GARBAGE) for potential garbage - discovered by gc - - 5 (PACK_SOURCE_INSERT) for locally created objects that were - written directly to a pack file, e.g. from "git add ." - -This information can be useful for debugging and for "gc --auto" to -make appropriate choices about which packs to coalesce. - -Caveats -------- -Invalid objects -~~~~~~~~~~~~~~~ -The conversion from sha1-content to sha256-content retains any -brokenness in the original object (e.g., tree entry modes encoded with -leading 0, tree objects whose paths are not sorted correctly, and -commit objects without an author or committer). This is a deliberate -feature of the design to allow the conversion to round-trip. - -More profoundly broken objects (e.g., a commit with a truncated "tree" -header line) cannot be converted but were not usable by current Git -anyway. - -Shallow clone and submodules -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Because it requires all referenced objects to be available in the -locally generated translation table, this design does not support -shallow clone or unfetched submodules. Protocol improvements might -allow lifting this restriction. - -Alternates -~~~~~~~~~~ -For the same reason, a sha256 repository cannot borrow objects from a -sha1 repository using objects/info/alternates or -$GIT_ALTERNATE_OBJECT_REPOSITORIES. - -git notes -~~~~~~~~~ -The "git notes" tool annotates objects using their sha1-name as key. -This design does not describe a way to migrate notes trees to use -sha256-names. 
That migration is expected to happen separately (for -example using a file at the root of the notes tree to describe which -hash it uses). - -Server-side cost -~~~~~~~~~~~~~~~~ -Until Git protocol gains SHA-256 support, using SHA-256 based storage -on public-facing Git servers is strongly discouraged. Once Git -protocol gains SHA-256 support, SHA-256 based servers are likely not -to support SHA-1 compatibility, to avoid what may be a very expensive -hash re-encode during clone and to encourage peers to modernize. - -The design described here allows fetches by SHA-1 clients of a -personal SHA-256 repository because it's not much more difficult than -allowing pushes from that repository. This support needs to be guarded -by a configuration option --- servers like git.kernel.org that serve a -large number of clients would not be expected to bear that cost. - -Meaning of signatures -~~~~~~~~~~~~~~~~~~~~~ -The signed payload for signed commits and tags does not explicitly -name the hash used to identify objects. If some day Git adopts a new -hash function with the same length as the current SHA-1 (40 -hexadecimal digit) or SHA-256 (64 hexadecimal digit) objects then the -intent behind the PGP signed payload in an object signature is -unclear: - - object e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7 - type commit - tag v2.12.0 - tagger Junio C Hamano <gitster@pobox.com> 1487962205 -0800 - - Git 2.12 - -Does this mean Git v2.12.0 is the commit with sha1-name -e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7 or the commit with -new-40-digit-hash-name e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7? - -Fortunately SHA-256 and SHA-1 have different lengths. If Git starts -using another hash with the same length to name objects, then it will -need to change the format of signed payloads using that hash to -address this issue. 
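Since the two hash functions produce names of different lengths, the length of a full object name is enough to tell which one was used. The sketch below illustrates this together with the object naming rule from the "Object names" section. The function names are hypothetical, and the header layout (type, a space, the decimal length, a NUL byte) is the one Git uses today. A blob is used because its sha1-content and sha256-content are identical.

```python
import hashlib


def object_name(algorithm: str, object_type: str, content: bytes) -> str:
    """Hash the type, length, a NUL byte, and the content."""
    header = f"{object_type} {len(content)}".encode("ascii") + b"\x00"
    return hashlib.new(algorithm, header + content).hexdigest()


def object_format_of(name: str) -> str:
    """Guess the object format of a full hexadecimal object name."""
    if len(name) == 40:
        return "sha1"
    if len(name) == 64:
        return "sha256"
    raise ValueError("not a full sha1-name or sha256-name")


# Usage: name the empty blob under both hash functions.
sha1_name = object_name("sha1", "blob", b"")
sha256_name = object_name("sha256", "blob", b"")
formats = (object_format_of(sha1_name), object_format_of(sha256_name))
```

This check only works for full names. An abbreviated name carries no such hint, which is one reason the design offers explicit ^{sha1} and ^{sha256} notation.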
- -Object names on the command line -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -To support the transition (see Transition plan below), this design -supports four different modes of operation: - - 1. ("dark launch") Treat object names input by the user as SHA-1 and - convert any object names written to output to SHA-1, but store - objects using SHA-256. This allows users to test the code with no - visible behavior change except for performance. It also - allows running even tests that assume the SHA-1 hash function, to - sanity-check the behavior of the new mode. - - 2. ("early transition") Allow both SHA-1 and SHA-256 object names in - input. Any object names written to output use SHA-1. This allows - users to continue to make use of SHA-1 to communicate with peers - (e.g. by email) that have not migrated yet and prepares for mode 3. - - 3. ("late transition") Allow both SHA-1 and SHA-256 object names in - input. Any object names written to output use SHA-256. In this - mode, users are using a more secure object naming method by - default. The disruption is minimal as long as most of their peers - are in mode 2 or mode 3. - - 4. ("post-transition") Treat object names input by the user as - SHA-256 and write output using SHA-256. This is safer than mode 3 - because there is less risk that input is incorrectly interpreted - using the wrong hash function. - -The mode is specified in configuration. - -The user can also explicitly specify which format to use for a -particular revision specifier and for output, overriding the mode. For -example: - -git --output-format=sha1 log abac87a^{sha1}..f787cac^{sha256} - -Choice of Hash --------------- -In early 2005, around the time that Git was written, Xiaoyun Wang, -Yiqun Lisa Yin, and Hongbo Yu announced an attack finding SHA-1 -collisions in 2^69 operations. In August they published details. -Luckily, no practical demonstrations of a collision in full SHA-1 were -published until more than 10 years later, in 2017. 
- -Git v2.13.0 and later subsequently moved to a hardened SHA-1 -implementation by default that mitigates the SHAttered attack, but -SHA-1 is still believed to be weak. - -The hash to replace this hardened SHA-1 should be stronger than SHA-1 -was: we would like it to be trustworthy and useful in practice for at -least 10 years. - -Some other relevant properties: - -1. A 256-bit hash (long enough to match common security practice; not - excessively long to hurt performance and disk usage). - -2. High quality implementations should be widely available (e.g., in - OpenSSL and Apple CommonCrypto). - -3. The hash function's properties should match Git's needs (e.g. Git - requires collision and 2nd preimage resistance and does not require - length extension resistance). - -4. As a tiebreaker, the hash should be fast to compute (fortunately - many contenders are faster than SHA-1). - -We choose SHA-256. - -Transition plan ---------------- -Some initial steps can be implemented independently of one another: -- adding a hash function API (vtable) -- teaching fsck to tolerate the gpgsig-sha256 field -- excluding gpgsig-* from the fields copied by "git commit --amend" -- annotating tests that depend on SHA-1 values with a SHA1 test - prerequisite -- using "struct object_id", GIT_MAX_RAWSZ, and GIT_MAX_HEXSZ - consistently instead of "unsigned char *" and the hardcoded - constants 20 and 40. -- introducing index v3 -- adding support for the PSRC field and safer object pruning - - -The first user-visible change is the introduction of the objectFormat -extension (without compatObjectFormat). 
This requires: -- teaching fsck about this mode of operation -- using the hash function API (vtable) when computing object names -- signing objects and verifying signatures -- rejecting attempts to fetch from or push to an incompatible - repository - -Next comes introduction of compatObjectFormat: -- implementing the loose-object-idx -- translating object names between object formats -- translating object content between object formats -- generating and verifying signatures in the compat format -- adding appropriate index entries when adding a new object to the - object store -- --output-format option -- ^{sha1} and ^{sha256} revision notation -- configuration to specify default input and output format (see - "Object names on the command line" above) - -The next step is supporting fetches and pushes to SHA-1 repositories: -- allow pushes to a repository using the compat format -- generate a topologically sorted list of the SHA-1 names of fetched - objects -- convert the fetched packfile to sha256 format and generate an idx - file -- re-sort to match the order of objects in the fetched packfile - -The infrastructure supporting fetch also allows converting an existing -repository. In converted repositories and new clones, end users can -gain support for the new hash function without any visible change in -behavior (see "dark launch" in the "Object names on the command line" -section). In particular this allows users to verify SHA-256 signatures -on objects in the repository, and it should ensure the transition code -is stable in production in preparation for using it more widely. - -Over time projects would encourage their users to adopt the "early -transition" and then "late transition" modes to take advantage of the -new, more futureproof SHA-256 object names. - -When objectFormat and compatObjectFormat are both set, commands -generating signatures would generate both SHA-1 and SHA-256 signatures -by default to support both new and old users. 
- -In projects using SHA-256 heavily, users could be encouraged to adopt -the "post-transition" mode to avoid accidentally making implicit use -of SHA-1 object names. - -Once a critical mass of users have upgraded to a version of Git that -can verify SHA-256 signatures and have converted their existing -repositories to support verifying them, we can add support for a -setting to generate only SHA-256 signatures. This is expected to be at -least a year later. - -That is also a good moment to advertise the ability to convert -repositories to use SHA-256 only, stripping out all SHA-1 related -metadata. This improves performance by eliminating translation -overhead and security by avoiding the possibility of accidentally -relying on the safety of SHA-1. - -Updating Git's protocols to allow a server to specify which hash -functions it supports is also an important part of this transition. It -is not discussed in detail in this document but this transition plan -assumes it happens. :) - -Alternatives considered ------------------------ -Upgrading everyone working on a particular project on a flag day -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Projects like the Linux kernel are large and complex enough that -flipping the switch for all projects based on the repository at once -is infeasible. - -Not only would all developers and server operators supporting -developers have to switch on the same flag day, but supporting tooling -(continuous integration, code review, bug trackers, etc) would have to -be adapted as well. This also makes it difficult to get early feedback -from some project participants testing before it is time for mass -adoption. - -Using hash functions in parallel -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -(e.g. https://lore.kernel.org/git/22708.8913.864049.452252@chiark.greenend.org.uk/ ) -Objects newly created would be addressed by the new hash, but inside -such an object (e.g. 
commit) it is still possible to address objects
-using the old hash function.
-* You cannot trust its history (needed for bisectability) in the
-  future without further work
-* Maintenance burden as the number of supported hash functions grows
-  (they will never go away, so they accumulate). In this proposal, by
-  comparison, converted objects lose all references to SHA-1.
-
-Signed objects with multiple hashes
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Instead of introducing the gpgsig-sha256 field in commit and tag objects
-for sha256-content based signatures, an earlier version of this design
-added "hash sha256 <sha256-name>" fields to strengthen the existing
-sha1-content based signatures.
-
-In other words, a single signature was used to attest to the object
-content using both hash functions. This had some advantages:
-* Using one signature instead of two speeds up the signing process.
-* Having one signed payload with both hashes allows the signer to
-  attest to the sha1-name and sha256-name referring to the same object.
-* All users consume the same signature. Broken signatures are likely
-  to be detected quickly using current versions of git.
-
-However, it also came with disadvantages:
-* Verifying a signed object requires access to the sha1-names of all
-  objects it references, even after the transition is complete and
-  the translation table is no longer needed for anything else. To
-  support this, the design added fields such as "hash sha1 tree <sha1-name>"
-  and "hash sha1 parent <sha1-name>" to the sha256-content of a signed
-  commit, complicating the conversion process.
-* Allowing signed objects without a sha1 (for after the transition is
-  complete) complicated the design further, requiring a "nohash sha1"
-  field to suppress including "hash sha1" fields in the sha256-content
-  and signed payload.
-
-Lazily populated translation table
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Some of the work of building the translation table could be deferred to
-push time, but that would significantly complicate and slow down pushes.
-Calculating the sha1-name at object creation time at the same time it is
-being streamed to disk and having its sha256-name calculated should be
-an acceptable cost.
-
-Document History
-----------------
-
-2017-03-03
-bmwill@google.com, jonathantanmy@google.com, jrnieder@gmail.com,
-sbeller@google.com
-
-Initial version sent to
-http://lore.kernel.org/git/20170304011251.GA26789@aiede.mtv.corp.google.com
-
-2017-03-03 jrnieder@gmail.com
-Incorporated suggestions from jonathantanmy and sbeller:
-* describe purpose of signed objects with each hash type
-* redefine signed object verification using object content under the
-  first hash function
-
-2017-03-06 jrnieder@gmail.com
-* Use SHA3-256 instead of SHA2 (thanks, Linus and brian m. carlson).[1][2]
-* Make sha3-based signatures a separate field, avoiding the need for
-  "hash" and "nohash" fields (thanks to peff[3]).
-* Add a sorting phase to fetch (thanks to Junio for noticing the need
-  for this).
-* Omit blobs from the topological sort during fetch (thanks to peff).
-* Discuss alternates, git notes, and git servers in the caveats
-  section (thanks to Junio Hamano, brian m. carlson[4], and Shawn
-  Pearce).
-* Clarify language throughout (thanks to various commenters,
-  especially Junio).
-
-2017-09-27 jrnieder@gmail.com, sbeller@google.com
-* use placeholder NewHash instead of SHA3-256
-* describe criteria for picking a hash function.
-* include a transition plan (thanks especially to Brandon Williams
-  for fleshing these ideas out)
-* define the translation table (thanks, Shawn Pearce[5], Jonathan
-  Tan, and Masaya Suzuki)
-* avoid loose object overhead by packing more aggressively in
-  "git gc --auto"
-
-Later history:
-
- See the history of this file in git.git for the history of subsequent
- edits. This document history is no longer being maintained as it
- would now be superfluous to the commit log.
-
-[1] http://lore.kernel.org/git/CA+55aFzJtejiCjV0e43+9oR3QuJK2PiFiLQemytoLpyJWe6P9w@mail.gmail.com/
-[2] http://lore.kernel.org/git/CA+55aFz+gkAsDZ24zmePQuEs1XPS9BP_s8O7Q4wQ7LV7X5-oDA@mail.gmail.com/
-[3] http://lore.kernel.org/git/20170306084353.nrns455dvkdsfgo5@sigill.intra.peff.net/
-[4] http://lore.kernel.org/git/20170304224936.rqqtkdvfjgyezsht@genre.crustytoothpaste.net
-[5] https://lore.kernel.org/git/CAJo=hJtoX9=AyLHHpUJS7fueV9ciZ_MNpnEPHUz8Whui6g9F0A@mail.gmail.com/