about summary refs log tree commit diff
path: root/third_party/git/Documentation/technical/partial-clone.txt
diff options
context:
space:
mode:
Diffstat (limited to 'third_party/git/Documentation/technical/partial-clone.txt')
-rw-r--r--third_party/git/Documentation/technical/partial-clone.txt324
1 files changed, 0 insertions, 324 deletions
diff --git a/third_party/git/Documentation/technical/partial-clone.txt b/third_party/git/Documentation/technical/partial-clone.txt
deleted file mode 100644
index 896c7b3878..0000000000
--- a/third_party/git/Documentation/technical/partial-clone.txt
+++ /dev/null
@@ -1,324 +0,0 @@
-Partial Clone Design Notes
-==========================
-
-The "Partial Clone" feature is a performance optimization for Git that
-allows Git to function without having a complete copy of the repository.
-The goal of this work is to allow Git better handle extremely large
-repositories.
-
-During clone and fetch operations, Git downloads the complete contents
-and history of the repository.  This includes all commits, trees, and
-blobs for the complete life of the repository.  For extremely large
-repositories, clones can take hours (or days) and consume 100+GiB of disk
-space.
-
-Often in these repositories there are many blobs and trees that the user
-does not need such as:
-
-  1. files outside of the user's work area in the tree.  For example, in
-     a repository with 500K directories and 3.5M files in every commit,
-     we can avoid downloading many objects if the user only needs a
-     narrow "cone" of the source tree.
-
-  2. large binary assets.  For example, in a repository where large build
-     artifacts are checked into the tree, we can avoid downloading all
-     previous versions of these non-mergeable binary assets and only
-     download versions that are actually referenced.
-
-Partial clone allows us to avoid downloading such unneeded objects *in
-advance* during clone and fetch operations and thereby reduce download
-times and disk usage.  Missing objects can later be "demand fetched"
-if/when needed.
-
-Use of partial clone requires that the user be online and the origin
-remote be available for on-demand fetching of missing objects.  This may
-or may not be problematic for the user.  For example, if the user can
-stay within the pre-selected subset of the source tree, they may not
-encounter any missing objects.  Alternatively, the user could try to
-pre-fetch various objects if they know that they are going offline.
-
-
-Non-Goals
----------
-
-Partial clone is a mechanism to limit the number of blobs and trees downloaded
-*within* a given range of commits -- and is therefore independent of and not
-intended to conflict with existing DAG-level mechanisms to limit the set of
-requested commits (i.e. shallow clone, single branch, or fetch '<refspec>').
-
-
-Design Overview
----------------
-
-Partial clone logically consists of the following parts:
-
-- A mechanism for the client to describe unneeded or unwanted objects to
-  the server.
-
-- A mechanism for the server to omit such unwanted objects from packfiles
-  sent to the client.
-
-- A mechanism for the client to gracefully handle missing objects (that
-  were previously omitted by the server).
-
-- A mechanism for the client to backfill missing objects as needed.
-
-
-Design Details
---------------
-
-- A new pack-protocol capability "filter" is added to the fetch-pack and
-  upload-pack negotiation.
-+
-This uses the existing capability discovery mechanism.
-See "filter" in Documentation/technical/pack-protocol.txt.
-
-- Clients pass a "filter-spec" to clone and fetch which is passed to the
-  server to request filtering during packfile construction.
-+
-There are various filters available to accommodate different situations.
-See "--filter=<filter-spec>" in Documentation/rev-list-options.txt.
-
-- On the server pack-objects applies the requested filter-spec as it
-  creates "filtered" packfiles for the client.
-+
-These filtered packfiles are *incomplete* in the traditional sense because
-they may contain objects that reference objects not contained in the
-packfile and that the client doesn't already have.  For example, the
-filtered packfile may contain trees or tags that reference missing blobs
-or commits that reference missing trees.
-
-- On the client these incomplete packfiles are marked as "promisor packfiles"
-  and treated differently by various commands.
-
-- On the client a repository extension is added to the local config to
-  prevent older versions of git from failing mid-operation because of
-  missing objects that they cannot handle.
-  See "extensions.partialClone" in Documentation/technical/repository-version.txt"
-
-
-Handling Missing Objects
-------------------------
-
-- An object may be missing due to a partial clone or fetch, or missing due
-  to repository corruption.  To differentiate these cases, the local
-  repository specially indicates such filtered packfiles obtained from the
-  promisor remote as "promisor packfiles".
-+
-These promisor packfiles consist of a "<name>.promisor" file with
-arbitrary contents (like the "<name>.keep" files), in addition to
-their "<name>.pack" and "<name>.idx" files.
-
-- The local repository considers a "promisor object" to be an object that
-  it knows (to the best of its ability) that the promisor remote has promised
-  that it has, either because the local repository has that object in one of
-  its promisor packfiles, or because another promisor object refers to it.
-+
-When Git encounters a missing object, Git can see if it is a promisor object
-and handle it appropriately.  If not, Git can report a corruption.
-+
-This means that there is no need for the client to explicitly maintain an
-expensive-to-modify list of missing objects.[a]
-
-- Since almost all Git code currently expects any referenced object to be
-  present locally and because we do not want to force every command to do
-  a dry-run first, a fallback mechanism is added to allow Git to attempt
-  to dynamically fetch missing objects from the promisor remote.
-+
-When the normal object lookup fails to find an object, Git invokes
-fetch-object to try to get the object from the server and then retry
-the object lookup.  This allows objects to be "faulted in" without
-complicated prediction algorithms.
-+
-For efficiency reasons, no check as to whether the missing object is
-actually a promisor object is performed.
-+
-Dynamic object fetching tends to be slow as objects are fetched one at
-a time.
-
-- `checkout` (and any other command using `unpack-trees`) has been taught
-  to bulk pre-fetch all required missing blobs in a single batch.
-
-- `rev-list` has been taught to print missing objects.
-+
-This can be used by other commands to bulk prefetch objects.
-For example, a "git log -p A..B" may internally want to first do
-something like "git rev-list --objects --quiet --missing=print A..B"
-and prefetch those objects in bulk.
-
-- `fsck` has been updated to be fully aware of promisor objects.
-
-- `repack` in GC has been updated to not touch promisor packfiles at all,
-  and to only repack other objects.
-
-- The global variable "fetch_if_missing" is used to control whether an
-  object lookup will attempt to dynamically fetch a missing object or
-  report an error.
-+
-We are not happy with this global variable and would like to remove it,
-but that requires significant refactoring of the object code to pass an
-additional flag.  We hope that concurrent efforts to add an ODB API can
-encompass this.
-
-
-Fetching Missing Objects
-------------------------
-
-- Fetching of objects is done using the existing transport mechanism using
-  transport_fetch_refs(), setting a new transport option
-  TRANS_OPT_NO_DEPENDENTS to indicate that only the objects themselves are
-  desired, not any object that they refer to.
-+
-Because some transports invoke fetch_pack() in the same process, fetch_pack()
-has been updated to not use any object flags when the corresponding argument
-(no_dependents) is set.
-
-- The local repository sends a request with the hashes of all requested
-  objects as "want" lines, and does not perform any packfile negotiation.
-  It then receives a packfile.
-
-- Because we are reusing the existing fetch-pack mechanism, fetching
-  currently fetches all objects referred to by the requested objects, even
-  though they are not necessary.
-
-
-Current Limitations
--------------------
-
-- The remote used for a partial clone (or the first partial fetch
-  following a regular clone) is marked as the "promisor remote".
-+
-We are currently limited to a single promisor remote and only that
-remote may be used for subsequent partial fetches.
-+
-We accept this limitation because we believe initial users of this
-feature will be using it on repositories with a strong single central
-server.
-
-- Dynamic object fetching will only ask the promisor remote for missing
-  objects.  We assume that the promisor remote has a complete view of the
-  repository and can satisfy all such requests.
-
-- Repack essentially treats promisor and non-promisor packfiles as 2
-  distinct partitions and does not mix them.  Repack currently only works
-  on non-promisor packfiles and loose objects.
-
-- Dynamic object fetching invokes fetch-pack once *for each item*
-  because most algorithms stumble upon a missing object and need to have
-  it resolved before continuing their work.  This may incur significant
-  overhead -- and multiple authentication requests -- if many objects are
-  needed.
-
-- Dynamic object fetching currently uses the existing pack protocol V0
-  which means that each object is requested via fetch-pack.  The server
-  will send a full set of info/refs when the connection is established.
-  If there are large number of refs, this may incur significant overhead.
-
-
-Future Work
------------
-
-- Allow more than one promisor remote and define a strategy for fetching
-  missing objects from specific promisor remotes or of iterating over the
-  set of promisor remotes until a missing object is found.
-+
-A user might want to have multiple geographically-close cache servers
-for fetching missing blobs while continuing to do filtered `git-fetch`
-commands from the central server, for example.
-+
-Or the user might want to work in a triangular work flow with multiple
-promisor remotes that each have an incomplete view of the repository.
-
-- Allow repack to work on promisor packfiles (while keeping them distinct
-  from non-promisor packfiles).
-
-- Allow non-pathname-based filters to make use of packfile bitmaps (when
-  present).  This was just an omission during the initial implementation.
-
-- Investigate use of a long-running process to dynamically fetch a series
-  of objects, such as proposed in [5,6] to reduce process startup and
-  overhead costs.
-+
-It would be nice if pack protocol V2 could allow that long-running
-process to make a series of requests over a single long-running
-connection.
-
-- Investigate pack protocol V2 to avoid the info/refs broadcast on
-  each connection with the server to dynamically fetch missing objects.
-
-- Investigate the need to handle loose promisor objects.
-+
-Objects in promisor packfiles are allowed to reference missing objects
-that can be dynamically fetched from the server.  An assumption was
-made that loose objects are only created locally and therefore should
-not reference a missing object.  We may need to revisit that assumption
-if, for example, we dynamically fetch a missing tree and store it as a
-loose object rather than a single object packfile.
-+
-This does not necessarily mean we need to mark loose objects as promisor;
-it may be sufficient to relax the object lookup or is-promisor functions.
-
-
-Non-Tasks
----------
-
-- Every time the subject of "demand loading blobs" comes up it seems
-  that someone suggests that the server be allowed to "guess" and send
-  additional objects that may be related to the requested objects.
-+
-No work has gone into actually doing that; we're just documenting that
-it is a common suggestion.  We're not sure how it would work and have
-no plans to work on it.
-+
-It is valid for the server to send more objects than requested (even
-for a dynamic object fetch), but we are not building on that.
-
-
-Footnotes
----------
-
-[a] expensive-to-modify list of missing objects:  Earlier in the design of
-    partial clone we discussed the need for a single list of missing objects.
-    This would essentially be a sorted linear list of OIDs that the were
-    omitted by the server during a clone or subsequent fetches.
-
-This file would need to be loaded into memory on every object lookup.
-It would need to be read, updated, and re-written (like the .git/index)
-on every explicit "git fetch" command *and* on any dynamic object fetch.
-
-The cost to read, update, and write this file could add significant
-overhead to every command if there are many missing objects.  For example,
-if there are 100M missing blobs, this file would be at least 2GiB on disk.
-
-With the "promisor" concept, we *infer* a missing object based upon the
-type of packfile that references it.
-
-
-Related Links
--------------
-[0] https://crbug.com/git/2
-    Bug#2: Partial Clone
-
-[1] https://public-inbox.org/git/20170113155253.1644-1-benpeart@microsoft.com/ +
-    Subject: [RFC] Add support for downloading blobs on demand +
-    Date: Fri, 13 Jan 2017 10:52:53 -0500
-
-[2] https://public-inbox.org/git/cover.1506714999.git.jonathantanmy@google.com/ +
-    Subject: [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches) +
-    Date: Fri, 29 Sep 2017 13:11:36 -0700
-
-[3] https://public-inbox.org/git/20170426221346.25337-1-jonathantanmy@google.com/ +
-    Subject: Proposal for missing blob support in Git repos +
-    Date: Wed, 26 Apr 2017 15:13:46 -0700
-
-[4] https://public-inbox.org/git/1488999039-37631-1-git-send-email-git@jeffhostetler.com/ +
-    Subject: [PATCH 00/10] RFC Partial Clone and Fetch +
-    Date: Wed,  8 Mar 2017 18:50:29 +0000
-
-[5] https://public-inbox.org/git/20170505152802.6724-1-benpeart@microsoft.com/ +
-    Subject: [PATCH v7 00/10] refactor the filter process code into a reusable module +
-    Date: Fri,  5 May 2017 11:27:52 -0400
-
-[6] https://public-inbox.org/git/20170714132651.170708-1-benpeart@microsoft.com/ +
-    Subject: [RFC/PATCH v2 0/1] Add support for downloading blobs on demand +
-    Date: Fri, 14 Jul 2017 09:26:50 -0400