diff options
Diffstat (limited to 'third_party/git/Documentation/technical/racy-git.txt')
-rw-r--r-- | third_party/git/Documentation/technical/racy-git.txt | 201 |
1 files changed, 0 insertions, 201 deletions
diff --git a/third_party/git/Documentation/technical/racy-git.txt b/third_party/git/Documentation/technical/racy-git.txt deleted file mode 100644 index ceda4bbfda4d..000000000000 --- a/third_party/git/Documentation/technical/racy-git.txt +++ /dev/null @@ -1,201 +0,0 @@ -Use of index and Racy Git problem -================================= - -Background ----------- - -The index is one of the most important data structures in Git. -It represents a virtual working tree state by recording list of -paths and their object names and serves as a staging area to -write out the next tree object to be committed. The state is -"virtual" in the sense that it does not necessarily have to, and -often does not, match the files in the working tree. - -There are cases Git needs to examine the differences between the -virtual working tree state in the index and the files in the -working tree. The most obvious case is when the user asks `git -diff` (or its low level implementation, `git diff-files`) or -`git-ls-files --modified`. In addition, Git internally checks -if the files in the working tree are different from what are -recorded in the index to avoid stomping on local changes in them -during patch application, switching branches, and merging. - -In order to speed up this comparison between the files in the -working tree and the index entries, the index entries record the -information obtained from the filesystem via `lstat(2)` system -call when they were last updated. When checking if they differ, -Git first runs `lstat(2)` on the files and compares the result -with this information (this is what was originally done by the -`ce_match_stat()` function, but the current code does it in -`ce_match_stat_basic()` function). If some of these "cached -stat information" fields do not match, Git can tell that the -files are modified without even looking at their contents. - -Note: not all members in `struct stat` obtained via `lstat(2)` -are used for this comparison. For example, `st_atime` obviously -is not useful. Currently, Git compares the file type (regular -files vs symbolic links) and executable bits (only for regular -files) from `st_mode` member, `st_mtime` and `st_ctime` -timestamps, `st_uid`, `st_gid`, `st_ino`, and `st_size` members. -With a `USE_STDEV` compile-time option, `st_dev` is also -compared, but this is not enabled by default because this member -is not stable on network filesystems. With `USE_NSEC` -compile-time option, `st_mtim.tv_nsec` and `st_ctim.tv_nsec` -members are also compared. On Linux, this is not enabled by default -because in-core timestamps can have finer granularity than -on-disk timestamps, resulting in meaningless changes when an -inode is evicted from the inode cache. See commit 8ce13b0 -of git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git -([PATCH] Sync in core time granularity with filesystems, -2005-01-04). This patch is included in kernel 2.6.11 and newer, but -only fixes the issue for file systems with exactly 1 ns or 1 s -resolution. Other file systems are still broken in current Linux -kernels (e.g. CEPH, CIFS, NTFS, UDF), see -https://lore.kernel.org/lkml/5577240D.7020309@gmail.com/ - -Racy Git --------- - -There is one slight problem with the optimization based on the -cached stat information. Consider this sequence: - - : modify 'foo' - $ git update-index 'foo' - : modify 'foo' again, in-place, without changing its size - -The first `update-index` computes the object name of the -contents of file `foo` and updates the index entry for `foo` -along with the `struct stat` information. If the modification -that follows it happens very fast so that the file's `st_mtime` -timestamp does not change, after this sequence, the cached stat -information the index entry records still exactly match what you -would see in the filesystem, even though the file `foo` is now -different. -This way, Git can incorrectly think files in the working tree -are unmodified even though they actually are. This is called -the "racy Git" problem (discovered by Pasky), and the entries -that appear clean when they may not be because of this problem -are called "racily clean". - -To avoid this problem, Git does two things: - -. When the cached stat information says the file has not been - modified, and the `st_mtime` is the same as (or newer than) - the timestamp of the index file itself (which is the time `git - update-index foo` finished running in the above example), it - also compares the contents with the object registered in the - index entry to make sure they match. - -. When the index file is updated that contains racily clean - entries, cached `st_size` information is truncated to zero - before writing a new version of the index file. - -Because the index file itself is written after collecting all -the stat information from updated paths, `st_mtime` timestamp of -it is usually the same as or newer than any of the paths the -index contains. And no matter how quick the modification that -follows `git update-index foo` finishes, the resulting -`st_mtime` timestamp on `foo` cannot get a value earlier -than the index file. Therefore, index entries that can be -racily clean are limited to the ones that have the same -timestamp as the index file itself. - -The callers that want to check if an index entry matches the -corresponding file in the working tree continue to call -`ce_match_stat()`, but with this change, `ce_match_stat()` uses -`ce_modified_check_fs()` to see if racily clean ones are -actually clean after comparing the cached stat information using -`ce_match_stat_basic()`. - -The problem the latter solves is this sequence: - - $ git update-index 'foo' - : modify 'foo' in-place without changing its size - : wait for enough time - $ git update-index 'bar' - -Without the latter, the timestamp of the index file gets a newer -value, and falsely clean entry `foo` would not be caught by the -timestamp comparison check done with the former logic anymore. -The latter makes sure that the cached stat information for `foo` -would never match with the file in the working tree, so later -checks by `ce_match_stat_basic()` would report that the index entry -does not match the file and Git does not have to fall back on more -expensive `ce_modified_check_fs()`. - - -Runtime penalty ---------------- - -The runtime penalty of falling back to `ce_modified_check_fs()` -from `ce_match_stat()` can be very expensive when there are many -racily clean entries. An obvious way to artificially create -this situation is to give the same timestamp to all the files in -the working tree in a large project, run `git update-index` on -them, and give the same timestamp to the index file: - - $ date >.datestamp - $ git ls-files | xargs touch -r .datestamp - $ git ls-files | git update-index --stdin - $ touch -r .datestamp .git/index - -This will make all index entries racily clean. The linux project, for -example, there are over 20,000 files in the working tree. On my -Athlon 64 X2 3800+, after the above: - - $ /usr/bin/time git diff-files - 1.68user 0.54system 0:02.22elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k - 0inputs+0outputs (0major+67111minor)pagefaults 0swaps - $ git update-index MAINTAINERS - $ /usr/bin/time git diff-files - 0.02user 0.12system 0:00.14elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k - 0inputs+0outputs (0major+935minor)pagefaults 0swaps - -Running `git update-index` in the middle checked the racily -clean entries, and left the cached `st_mtime` for all the paths -intact because they were actually clean (so this step took about -the same amount of time as the first `git diff-files`). After -that, they are not racily clean anymore but are truly clean, so -the second invocation of `git diff-files` fully took advantage -of the cached stat information. - - -Avoiding runtime penalty ------------------------- - -In order to avoid the above runtime penalty, post 1.4.2 Git used -to have a code that made sure the index file -got timestamp newer than the youngest files in the index when -there are many young files with the same timestamp as the -resulting index file would otherwise would have by waiting -before finishing writing the index file out. - -I suspected that in practice the situation where many paths in the -index are all racily clean was quite rare. The only code paths -that can record recent timestamp for large number of paths are: - -. Initial `git add .` of a large project. - -. `git checkout` of a large project from an empty index into an - unpopulated working tree. - -Note: switching branches with `git checkout` keeps the cached -stat information of existing working tree files that are the -same between the current branch and the new branch, which are -all older than the resulting index file, and they will not -become racily clean. Only the files that are actually checked -out can become racily clean. - -In a large project where raciness avoidance cost really matters, -however, the initial computation of all object names in the -index takes more than one second, and the index file is written -out after all that happens. Therefore the timestamp of the -index file will be more than one seconds later than the -youngest file in the working tree. This means that in these -cases there actually will not be any racily clean entry in -the resulting index. - -Based on this discussion, the current code does not use the -"workaround" to avoid the runtime penalty that does not exist in -practice anymore. This was done with commit 0fc82cff on Aug 15, -2006. |