From 6019c75deb7868f43b9ad8449fe2d00532b07a54 Mon Sep 17 00:00:00 2001 From: Florian Klink Date: Sun, 5 Mar 2023 14:14:38 +0100 Subject: docs(tvix/store): add document describing why we don't use git trees This came up recently again, it makes sense to document the reasoning behind the decision. Change-Id: Ic51d5bc7998c70e8b070b6f42877d8e88613935b Reviewed-on: https://cl.tvl.fyi/c/depot/+/8223 Reviewed-by: raitobezarius Tested-by: BuildkiteCI Autosubmit: flokli --- tvix/store/docs/why-not-git-trees.md | 57 ++++++++++++++++++++++++++++++++++++ 1 file changed, 57 insertions(+) create mode 100644 tvix/store/docs/why-not-git-trees.md diff --git a/tvix/store/docs/why-not-git-trees.md b/tvix/store/docs/why-not-git-trees.md new file mode 100644 index 000000000000..fd46252cf55c --- /dev/null +++ b/tvix/store/docs/why-not-git-trees.md @@ -0,0 +1,57 @@ +## Why not git tree objects? + +We've been experimenting with (some variations of) the git tree and object +format, and ultimately decided against using it as an internal format, and +instead adapted the one documented in the other documents here. + +While the tvix-store API protocol shares some similarities with the format used +in git for trees and objects, the git one has shown some significant +disadvantages: + +### The binary encoding itself + +#### trees +The git tree object format is a very binary, error-prone and +"made-to-be-read-and-written-from-C" format. + +Tree objects are a combination of null-terminated strings, and fields of known +length. References to other tree objects use the literal sha1 hash of another +tree object in this encoding. +Extensions of the format/changes are very hard to do right, because parsers are +not aware they might be parsing something different. + +The tvix-store protocol uses a canonical protobuf serialization, and uses +the [blake3][blake3] hash of that serialization to point to other `Directory` +messages. +It's both compact and with a wide range of libraries for encoders and decoders +in many programming languages. +The choice of protobuf makes it easy to add new fields, and make old clients +aware of some unknown fields being detected [^adding-fields]. + +#### blob +On disk, git blob objects start with a "blob" prefix, then the size of the +payload, and then the data itself. The hash of a blob is the literal sha1sum +over all of this - which makes it something very git specific to request for. + +tvix-store simply uses the [blake3][blake3] hash of the literal contents +when referring to a file/blob, which makes it very easy to ask other data +sources for the same data, as no git-specific payload is included in the hash. +This also plays very well together with things like [iroh][iroh-discussion], +which plans to provide a way to substitute (large)blobs by their blake3 hash +over the IPFS network. + +In addition to that, [blake3][blake3] makes it possible to do +[verified streaming][bao], as already described in other parts of the +documentation. + +The git tree object format uses sha1 both for references to other trees and +hashes of blobs, which isn't really a hash function to fundamentally base +everything on in 2023. +The [migration to sha256][git-sha256] also has been dead for some years now, +and it's unclear how a "blake3" version of this would even look like. + +[bao]: https://github.com/oconnor663/bao +[blake3]: https://github.com/BLAKE3-team/BLAKE3 +[git-sha256]: https://git-scm.com/docs/hash-function-transition/ +[iroh-discussion]: https://github.com/n0-computer/iroh/discussions/707#discussioncomment-5070197 +[^adding-fields]: Obviously, adding new fields will change hashes, but it's something that's easy to detect. \ No newline at end of file -- cgit 1.4.1