about summary refs log tree commit diff
path: root/tvix/castore/docs/why-not-git-trees.md
diff options
context:
space:
mode:
Diffstat (limited to 'tvix/castore/docs/why-not-git-trees.md')
-rw-r--r--tvix/castore/docs/why-not-git-trees.md57
1 files changed, 57 insertions, 0 deletions
diff --git a/tvix/castore/docs/why-not-git-trees.md b/tvix/castore/docs/why-not-git-trees.md
new file mode 100644
index 000000000000..4a12b4ef5554
--- /dev/null
+++ b/tvix/castore/docs/why-not-git-trees.md
@@ -0,0 +1,57 @@
+## Why not git tree objects?
+
+We've been experimenting with (some variations of) the git tree and object
+format, and ultimately decided against using it as an internal format, and
+instead adapted the one documented in the other documents here.
+
+While the tvix-store API protocol shares some similarities with the format used
+in git for trees and objects, the git one has shown some significant
+disadvantages:
+
+### The binary encoding itself
+
+#### trees
+The git tree object format is a very binary, error-prone and
+"made-to-be-read-and-written-from-C" format.
+
+Tree objects are a combination of null-terminated strings, and fields of known
+length. References to other tree objects use the literal sha1 hash of another
+tree object in this encoding.
+Extensions of the format/changes are very hard to do right, because parsers are
+not aware they might be parsing something different.
+
+The tvix-store protocol uses a canonical protobuf serialization, and uses
+the [blake3][blake3] hash of that serialization to point to other `Directory`
+messages.
+It's both compact and with a wide range of libraries for encoders and decoders
+in many programming languages.
+The choice of protobuf makes it easy to add new fields, and make old clients
+aware of some unknown fields being detected [^adding-fields].
+
+#### blob
+On disk, git blob objects start with a "blob" prefix, then the size of the
+payload, and then the data itself. The hash of a blob is the literal sha1sum
+over all of this - which makes it something very git specific to request for.
+
+tvix-store simply uses the [blake3][blake3] hash of the literal contents
+when referring to a file/blob, which makes it very easy to ask other data
+sources for the same data, as no git-specific payload is included in the hash.
+This also plays very well together with things like [iroh][iroh-discussion],
+which plans to provide a way to substitute (large)blobs by their blake3 hash
+over the IPFS network.
+
+In addition to that, [blake3][blake3] makes it possible to do
+[verified streaming][bao], as already described in other parts of the
+documentation.
+
+The git tree object format uses sha1 both for references to other trees and
+hashes of blobs, which isn't really a hash function to fundamentally base
+everything on in 2023.
+The [migration to sha256][git-sha256] also has been dead for some years now,
+and it's unclear what a "blake3" version of this would even look like.
+
+[bao]: https://github.com/oconnor663/bao
+[blake3]: https://github.com/BLAKE3-team/BLAKE3
+[git-sha256]: https://git-scm.com/docs/hash-function-transition/
+[iroh-discussion]: https://github.com/n0-computer/iroh/discussions/707#discussioncomment-5070197
+[^adding-fields]: Obviously, adding new fields will change hashes, but it's something that's easy to detect.
\ No newline at end of file