about summary refs log tree commit diff
path: root/tvix/castore/docs/why-not-git-trees.md
blob: fd46252cf55c1d33fc71898990b331053301225e (plain) (blame)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
## Why not git tree objects?

We've been experimenting with (some variations of) the git tree and object
format, and ultimately decided against using it as an internal format, and
instead adapted the one documented in the other documents here.

While the tvix-store API protocol shares some similarities with the format used
in git for trees and objects, the git one has shown some significant
disadvantages:

### The binary encoding itself

#### trees
The git tree object format is a very binary, error-prone and
"made-to-be-read-and-written-from-C" format.

Tree objects are a combination of null-terminated strings, and fields of known
length. References to other tree objects use the literal sha1 hash of another
tree object in this encoding.
Extensions of the format/changes are very hard to do right, because parsers are
not aware they might be parsing something different.

The tvix-store protocol uses a canonical protobuf serialization, and uses
the [blake3][blake3] hash of that serialization to point to other `Directory`
messages.
It's both compact and with a wide range of libraries for encoders and decoders
in many programming languages.
The choice of protobuf makes it easy to add new fields, and make old clients
aware of some unknown fields being detected [^adding-fields].

#### blob
On disk, git blob objects start with a "blob" prefix, then the size of the
payload, and then the data itself. The hash of a blob is the literal sha1sum
over all of this - which makes it something very git specific to request for.

tvix-store simply uses the [blake3][blake3] hash of the literal contents
when referring to a file/blob, which makes it very easy to ask other data
sources for the same data, as no git-specific payload is included in the hash.
This also plays very well together with things like [iroh][iroh-discussion],
which plans to provide a way to substitute (large)blobs by their blake3 hash
over the IPFS network.

In addition to that, [blake3][blake3] makes it possible to do
[verified streaming][bao], as already described in other parts of the
documentation.

The git tree object format uses sha1 both for references to other trees and
hashes of blobs, which isn't really a hash function to fundamentally base
everything on in 2023.
The [migration to sha256][git-sha256] also has been dead for some years now,
and it's unclear how a "blake3" version of this would even look like.

[bao]: https://github.com/oconnor663/bao
[blake3]: https://github.com/BLAKE3-team/BLAKE3
[git-sha256]: https://git-scm.com/docs/hash-function-transition/
[iroh-discussion]: https://github.com/n0-computer/iroh/discussions/707#discussioncomment-5070197
[^adding-fields]: Obviously, adding new fields will change hashes, but it's something that's easy to detect.