# BlobStore: Chunking & Verified Streaming

`tvix-castore`'s BlobStore is a content-addressed storage system, using [blake3]
as its hash function.

Data is fetched by using the digest as the lookup key, and can be verified to be
correct by feeding the received data through the hash function and ensuring it
matches the digest initially used for the lookup.

This means data can be downloaded from any untrusted third party as well, as the
received data is validated to match the digest it was originally requested with.
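
As a rough illustration, verifying a whole blob fetched from an untrusted peer
boils down to re-hashing and comparing. A minimal sketch using the [blake3]
Rust crate; the `fetch` closure is a hypothetical stand-in for the actual
transport:

```rust
/// Minimal sketch: fetch a blob from an untrusted source and only return it
/// if it matches the digest we asked for.
fn fetch_verified(
    digest: &blake3::Hash,
    fetch: impl Fn() -> Vec<u8>, // hypothetical transport
) -> Option<Vec<u8>> {
    let data = fetch(); // bytes received from an untrusted peer
    // Re-hash the received data and compare against the requested digest.
    if blake3::hash(&data) == *digest {
        Some(data)
    } else {
        None // the peer sent wrong data
    }
}
```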

However, for larger blobs of data, having to download the entire blob at once is
wasteful if we only care about a part of it. Think about mounting a seekable
data structure, like loop-mounting an .iso file, or doing partial reads in a
large Parquet file, a column-oriented data format.

> We want to have the possibility to *seek* into a larger file.

This however shouldn't compromise on data integrity properties - we should not
need to trust a peer we're downloading from to be "honest" about the partial
data we're reading. We should be able to verify smaller reads.

Especially when substituting from an untrusted third party, we want to be able
to detect quickly if that third party is sending us wrong data, and terminate
the connection early.

## Chunking
In content-addressed systems, this problem has historically been solved by
breaking larger blobs into smaller chunks, which can be fetched individually,
and making a hash of *this listing* the blob digest/identifier.

 - BitTorrent, for example, breaks files up into smaller chunks, and maintains
   a list of SHA-1 digests for each of these chunks. Magnet links contain a
   digest over this listing as an identifier. (See [here for more
   details][bittorrent-v2].)
   With the identifier, a client can fetch the entire list, and then recursively
   "unpack the graph" of nodes, until it ends up with a list of individual small
   chunks, which can be fetched individually.
 - Similarly, IPFS with its IPLD model builds up a Merkle DAG, and uses the
   digest of the root node as an identifier.

These approaches solve the problem of being able to fetch smaller chunks in a
trusted fashion. They can also do some deduplication, in case the same leaf
nodes occur in multiple places.

However, they also have a big disadvantage: the chunking parameters and the
"topology" of the graph structure "bleed" into the root hash of the entire data
structure.

Depending on the chunking parameters used, there are different representations
for the same data, causing less data sharing/reuse in the overall system, both
in terms of how many chunks need to be downloaded vs. are already available
locally, and in how compactly data is stored on-disk.

This can be worked around by agreeing on only a single way of chunking, but it's
not pretty and misses a lot of deduplication potential.

### Chunking in Tvix' Blobstore
tvix-castore's BlobStore uses a hybrid approach to eliminate some of these
disadvantages, while still being content-addressed internally, with the
benefits highlighted above.

It uses [blake3] as hash function, and the blake3 digest of **the raw data
itself** as an identifier (rather than some application-specific Merkle DAG that
also embeds some chunking information).

BLAKE3 is a tree hash where all left nodes are fully populated, contrary to
conventional serial hash functions. To be able to validate the hash of a node,
one only needs the hashes of its (two) children [^1], if any.

This means one only needs the root digest to validate a construction, and these
constructions can be sent [separately][bao-spec].
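
To get a feel for what such a separately-sent construction enables, here is a
sketch using the `bao` crate (the reference implementation of the [bao-spec]);
tvix-castore's actual wire format differs, so treat this purely as an
illustration of the principle:

```rust
use std::io::{Cursor, Read};

fn main() -> std::io::Result<()> {
    // Encode a 14 KiB blob; `encoded` interleaves the data with the hash
    // tree (the "construction"), `hash` is the plain BLAKE3 digest.
    let input = vec![0x42u8; 14 * 1024];
    let (encoded, hash) = bao::encode::encode(&input);

    // Extract only the bytes and tree nodes needed for a 4 KiB read at
    // offset 4096 - this is what a peer would send us.
    let (start, len) = (4096u64, 4096u64);
    let mut extractor =
        bao::encode::SliceExtractor::new(Cursor::new(&encoded), start, len);
    let mut slice = Vec::new();
    extractor.read_to_end(&mut slice)?;

    // Decoding verifies each block against the root digest as it streams;
    // tampered data yields an error instead of wrong bytes.
    let mut decoder = bao::decode::SliceDecoder::new(&*slice, &hash, start, len);
    let mut out = Vec::new();
    decoder.read_to_end(&mut out)?;
    assert_eq!(&out[..], &input[4096..8192]);
    Ok(())
}
```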

This relieves us from having to encode more granular chunking into our data
model / identifier upfront, and makes it mostly a transport/storage concern.

For some more description of the (remote) protocol, check
`./blobstore-protocol.md`.

#### Logical vs. physical chunking

Due to the properties of the BLAKE3 hash function, we have logical blocks of
1 KiB, but this doesn't necessarily imply we need to restrict ourselves to these
chunk sizes w.r.t. what "physical chunks" are sent over the wire between peers,
or are stored on-disk.

The only thing we need to be able to read and verify an arbitrary byte range is
the covering range of aligned 1 KiB blocks, and a construction from the root
digest down to those blocks.

Note the intermediate hash tree can be further trimmed, [omitting][bao-tree]
lower parts of the tree while still providing verified streaming - at the cost
of having to fetch larger covering ranges of aligned blocks.
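
Computing the covering range is plain alignment arithmetic. A hypothetical
helper, where `bao_shift` is the number of omitted tree layers:

```rust
/// Hypothetical helper: round a byte range outwards to the covering range of
/// aligned logical blocks. With `bao_shift` layers of the hash tree omitted,
/// the logical block size is 1 KiB << bao_shift.
fn covering_range(start: u64, end: u64, bao_shift: u32) -> (u64, u64) {
    let block_size = 1024u64 << bao_shift;
    let first = (start / block_size) * block_size;    // round down
    let last = end.div_ceil(block_size) * block_size; // round up (exclusive)
    (first, last)
}
```

For example, a read of bytes `0..5120` (KiB blocks `0..=4`) with a `bao_shift`
of `2` yields `(0, 8192)`: the logical chunks `[0..=3]` and `[4..=7]` of the
worked example below.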

Let's pick an example. We identify each KiB by a number here for illustration
purposes.

Assuming we omit the last two layers of the hash tree, we end up with logical
4 KiB leaf chunks (a `bao_shift` of `2`).

For a blob of 14 KiB total size, we could fetch logical blocks `[0..=3]`,
`[4..=7]`, `[8..=11]` and `[12..=13]` in an authenticated fashion:

`[ 0 1 2 3 ] [ 4 5 6 7 ] [ 8 9 10 11 ] [ 12 13 ]`

Assuming the server now informs us about the following physical chunking:

```
[ 0 1 ] [ 2 3 4 5 ] [ 6 ] [ 7 8 ] [ 9 10 11 12 13 ]
```

If our application now wants to arbitrarily read from 0 until 4 (inclusive):

```
[ 0 1 ] [ 2 3 4 5 ] [ 6 ] [ 7 8 ] [ 9 10 11 12 13 ]
 |-------------|
```

…we need to fetch the physical chunks `[ 0 1 ]`, `[ 2 3 4 5 ]`, `[ 6 ]` and
`[ 7 8 ]`.

`[ 0 1 ]` and `[ 2 3 4 5 ]` are obvious: they contain the data we're
interested in.

However, we also need to fetch the physical chunks `[ 6 ]` and `[ 7 8 ]`, so we
can assemble `[ 4 5 6 7 ]` to verify both logical chunks:

```
[ 0 1 ] [ 2 3 4 5 ] [ 6 ] [ 7 8 ] [ 9 10 11 12 13 ]
^       ^           ^     ^
|----4KiB----|------4KiB-----|
```
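
Selecting those chunks is an overlap test between the covering logical range
and the advertised chunk boundaries. Again a hypothetical sketch, with all
offsets given in KiB:

```rust
/// Hypothetical helper: given the physical chunk sizes a server advertised
/// (in KiB), return the indices of all chunks overlapping the covering
/// logical range `[start, end)` (also in KiB).
fn chunks_for_range(chunk_sizes: &[u64], start: u64, end: u64) -> Vec<usize> {
    let mut selected = Vec::new();
    let mut offset = 0;
    for (i, &size) in chunk_sizes.iter().enumerate() {
        let chunk_end = offset + size;
        // Keep any chunk overlapping the half-open range [start, end).
        if offset < end && chunk_end > start {
            selected.push(i);
        }
        offset = chunk_end;
    }
    selected
}
```

For the example above, `chunks_for_range(&[2, 4, 1, 2, 5], 0, 8)` returns the
indices of `[ 0 1 ]`, `[ 2 3 4 5 ]`, `[ 6 ]` and `[ 7 8 ]`.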

Each physical chunk fetched can be validated to have the blake3 digest that was
communicated upfront, and can be stored in a client-side cache/storage, so
subsequent / other requests for the same data will be fast(er).

---

[^1]: and the surrounding context, aka position inside the whole blob, which is available while verifying the tree
[bittorrent-v2]: https://blog.libtorrent.org/2020/09/bittorrent-v2/
[blake3]: https://github.com/BLAKE3-team/BLAKE3
[bao-spec]: https://github.com/oconnor663/bao/blob/master/docs/spec.md
[bao-tree]: https://github.com/n0-computer/bao-tree