3 files changed, 251 insertions, 121 deletions
diff --git a/tvix/castore/docs/blobstore-chunking.md b/tvix/castore/docs/blobstore-chunking.md
new file mode 100644
index 0000000000..49bbe69275
--- /dev/null
+++ b/tvix/castore/docs/blobstore-chunking.md
@@ -0,0 +1,147 @@
+# BlobStore: Chunking & Verified Streaming
+
+`tvix-castore`'s BlobStore is a content-addressed storage system, using [blake3]
+as hash function.
+
+Returned data is fetched by using the digest as lookup key, and can be verified
+to be correct by feeding the received data through the hash function and
+ensuring it matches the digest initially used for the lookup.
+
+This means, data can be downloaded by any untrusted third-party as well, as the
+received data is validated to match the digest it was originally requested with.
+
+However, for larger blobs of data, having to download the entire blob at once is
+wasteful, if we only care about a part of the blob. Think about mounting a
+seekable data structure, like loop-mounting an .iso file, or doing partial reads
+in a large Parquet file, a column-oriented data format.
+
+> We want to have the possibility to *seek* into a larger file.
+
+This however shouldn't compromise on data integrity properties - we should not
+need to trust a peer we're downloading from to be "honest" about the partial
+data we're reading. We should be able to verify smaller reads.
+
+Especially when substituting from an untrusted third-party, we want to be able
+to detect quickly if that third-party is sending us wrong data, and terminate
+the connection early.
+
+## Chunking
+In content-addressed systems, this problem has historically been solved by
+breaking larger blobs into smaller chunks, which can be fetched individually,
+and making a hash of *this listing* the blob digest/identifier.
+
+ - BitTorrent for example breaks files up into smaller chunks, and maintains
+   a list of sha1 digests for each of these chunks. Magnet links contain a
+   digest over this listing as an identifier. (See [bittorrent-v2][here for
+   more details]).
+   With the identifier, a client can fetch the entire list, and then recursively
+   "unpack the graph" of nodes, until it ends up with a list of individual small
+   chunks, which can be fetched individually.
+ - Similarly, IPFS with its IPLD model builds up a Merkle DAG, and uses the
+   digest of the root node as an identitier.
+
+These approaches solve the problem of being able to fetch smaller chunks in a
+trusted fashion. They can also do some deduplication, in case there's the same
+leaf nodes same leaf nodes in multiple places.
+
+However, they also have a big disadvantage. The chunking parameters, and the
+"topology" of the graph structure itself "bleed" into the root hash of the
+entire data structure itself.
+
+Depending on the chunking parameters used, there's different representations for
+the same data, causing less data sharing/reuse in the overall system, in terms of how
+many chunks need to be downloaded vs. are already available locally, as well as
+how compact data is stored on-disk.
+
+This can be workarounded by agreeing on only a single way of chunking, but it's
+not pretty and misses a lot of deduplication potential.
+
+### Chunking in Tvix' Blobstore
+tvix-castore's BlobStore uses a hybrid approach to eliminate some of the
+disadvantages, while still being content-addressed internally, with the
+highlighted benefits.
+
+It uses [blake3] as hash function, and the blake3 digest of **the raw data
+itself** as an identifier (rather than some application-specific Merkle DAG that
+also embeds some chunking information).
+
+BLAKE3 is a tree hash where all left nodes fully populated, contrary to
+conventional serial hash functions. To be able to validate the hash of a node,
+one only needs the hash of the (2) children [^1], if any.
+
+This means one only needs to the root digest to validate a constructions, and these
+constructions can be sent [separately][bao-spec].
+
+This relieves us from the need of having to encode more granular chunking into
+our data model / identifier upfront, but can make this a mostly a transport/
+storage concern.
+
+For the some more description on the (remote) protocol, check
+`./blobstore-protocol.md`.
+
+#### Logical vs. physical chunking
+
+Due to the properties of the BLAKE3 hash function, we have logical blocks of
+1KiB, but this doesn't necessarily imply we need to restrict ourselves to these
+chunk sizes w.r.t. what "physical chunks" are sent over the wire between peers,
+or are stored on-disk.
+
+The only thing we need to be able to read and verify an arbitrary byte range is
+having the covering range of aligned 1K blocks, and a construction from the root
+digest to the 1K block.
+
+Note the intermediate hash tree can be further trimmed, [omitting][bao-tree]
+lower parts of the tree while still providing verified streaming - at the cost
+of having to fetch larger covering ranges of aligned blocks.
+
+Let's pick an example. We identify each KiB by a number here for illustrational
+purposes.
+
+Assuming we omit the last two layers of the hash tree, we end up with logical
+4KiB leaf chunks (`bao_shift` of `2`).
+
+For a blob of 14 KiB total size, we could fetch logical blocks `[0..=3]`,
+`[4..=7]`, `[8..=11]` and `[12..=13]` in an authenticated fashion:
+
+`[ 0 1 2 3 ] [ 4 5 6 7 ] [ 8 9 10 11 ] [ 12 13 ]`
+
+Assuming the server now informs us about the following physical chunking:
+
+```
+[ 0 1 ] [ 2 3 4 5 ] [ 6 ] [ 7 8 ] [ 9 10 11 12 13 14 15 ]`
+```
+
+If our application now wants to arbitrarily read from 0 until 4 (inclusive):
+
+```
+[ 0 1 ] [ 2 3 4 5 ] [ 6 ] [ 7 8 ] [ 9 10 11 12 13 14 15 ]
+ |-------------|
+
+```
+
+…we need to fetch physical chunks `[ 0 1 ]`, `[ 2 3 4 5 ]` and `[ 6 ] [ 7 8 ]`.
+
+
+`[ 0 1 ]` and `[ 2 3 4 5 ]` are obvious, they contain the data we're
+interested in.
+
+We however also need to fetch the physical chunks `[ 6 ]` and `[ 7 8 ]`, so we
+can assemble `[ 4 5 6 7 ]` to verify both logical chunks:
+
+```
+[ 0 1 ] [ 2 3 4 5 ] [ 6 ] [ 7 8 ] [ 9 10 11 12 13 14 15 ]
+^       ^           ^     ^
+|----4KiB----|------4KiB-----|
+```
+
+Each physical chunk fetched can be validated to have the blake3 digest that was
+communicated upfront, and can be stored in a client-side cache/storage, so
+subsequent / other requests for the same data will be fast(er).
+
+---
+
+[^1]: and the surrounding context, aka position inside the whole blob, which is available while verifying the tree
+[bittorrent-v2]: https://blog.libtorrent.org/2020/09/bittorrent-v2/
+[blake3]: https://github.com/BLAKE3-team/BLAKE3
+[bao-spec]: https://github.com/oconnor663/bao/blob/master/docs/spec.md
+[bao-tree]: https://github.com/n0-computer/bao-tree
diff --git a/tvix/castore/docs/blobstore-protocol.md b/tvix/castore/docs/blobstore-protocol.md
new file mode 100644
index 0000000000..048cafc3d8
--- /dev/null
+++ b/tvix/castore/docs/blobstore-protocol.md
@@ -0,0 +1,104 @@
+# BlobStore: Protocol / Composition
+
+This documents describes the protocol that BlobStore uses to substitute blobs
+other ("remote") BlobStores.
+
+How to come up with the blake3 digest of the blob to fetch is left to another
+layer in the stack.
+
+To put this into the context of Tvix as a Nix alternative, a blob represents an
+individual file inside a StorePath.
+In the Tvix Data Model, this is accomplished by having a `FileNode` (either the
+`root_node` in a `PathInfo` message, or a individual file inside a `Directory`
+message) encode a BLAKE3 digest.
+
+However, the whole infrastructure can be applied for other usecases requiring
+exchange/storage or access into data of which the blake3 digest is known.
+
+## Protocol and Interfaces
+As an RPC protocol, BlobStore currently uses gRPC.
+
+On the Rust side of things, every blob service implements the
+[`BlobService`](../src/blobservice/mod.rs) async trait, which isn't
+gRPC-specific.
+
+This `BlobService` trait provides functionality to check for existence of Blobs,
+read from blobs, and write new blobs.
+It also provides a method to ask for more granular chunks if they are available.
+
+In addition to some in-memory, on-disk and (soon) object-storage-based
+implementations, we also have a `BlobService` implementation that talks to a
+gRPC server, as well as a gRPC server wrapper component, which provides a gRPC
+service for anything implementing the `BlobService` trait.
+
+This makes it very easy to talk to a remote `BlobService`, which does not even
+need to be written in the same language, as long it speaks the same gRPC
+protocol.
+
+It also puts very little requirements on someone implementing a new
+`BlobService`, and how its internal storage or chunking algorithm looks like.
+
+The gRPC protocol is documented in `../protos/rpc_blobstore.proto`.
+Contrary to the `BlobService` trait, it does not have any options for seeking/
+ranging, as it's more desirable to provide this through chunking (see also
+`./blobstore-chunking.md`).
+
+## Composition
+Different `BlobStore` are supposed to be "composed"/"layered" to express
+caching, multiple local and remote sources.
+
+The fronting interface can be the same, it'd just be multiple "tiers" that can
+respond to requests, depending on where the data resides. [^1]
+
+This makes it very simple for consumers, as they don't need to be aware of the
+entire substitutor config.
+
+The flexibility of this doesn't need to be exposed to the user in the default
+case; in most cases we should be fine with some form of on-disk storage and a
+bunch of substituters with different priorities.
+
+### gRPC Clients
+Clients are encouraged to always read blobs in a chunked fashion (asking for a
+list of chunks for a blob via `BlobService.Stat()`, then fetching chunks via
+`BlobService.Read()` as needed), instead of directly reading the entire blob via
+`BlobService.Read()`.
+
+In a composition setting, this provides opportunity for caching, and avoids
+downloading some chunks if they're already present locally (for example, because
+they were already downloaded by reading from a similar blob earlier).
+
+It also removes the need for seeking to be a part of the gRPC protocol
+alltogether, as chunks are supposed to be "reasonably small" [^2].
+
+There's some further optimization potential, a `BlobService.Stat()` request
+could tell the server it's happy with very small blobs just being inlined in
+an additional additional field in the response, which would allow clients to
+populate their local chunk store in a single roundtrip.
+
+## Verified Streaming
+As already described in `./docs/blobstore-chunking.md`, the physical chunk
+information sent in a `BlobService.Stat()` response is still sufficient to fetch
+in an authenticated fashion.
+
+The exact protocol and formats are still a bit in flux, but here's some notes:
+
+ - `BlobService.Stat()` request gets a `send_bao` field (bool), signalling a
+   [BAO][bao-spec] should be sent. Could also be `bao_shift` integer, signalling
+   how detailed (down to the leaf chunks) it should go.
+   The exact format (and request fields) still need to be defined, edef has some
+   ideas around omitting some intermediate hash nodes over the wire and
+   recomputing them, reducing size by another ~50% over [bao-tree].
+ - `BlobService.Stat()` response gets some bao-related fields (`bao_shift`
+   field, signalling the actual format/shift level the server replies with, the
+   actual bao, and maybe some format specifier).
+   It would be nice to also be compatible with the baos used by [iroh], so we
+   can provide an implementation using it too.
+
+---
+
+[^1]: We might want to have some backchannel, so it becomes possible to provide
+      feedback to the user that something is downloaded.
+[^2]: Something between 512K-4M, TBD.
+[bao-spec]: https://github.com/oconnor663/bao/blob/master/docs/spec.md
+[bao-tree]: https://github.com/n0-computer/bao-tree
+[iroh]: https://github.com/n0-computer/iroh
diff --git a/tvix/castore/docs/verified-streaming.md b/tvix/castore/docs/verified-streaming.md
deleted file mode 100644
index fc766fff5d..0000000000
--- a/tvix/castore/docs/verified-streaming.md
+++ /dev/null
@@ -1,121 +0,0 @@
-# Verified streaming
-
-`//tvix/castore` is a content-addressed storage system, using [blake3] as hash
-function.
-
-This means returned data is fetched by using the digest as lookup key, and can
-be verified to be correct by feeding the received data through the hash function
-and ensuring it matches the digest initially used for the lookup.
-
-This means, data can be downloaded by any untrusted third-party as well, as the
-received data is validated to match the digest it was originally requested with.
-
-However, for larger blobs of data, having to download the entire blob to be able
-to determine whether it's correct before being able to return it to an upper
-layer takes a lot of time, and is wasteful, if we're only interested in a small
-portion of it.
-
-Especially when substituting from an untrusted third-party, we want to be able
-to detect quickly if that third-party is sending us wrong data, and terminate
-the connection early.
-
-## Chunking
-
-This problem has historically been solved by exchanging a list of smaller
-chunks, which can be fetched individually.
-
-BitTorrent for example breaks files up into smaller chunks, and maintains a list
-of sha1 digests for each of these chunks. After the list has been fetched, this
-allows fetching smaller parts of data selectively from untrusted third-parties.
-
-Similarly, IPFS uses its IPLD model to content-address a Merkle DAG of chunk
-nodes.
-
-While these approaches solve the problem of being able to fetch smaller chunks,
-they have a big disadvantage: the chunking parameters, and the topology of
-the graph structure itself "bleed" into the hash of the entire data structure
-itself.
-
-This comes with some disadvantages:
-
-Depending on the chunking parameters used, there's different representations for
-the same data, causing less data sharing/reuse in the overall content- addressed
-system, both when downloading data from third-parties, as well as benefiting
-from data already available locally.
-
-This can be workarounded by agreeing on only single way of chunking, but it's
-not pretty.
-
-## Chunking in tvix-castore
-
-tvix-castore uses BLAKE3 as a digest function, which internally uses a fixed
-chunksize of 1024 bytes.
-
-BLAKE3 is a tree hash where all left nodes fully populated, contrary to
-conventional serial hash functions. To be able to validate the hash of a node,
-one only needs the hash of the (2) children, if any.
-
-This means one only needs to the root digest to validate a construction, and
-lower levels of the tree can be omitted.
-
-This relieves us from the need of having to communicate more granular chunking
-upfront, and making it part of our data model.
-
-## Logical vs. physical chunking
-
-Due to the properties of the BLAKE3 hash function, we have logical blocks of
-1KiB, but this doesn't necessarily imply we need to restrict ourselves to these
-chunk sizes.
-
-The only thing we need to be able to read and verify an arbitrary byte range is
-having the covering range of aligned 1K blocks.
-
-## Actual implementation
-
- -> BlobService.Read() gets the capability to read chunks as well
- -> BlobService.Stat() can reply with a list of chunks.
-      rq params: send_bao bool
-         server should be able to offer bao all the way down to 1k
-         some open questions w.r.t sending the whole bao until there, or just
-         all the hashes on the "most granular" level
-         -> we can recreate everything above up to the root hash.
-         -> can we maybe add this to n0-computer/bao-tree as another outboard format?
-      resp:
-        - bao_shift: how many levels on the bottom were skipped.
-          0 means send all the leaf node hashes (1K block size)
-        - "our bao": blake3 digests for a given static chunk size + path down to the last leaf node and its data (length proof)
-        - list of (b3digest,size) of all physical chunks.
-          The server can do some CDC on ingestion, and communicate these chunks here.
-          Chunk sizes should be a "reasonable size", TBD, probably something between 512K-4M
-
-Depending on the bao depth received from the server, we end up with a logical
-size of chunks that can be fetched in an authenticated fashion.
-
-Assuming the bao chunk size received is 4(*1KiB bytes) (`bao_shift` of 2), and a
-total blob size of 14 (*1KiB bytes), we can fetch
-`[0..=3]`, `[4..=7]`, `[8..=11]` and `[12..=13]` in an authenticated fashion:
-
-`[ 0 1 2 3 ] [ 4 5 6 7 ] [ 8 9 10 11 ] [ 12 13 ]`
-
-Assuming the server now informs us about the following physical chunking:
-
-`[ 0 1 ] [ 2 3 4 5 ] [ 6 ] [ 7 8 ] [ 9 10 11 12 13 14 15 ]`
-
-To read from 0 until 4 (inclusive), we need to fetch physical chunks
-`[ 0 1 ]`, `[ 2 3 4 5 ]` and `[ 6 ] [ 7 8 ]`.
-
-`[ 0 1 ]` and `[ 2 3 4 5 ]` are obvious, they contain the data we're
-interested in.
-
-We however also need to fetch the physical chunks `[ 6 ]` and `[ 7 8 ]`, so we
-can assemble `[ 4 5 6 7 ]` to verify that logical chunk.
-
-Each physical chunk fetched can be validated to have the blake3 digest that was
-communicated upfront, and can be stored in a client-side cache/storage.
-
-If it's not there, the client can use the `BlobService.Read()` interface with
-the *physical chunk digest*.
-
----
-
-[blake3]: https://github.com/BLAKE3-team/BLAKE3
\ No newline at end of file