diff options
Diffstat (limited to 'tvix/castore')
-rw-r--r-- | tvix/castore/docs/blobstore-chunking.md | 147 | ||||
-rw-r--r-- | tvix/castore/docs/blobstore-protocol.md | 104 | ||||
-rw-r--r-- | tvix/castore/docs/verified-streaming.md | 121 |
3 files changed, 251 insertions, 121 deletions
diff --git a/tvix/castore/docs/blobstore-chunking.md b/tvix/castore/docs/blobstore-chunking.md new file mode 100644 index 000000000000..49bbe6927554 --- /dev/null +++ b/tvix/castore/docs/blobstore-chunking.md @@ -0,0 +1,147 @@ +# BlobStore: Chunking & Verified Streaming + +`tvix-castore`'s BlobStore is a content-addressed storage system, using [blake3] +as hash function. + +Returned data is fetched by using the digest as lookup key, and can be verified +to be correct by feeding the received data through the hash function and +ensuring it matches the digest initially used for the lookup. + +This means, data can be downloaded by any untrusted third-party as well, as the +received data is validated to match the digest it was originally requested with. + +However, for larger blobs of data, having to download the entire blob at once is +wasteful, if we only care about a part of the blob. Think about mounting a +seekable data structure, like loop-mounting an .iso file, or doing partial reads +in a large Parquet file, a column-oriented data format. + +> We want to have the possibility to *seek* into a larger file. + +This however shouldn't compromise on data integrity properties - we should not +need to trust a peer we're downloading from to be "honest" about the partial +data we're reading. We should be able to verify smaller reads. + +Especially when substituting from an untrusted third-party, we want to be able +to detect quickly if that third-party is sending us wrong data, and terminate +the connection early. + +## Chunking +In content-addressed systems, this problem has historically been solved by +breaking larger blobs into smaller chunks, which can be fetched individually, +and making a hash of *this listing* the blob digest/identifier. + + - BitTorrent for example breaks files up into smaller chunks, and maintains + a list of sha1 digests for each of these chunks. Magnet links contain a + digest over this listing as an identifier. (See [bittorrent-v2][here for + more details]). + With the identifier, a client can fetch the entire list, and then recursively + "unpack the graph" of nodes, until it ends up with a list of individual small + chunks, which can be fetched individually. + - Similarly, IPFS with its IPLD model builds up a Merkle DAG, and uses the + digest of the root node as an identitier. + +These approaches solve the problem of being able to fetch smaller chunks in a +trusted fashion. They can also do some deduplication, in case there's the same +leaf nodes same leaf nodes in multiple places. + +However, they also have a big disadvantage. The chunking parameters, and the +"topology" of the graph structure itself "bleed" into the root hash of the +entire data structure itself. + +Depending on the chunking parameters used, there's different representations for +the same data, causing less data sharing/reuse in the overall system, in terms of how +many chunks need to be downloaded vs. are already available locally, as well as +how compact data is stored on-disk. + +This can be workarounded by agreeing on only a single way of chunking, but it's +not pretty and misses a lot of deduplication potential. + +### Chunking in Tvix' Blobstore +tvix-castore's BlobStore uses a hybrid approach to eliminate some of the +disadvantages, while still being content-addressed internally, with the +highlighted benefits. + +It uses [blake3] as hash function, and the blake3 digest of **the raw data +itself** as an identifier (rather than some application-specific Merkle DAG that +also embeds some chunking information). + +BLAKE3 is a tree hash where all left nodes fully populated, contrary to +conventional serial hash functions. To be able to validate the hash of a node, +one only needs the hash of the (2) children [^1], if any. + +This means one only needs to the root digest to validate a constructions, and these +constructions can be sent [separately][bao-spec]. + +This relieves us from the need of having to encode more granular chunking into +our data model / identifier upfront, but can make this a mostly a transport/ +storage concern. + +For the some more description on the (remote) protocol, check +`./blobstore-protocol.md`. + +#### Logical vs. physical chunking + +Due to the properties of the BLAKE3 hash function, we have logical blocks of +1KiB, but this doesn't necessarily imply we need to restrict ourselves to these +chunk sizes w.r.t. what "physical chunks" are sent over the wire between peers, +or are stored on-disk. + +The only thing we need to be able to read and verify an arbitrary byte range is +having the covering range of aligned 1K blocks, and a construction from the root +digest to the 1K block. + +Note the intermediate hash tree can be further trimmed, [omitting][bao-tree] +lower parts of the tree while still providing verified streaming - at the cost +of having to fetch larger covering ranges of aligned blocks. + +Let's pick an example. We identify each KiB by a number here for illustrational +purposes. + +Assuming we omit the last two layers of the hash tree, we end up with logical +4KiB leaf chunks (`bao_shift` of `2`). + +For a blob of 14 KiB total size, we could fetch logical blocks `[0..=3]`, +`[4..=7]`, `[8..=11]` and `[12..=13]` in an authenticated fashion: + +`[ 0 1 2 3 ] [ 4 5 6 7 ] [ 8 9 10 11 ] [ 12 13 ]` + +Assuming the server now informs us about the following physical chunking: + +``` +[ 0 1 ] [ 2 3 4 5 ] [ 6 ] [ 7 8 ] [ 9 10 11 12 13 14 15 ]` +``` + +If our application now wants to arbitrarily read from 0 until 4 (inclusive): + +``` +[ 0 1 ] [ 2 3 4 5 ] [ 6 ] [ 7 8 ] [ 9 10 11 12 13 14 15 ] + |-------------| + +``` + +…we need to fetch physical chunks `[ 0 1 ]`, `[ 2 3 4 5 ]` and `[ 6 ] [ 7 8 ]`. + + +`[ 0 1 ]` and `[ 2 3 4 5 ]` are obvious, they contain the data we're +interested in. + +We however also need to fetch the physical chunks `[ 6 ]` and `[ 7 8 ]`, so we +can assemble `[ 4 5 6 7 ]` to verify both logical chunks: + +``` +[ 0 1 ] [ 2 3 4 5 ] [ 6 ] [ 7 8 ] [ 9 10 11 12 13 14 15 ] +^ ^ ^ ^ +|----4KiB----|------4KiB-----| +``` + +Each physical chunk fetched can be validated to have the blake3 digest that was +communicated upfront, and can be stored in a client-side cache/storage, so +subsequent / other requests for the same data will be fast(er). + +--- + +[^1]: and the surrounding context, aka position inside the whole blob, which is available while verifying the tree +[bittorrent-v2]: https://blog.libtorrent.org/2020/09/bittorrent-v2/ +[blake3]: https://github.com/BLAKE3-team/BLAKE3 +[bao-spec]: https://github.com/oconnor663/bao/blob/master/docs/spec.md +[bao-tree]: https://github.com/n0-computer/bao-tree diff --git a/tvix/castore/docs/blobstore-protocol.md b/tvix/castore/docs/blobstore-protocol.md new file mode 100644 index 000000000000..048cafc3d877 --- /dev/null +++ b/tvix/castore/docs/blobstore-protocol.md @@ -0,0 +1,104 @@ +# BlobStore: Protocol / Composition + +This documents describes the protocol that BlobStore uses to substitute blobs +other ("remote") BlobStores. + +How to come up with the blake3 digest of the blob to fetch is left to another +layer in the stack. + +To put this into the context of Tvix as a Nix alternative, a blob represents an +individual file inside a StorePath. +In the Tvix Data Model, this is accomplished by having a `FileNode` (either the +`root_node` in a `PathInfo` message, or a individual file inside a `Directory` +message) encode a BLAKE3 digest. + +However, the whole infrastructure can be applied for other usecases requiring +exchange/storage or access into data of which the blake3 digest is known. + +## Protocol and Interfaces +As an RPC protocol, BlobStore currently uses gRPC. + +On the Rust side of things, every blob service implements the +[`BlobService`](../src/blobservice/mod.rs) async trait, which isn't +gRPC-specific. + +This `BlobService` trait provides functionality to check for existence of Blobs, +read from blobs, and write new blobs. +It also provides a method to ask for more granular chunks if they are available. + +In addition to some in-memory, on-disk and (soon) object-storage-based +implementations, we also have a `BlobService` implementation that talks to a +gRPC server, as well as a gRPC server wrapper component, which provides a gRPC +service for anything implementing the `BlobService` trait. + +This makes it very easy to talk to a remote `BlobService`, which does not even +need to be written in the same language, as long it speaks the same gRPC +protocol. + +It also puts very little requirements on someone implementing a new +`BlobService`, and how its internal storage or chunking algorithm looks like. + +The gRPC protocol is documented in `../protos/rpc_blobstore.proto`. +Contrary to the `BlobService` trait, it does not have any options for seeking/ +ranging, as it's more desirable to provide this through chunking (see also +`./blobstore-chunking.md`). + +## Composition +Different `BlobStore` are supposed to be "composed"/"layered" to express +caching, multiple local and remote sources. + +The fronting interface can be the same, it'd just be multiple "tiers" that can +respond to requests, depending on where the data resides. [^1] + +This makes it very simple for consumers, as they don't need to be aware of the +entire substitutor config. + +The flexibility of this doesn't need to be exposed to the user in the default +case; in most cases we should be fine with some form of on-disk storage and a +bunch of substituters with different priorities. + +### gRPC Clients +Clients are encouraged to always read blobs in a chunked fashion (asking for a +list of chunks for a blob via `BlobService.Stat()`, then fetching chunks via +`BlobService.Read()` as needed), instead of directly reading the entire blob via +`BlobService.Read()`. + +In a composition setting, this provides opportunity for caching, and avoids +downloading some chunks if they're already present locally (for example, because +they were already downloaded by reading from a similar blob earlier). + +It also removes the need for seeking to be a part of the gRPC protocol +alltogether, as chunks are supposed to be "reasonably small" [^2]. + +There's some further optimization potential, a `BlobService.Stat()` request +could tell the server it's happy with very small blobs just being inlined in +an additional additional field in the response, which would allow clients to +populate their local chunk store in a single roundtrip. + +## Verified Streaming +As already described in `./docs/blobstore-chunking.md`, the physical chunk +information sent in a `BlobService.Stat()` response is still sufficient to fetch +in an authenticated fashion. + +The exact protocol and formats are still a bit in flux, but here's some notes: + + - `BlobService.Stat()` request gets a `send_bao` field (bool), signalling a + [BAO][bao-spec] should be sent. Could also be `bao_shift` integer, signalling + how detailed (down to the leaf chunks) it should go. + The exact format (and request fields) still need to be defined, edef has some + ideas around omitting some intermediate hash nodes over the wire and + recomputing them, reducing size by another ~50% over [bao-tree]. + - `BlobService.Stat()` response gets some bao-related fields (`bao_shift` + field, signalling the actual format/shift level the server replies with, the + actual bao, and maybe some format specifier). + It would be nice to also be compatible with the baos used by [iroh], so we + can provide an implementation using it too. + +--- + +[^1]: We might want to have some backchannel, so it becomes possible to provide + feedback to the user that something is downloaded. +[^2]: Something between 512K-4M, TBD. +[bao-spec]: https://github.com/oconnor663/bao/blob/master/docs/spec.md +[bao-tree]: https://github.com/n0-computer/bao-tree +[iroh]: https://github.com/n0-computer/iroh diff --git a/tvix/castore/docs/verified-streaming.md b/tvix/castore/docs/verified-streaming.md deleted file mode 100644 index fc766fff5d5a..000000000000 --- a/tvix/castore/docs/verified-streaming.md +++ /dev/null @@ -1,121 +0,0 @@ -# Verified streaming - -`//tvix/castore` is a content-addressed storage system, using [blake3] as hash -function. - -This means returned data is fetched by using the digest as lookup key, and can -be verified to be correct by feeding the received data through the hash function -and ensuring it matches the digest initially used for the lookup. - -This means, data can be downloaded by any untrusted third-party as well, as the -received data is validated to match the digest it was originally requested with. - -However, for larger blobs of data, having to download the entire blob to be able -to determine whether it's correct before being able to return it to an upper -layer takes a lot of time, and is wasteful, if we're only interested in a small -portion of it. - -Especially when substituting from an untrusted third-party, we want to be able -to detect quickly if that third-party is sending us wrong data, and terminate -the connection early. - -## Chunking - -This problem has historically been solved by exchanging a list of smaller -chunks, which can be fetched individually. - -BitTorrent for example breaks files up into smaller chunks, and maintains a list -of sha1 digests for each of these chunks. After the list has been fetched, this -allows fetching smaller parts of data selectively from untrusted third-parties. - -Similarly, IPFS uses its IPLD model to content-address a Merkle DAG of chunk -nodes. - -While these approaches solve the problem of being able to fetch smaller chunks, -they have a big disadvantage: the chunking parameters, and the topology of -the graph structure itself "bleed" into the hash of the entire data structure -itself. - -This comes with some disadvantages: - -Depending on the chunking parameters used, there's different representations for -the same data, causing less data sharing/reuse in the overall content- addressed -system, both when downloading data from third-parties, as well as benefiting -from data already available locally. - -This can be workarounded by agreeing on only single way of chunking, but it's -not pretty. - -## Chunking in tvix-castore - -tvix-castore uses BLAKE3 as a digest function, which internally uses a fixed -chunksize of 1024 bytes. - -BLAKE3 is a tree hash where all left nodes fully populated, contrary to -conventional serial hash functions. To be able to validate the hash of a node, -one only needs the hash of the (2) children, if any. - -This means one only needs to the root digest to validate a construction, and -lower levels of the tree can be omitted. - -This relieves us from the need of having to communicate more granular chunking -upfront, and making it part of our data model. - -## Logical vs. physical chunking - -Due to the properties of the BLAKE3 hash function, we have logical blocks of -1KiB, but this doesn't necessarily imply we need to restrict ourselves to these -chunk sizes. - -The only thing we need to be able to read and verify an arbitrary byte range is -having the covering range of aligned 1K blocks. - -## Actual implementation - - -> BlobService.Read() gets the capability to read chunks as well - -> BlobService.Stat() can reply with a list of chunks. - rq params: send_bao bool - server should be able to offer bao all the way down to 1k - some open questions w.r.t sending the whole bao until there, or just - all the hashes on the "most granular" level - -> we can recreate everything above up to the root hash. - -> can we maybe add this to n0-computer/bao-tree as another outboard format? - resp: - - bao_shift: how many levels on the bottom were skipped. - 0 means send all the leaf node hashes (1K block size) - - "our bao": blake3 digests for a given static chunk size + path down to the last leaf node and its data (length proof) - - list of (b3digest,size) of all physical chunks. - The server can do some CDC on ingestion, and communicate these chunks here. - Chunk sizes should be a "reasonable size", TBD, probably something between 512K-4M - -Depending on the bao depth received from the server, we end up with a logical -size of chunks that can be fetched in an authenticated fashion. - -Assuming the bao chunk size received is 4(*1KiB bytes) (`bao_shift` of 2), and a -total blob size of 14 (*1KiB bytes), we can fetch -`[0..=3]`, `[4..=7]`, `[8..=11]` and `[12..=13]` in an authenticated fashion: - -`[ 0 1 2 3 ] [ 4 5 6 7 ] [ 8 9 10 11 ] [ 12 13 ]` - -Assuming the server now informs us about the following physical chunking: - -`[ 0 1 ] [ 2 3 4 5 ] [ 6 ] [ 7 8 ] [ 9 10 11 12 13 14 15 ]` - -To read from 0 until 4 (inclusive), we need to fetch physical chunks -`[ 0 1 ]`, `[ 2 3 4 5 ]` and `[ 6 ] [ 7 8 ]`. - -`[ 0 1 ]` and `[ 2 3 4 5 ]` are obvious, they contain the data we're -interested in. - -We however also need to fetch the physical chunks `[ 6 ]` and `[ 7 8 ]`, so we -can assemble `[ 4 5 6 7 ]` to verify that logical chunk. - -Each physical chunk fetched can be validated to have the blake3 digest that was -communicated upfront, and can be stored in a client-side cache/storage. - -If it's not there, the client can use the `BlobService.Read()` interface with -the *physical chunk digest*. - ---- - -[blake3]: https://github.com/BLAKE3-team/BLAKE3 \ No newline at end of file |