Diffstat (limited to 'tvix/castore')
-rw-r--r-- | tvix/castore/docs/blobstore-chunking.md | 147 |
-rw-r--r-- | tvix/castore/docs/blobstore-protocol.md | 104 |
-rw-r--r-- | tvix/castore/docs/data-model.md | 50 |
-rw-r--r-- | tvix/castore/docs/why-not-git-trees.md | 57 |
4 files changed, 0 insertions, 358 deletions
diff --git a/tvix/castore/docs/blobstore-chunking.md b/tvix/castore/docs/blobstore-chunking.md
deleted file mode 100644
index df3c29680257..000000000000
--- a/tvix/castore/docs/blobstore-chunking.md
+++ /dev/null
@@ -1,147 +0,0 @@

# BlobStore: Chunking & Verified Streaming

`tvix-castore`'s BlobStore is a content-addressed storage system, using [blake3] as hash function.

Returned data is fetched by using the digest as lookup key, and can be verified to be correct by feeding the received data through the hash function and ensuring it matches the digest initially used for the lookup.

This means data can be downloaded from any untrusted third party as well, as the received data is validated to match the digest it was originally requested with.

However, for larger blobs of data, having to download the entire blob at once is wasteful if we only care about a part of the blob. Think about mounting a seekable data structure, like loop-mounting an .iso file, or doing partial reads in a large Parquet file, a column-oriented data format.

> We want to have the possibility to *seek* into a larger file.

This however shouldn't compromise on data integrity properties - we should not need to trust a peer we're downloading from to be "honest" about the partial data we're reading. We should be able to verify smaller reads.

Especially when substituting from an untrusted third party, we want to be able to detect quickly if that third party is sending us wrong data, and terminate the connection early.

## Chunking
In content-addressed systems, this problem has historically been solved by breaking larger blobs into smaller chunks, which can be fetched individually, and making a hash of *this listing* the blob digest/identifier.

 - BitTorrent for example breaks files up into smaller chunks, and maintains a list of sha1 digests for each of these chunks. Magnet links contain a digest over this listing as an identifier (see [here for more details][bittorrent-v2]). With the identifier, a client can fetch the entire list, and then recursively "unpack the graph" of nodes, until it ends up with a list of individual small chunks, which can be fetched individually.
 - Similarly, IPFS with its IPLD model builds up a Merkle DAG, and uses the digest of the root node as an identifier.

These approaches solve the problem of being able to fetch smaller chunks in a trusted fashion. They can also do some deduplication, in case the same leaf nodes appear in multiple places.

However, they also have a big disadvantage: the chunking parameters, and the "topology" of the graph structure itself, "bleed" into the root hash of the entire data structure.

Depending on the chunking parameters used, there are different representations for the same data, causing less data sharing/reuse in the overall system - both in terms of how many chunks need to be downloaded vs. are already available locally, and in how compactly data is stored on-disk.

This can be worked around by agreeing on only a single way of chunking, but it's not pretty and misses a lot of deduplication potential.
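To make the "chunking parameters bleed into the identifier" point concrete, here's a minimal sketch (not the format of any particular system above): it derives an identifier by hashing the concatenation of fixed-size chunk digests, and shows that the same payload ends up with different identifiers for different chunk sizes, while a digest over the raw bytes stays stable.

```rust
// Sketch only: a naive "hash of the chunk digest listing" identifier,
// illustrating that chunking parameters change the resulting root digest.
// Uses the `blake3` crate.

fn listing_identifier(data: &[u8], chunk_size: usize) -> blake3::Hash {
    let mut hasher = blake3::Hasher::new();
    for chunk in data.chunks(chunk_size) {
        // Hash each chunk, then feed the chunk digest into the listing hash.
        hasher.update(blake3::hash(chunk).as_bytes());
    }
    hasher.finalize()
}

fn main() {
    let data = vec![0x42u8; 1 << 20]; // 1 MiB of example data

    // Same data, different chunk sizes -> different "listing" identifiers.
    println!("64 KiB chunks:  {}", listing_identifier(&data, 64 * 1024));
    println!("256 KiB chunks: {}", listing_identifier(&data, 256 * 1024));

    // A digest over the raw bytes is independent of any chunking decision.
    println!("raw blake3:     {}", blake3::hash(&data));
}
```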
### Chunking in Tvix' Blobstore
tvix-castore's BlobStore uses a hybrid approach to eliminate some of the disadvantages, while still being content-addressed internally, with the highlighted benefits.

It uses [blake3] as hash function, and the blake3 digest of **the raw data itself** as an identifier (rather than some application-specific Merkle DAG that also embeds some chunking information).

BLAKE3 is a tree hash where all left nodes are fully populated, contrary to conventional serial hash functions. To be able to validate the hash of a node, one only needs the hashes of its (2) children [^1], if any.

This means one only needs the root digest to validate a construction, and these constructions can be sent [separately][bao-spec].

This relieves us of the need to encode more granular chunking into our data model / identifier upfront, and makes it mostly a transport/storage concern.

For some more description on the (remote) protocol, check `./blobstore-protocol.md`.

#### Logical vs. physical chunking

Due to the properties of the BLAKE3 hash function, we have logical blocks of 1KiB, but this doesn't necessarily imply we need to restrict ourselves to these chunk sizes w.r.t. what "physical chunks" are sent over the wire between peers, or are stored on-disk.

The only thing we need to be able to read and verify an arbitrary byte range is the covering range of aligned 1K blocks, and a construction from the root digest down to these 1K blocks.

Note the intermediate hash tree can be further trimmed, [omitting][bao-tree] lower parts of the tree while still providing verified streaming - at the cost of having to fetch larger covering ranges of aligned blocks.

Let's pick an example. We identify each KiB by a number here for illustration purposes.

Assuming we omit the last two layers of the hash tree, we end up with logical 4KiB leaf chunks (a `bao_shift` of `2`).

For a blob of 14 KiB total size, we could fetch logical blocks `[0..=3]`, `[4..=7]`, `[8..=11]` and `[12..=13]` in an authenticated fashion:

`[ 0 1 2 3 ] [ 4 5 6 7 ] [ 8 9 10 11 ] [ 12 13 ]`

Assuming the server now informs us about the following physical chunking:

```
[ 0 1 ] [ 2 3 4 5 ] [ 6 ] [ 7 8 ] [ 9 10 11 12 13 ]
```

If our application now wants to arbitrarily read from 0 until 4 (inclusive):

```
[ 0 1 ] [ 2 3 4 5 ] [ 6 ] [ 7 8 ] [ 9 10 11 12 13 ]
  |-----------|
```

…we need to fetch the physical chunks `[ 0 1 ]`, `[ 2 3 4 5 ]`, `[ 6 ]` and `[ 7 8 ]`.

`[ 0 1 ]` and `[ 2 3 4 5 ]` are obvious - they contain the data we're interested in.

We however also need to fetch the physical chunks `[ 6 ]` and `[ 7 8 ]`, so we can assemble `[ 4 5 6 7 ]` to verify both logical chunks:

```
[ 0 1 ] [ 2 3 4 5 ] [ 6 ] [ 7 8 ] [ 9 10 11 12 13 ]
^       ^           ^     ^
|----4KiB----|------4KiB-------|
```

Each physical chunk fetched can be validated to have the blake3 digest that was communicated upfront, and can be stored in a client-side cache/storage, so subsequent / other requests for the same data will be fast(er).
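As a small illustration of the arithmetic above, here's a hedged sketch that computes which aligned logical blocks cover a requested byte range for a given `bao_shift`, and verifies a fetched physical chunk against the digest advertised for it. The function and parameter names are illustrative, not part of any actual tvix API; only the `blake3` crate calls are real.

```rust
// Sketch: covering-block arithmetic and per-chunk verification.
// Names are illustrative; only the `blake3` crate API is real.

/// Size of a logical leaf chunk for a given bao_shift (1 KiB << shift).
fn leaf_size(bao_shift: u32) -> u64 {
    1024u64 << bao_shift
}

/// Returns the aligned logical block range (start offset, end offset exclusive)
/// that must be verified to serve a read of `len` bytes at `offset`.
fn covering_range(offset: u64, len: u64, bao_shift: u32) -> (u64, u64) {
    let block = leaf_size(bao_shift);
    let start = (offset / block) * block;
    let end = (offset + len).div_ceil(block) * block;
    (start, end)
}

/// Verifies that a fetched physical chunk matches the digest the server
/// advertised for it upfront.
fn verify_chunk(expected_digest: &[u8; 32], chunk_data: &[u8]) -> bool {
    blake3::hash(chunk_data).as_bytes() == expected_digest
}

fn main() {
    // Reading KiB 0 through 4 inclusive (bytes 0..5120) with a bao_shift of 2
    // requires the aligned 4 KiB blocks covering the first 8 KiB, i.e. the
    // logical chunks [0..=3] and [4..=7] from the example above.
    assert_eq!(covering_range(0, 5 * 1024, 2), (0, 8 * 1024));

    let chunk = b"some physical chunk contents";
    let advertised = *blake3::hash(chunk).as_bytes();
    assert!(verify_chunk(&advertised, chunk));
}
```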
---

[^1]: and the surrounding context, aka position inside the whole blob, which is available while verifying the tree

[bittorrent-v2]: https://blog.libtorrent.org/2020/09/bittorrent-v2/
[blake3]: https://github.com/BLAKE3-team/BLAKE3
[bao-spec]: https://github.com/oconnor663/bao/blob/master/docs/spec.md
[bao-tree]: https://github.com/n0-computer/bao-tree

diff --git a/tvix/castore/docs/blobstore-protocol.md b/tvix/castore/docs/blobstore-protocol.md
deleted file mode 100644
index 048cafc3d877..000000000000
--- a/tvix/castore/docs/blobstore-protocol.md
+++ /dev/null
@@ -1,104 +0,0 @@

# BlobStore: Protocol / Composition

This document describes the protocol that BlobStore uses to substitute blobs from other ("remote") BlobStores.

How to come up with the blake3 digest of the blob to fetch is left to another layer in the stack.

To put this into the context of Tvix as a Nix alternative, a blob represents an individual file inside a StorePath. In the Tvix Data Model, this is accomplished by having a `FileNode` (either the `root_node` in a `PathInfo` message, or an individual file inside a `Directory` message) encode a BLAKE3 digest.

However, the whole infrastructure can be applied to other use cases requiring exchange/storage of, or access into, data of which the blake3 digest is known.

## Protocol and Interfaces
As an RPC protocol, BlobStore currently uses gRPC.

On the Rust side of things, every blob service implements the [`BlobService`](../src/blobservice/mod.rs) async trait, which isn't gRPC-specific.

This `BlobService` trait provides functionality to check for existence of blobs, read from blobs, and write new blobs. It also provides a method to ask for more granular chunks if they are available.

In addition to some in-memory, on-disk and (soon) object-storage-based implementations, we also have a `BlobService` implementation that talks to a gRPC server, as well as a gRPC server wrapper component, which provides a gRPC service for anything implementing the `BlobService` trait.

This makes it very easy to talk to a remote `BlobService`, which does not even need to be written in the same language, as long as it speaks the same gRPC protocol.

It also puts very few requirements on someone implementing a new `BlobService`, and on what its internal storage or chunking algorithm looks like.

The gRPC protocol is documented in `../protos/rpc_blobstore.proto`. Contrary to the `BlobService` trait, it does not have any options for seeking/ranging, as it's more desirable to provide this through chunking (see also `./blobstore-chunking.md`).

## Composition
Different `BlobStore`s are supposed to be "composed"/"layered" to express caching and multiple local and remote sources.

The fronting interface can be the same; it'd just be multiple "tiers" that can respond to requests, depending on where the data resides. [^1]

This makes it very simple for consumers, as they don't need to be aware of the entire substitutor config.

The flexibility of this doesn't need to be exposed to the user in the default case; in most cases we should be fine with some form of on-disk storage and a bunch of substituters with different priorities.

### gRPC Clients
Clients are encouraged to always read blobs in a chunked fashion (asking for a list of chunks for a blob via `BlobService.Stat()`, then fetching chunks via `BlobService.Read()` as needed), instead of directly reading the entire blob via `BlobService.Read()`.
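Here's a minimal sketch of what such a chunked read could look like on the client side. The trait and method names below (`stat_chunks`, `read_chunk`) are hypothetical stand-ins for the real `BlobService` trait and gRPC stubs, which aren't reproduced here; only the Stat-then-Read flow itself is taken from the text above.

```rust
// Hypothetical client-side flow: Stat() for the chunk list, then Read() each
// chunk, verifying it locally before use. Not the actual tvix BlobService API.

struct ChunkMeta {
    digest: [u8; 32], // blake3 digest of the physical chunk
    size: u64,        // size in bytes, as advertised by the server
}

trait ChunkedBlobClient {
    /// Stand-in for BlobService.Stat(): returns the physical chunk list, if any.
    fn stat_chunks(&self, blob_digest: &[u8; 32]) -> Option<Vec<ChunkMeta>>;
    /// Stand-in for BlobService.Read(): fetches a single chunk by its digest.
    fn read_chunk(&self, chunk_digest: &[u8; 32]) -> Vec<u8>;
}

/// Reads a whole blob chunk-by-chunk, verifying each chunk as it arrives.
fn read_blob_chunked<C: ChunkedBlobClient>(client: &C, blob_digest: &[u8; 32]) -> Vec<u8> {
    // Fall back to treating the whole blob as one chunk if the server reports
    // no finer-grained chunking.
    let chunks = client.stat_chunks(blob_digest).unwrap_or_else(|| {
        vec![ChunkMeta { digest: *blob_digest, size: 0 }]
    });

    let mut out = Vec::new();
    for chunk in chunks {
        let data = client.read_chunk(&chunk.digest);
        // Each chunk is content-addressed, so it can be verified in isolation
        // (and cached locally under its digest).
        assert_eq!(blake3::hash(&data).as_bytes(), &chunk.digest);
        out.extend_from_slice(&data);
    }
    out
}
```

In the real protocol these calls are gRPC (streaming) RPCs and the trait is async; the sketch only shows the ordering and the per-chunk verification that makes local caching safe.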
In a composition setting, reading in a chunked fashion provides an opportunity for caching, and avoids downloading some chunks if they're already present locally (for example, because they were already downloaded by reading from a similar blob earlier).

It also removes the need for seeking to be part of the gRPC protocol altogether, as chunks are supposed to be "reasonably small" [^2].

There's some further optimization potential: a `BlobService.Stat()` request could tell the server it's happy with very small blobs just being inlined in an additional field in the response, which would allow clients to populate their local chunk store in a single roundtrip.

## Verified Streaming
As already described in `./blobstore-chunking.md`, the physical chunk information sent in a `BlobService.Stat()` response is still sufficient to fetch in an authenticated fashion.

The exact protocol and formats are still a bit in flux, but here are some notes:

 - The `BlobService.Stat()` request gets a `send_bao` field (bool), signalling a [BAO][bao-spec] should be sent. It could also be a `bao_shift` integer, signalling how detailed (down to the leaf chunks) it should go. The exact format (and request fields) still need to be defined; edef has some ideas around omitting some intermediate hash nodes over the wire and recomputing them, reducing size by another ~50% over [bao-tree].
 - The `BlobService.Stat()` response gets some bao-related fields (a `bao_shift` field signalling the actual format/shift level the server replies with, the actual bao, and maybe some format specifier). It would be nice to also be compatible with the baos used by [iroh], so we can provide an implementation using it too.

---

[^1]: We might want to have some backchannel, so it becomes possible to provide feedback to the user that something is being downloaded.
[^2]: Something between 512K-4M, TBD.

[bao-spec]: https://github.com/oconnor663/bao/blob/master/docs/spec.md
[bao-tree]: https://github.com/n0-computer/bao-tree
[iroh]: https://github.com/n0-computer/iroh

diff --git a/tvix/castore/docs/data-model.md b/tvix/castore/docs/data-model.md
deleted file mode 100644
index 5e6220cc23fa..000000000000
--- a/tvix/castore/docs/data-model.md
+++ /dev/null
@@ -1,50 +0,0 @@

# Data model

This provides some more notes on the fields used in castore.proto.

See `//tvix/store/docs/api.md` for the full context.

## Directory message
`Directory` messages use the blake3 hash of their canonical protobuf serialization as their identifier.

A `Directory` message contains three lists, `directories`, `files` and `symlinks`, holding `DirectoryNode`, `FileNode` and `SymlinkNode` messages respectively. They describe all the direct child elements that are contained in a directory.

All three message types have a `name` field, specifying the (base)name of the element (which MUST NOT contain slashes or null bytes, and MUST NOT be '.' or '..'). For reproducibility reasons, the lists MUST be sorted by that name and the name MUST be unique across all three lists.

In addition to the `name` field, the various *Node messages have the following fields:

## DirectoryNode
A `DirectoryNode` message represents a child directory.

It has a `digest` field, which points to the identifier of another `Directory` message, making a `Directory` a Merkle tree (or strictly speaking, a graph, as two elements pointing to a child directory with the same contents would point to the same `Directory` message).

There's also a `size` field, containing the (total) number of all child elements in the referenced `Directory`, which helps with inode calculation.

## FileNode
A `FileNode` message represents a child (regular) file.

Its `digest` field contains the blake3 hash of the file contents. It can be looked up in the `BlobService`.

The `size` field contains the size of the blob the `digest` field refers to.

The `executable` field specifies whether the file should be marked as executable or not.

## SymlinkNode
A `SymlinkNode` message represents a child symlink.

In addition to the `name` field, the only additional field is the `target`, which is a string containing the target of the symlink.
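The naming rules above (no slashes or null bytes, not '.' or '..', each list sorted by name, names unique across all three lists) are easy to get wrong when constructing `Directory` messages by hand, so here's a hedged sketch of a validation helper. The types and function are illustrative, not the generated castore.proto bindings or tvix's actual validation code.

```rust
use std::collections::HashSet;

// Sketch: checking the `name` constraints described above.
// Illustrative only; not the actual tvix-castore validation logic.

fn valid_name(name: &[u8]) -> bool {
    !name.is_empty()
        && name != b"."
        && name != b".."
        && !name.contains(&b'/')
        && !name.contains(&b'\0')
}

/// Takes the names of each of the three lists in message order.
fn validate_directory_names(
    directories: &[&[u8]],
    files: &[&[u8]],
    symlinks: &[&[u8]],
) -> Result<(), String> {
    let mut seen: HashSet<&[u8]> = HashSet::new();
    for list in [directories, files, symlinks] {
        // Within each list, names must be sorted (strictly, so also unique).
        for pair in list.windows(2) {
            if pair[0] >= pair[1] {
                return Err(format!("names not sorted at {:?}", pair));
            }
        }
        for name in list {
            if !valid_name(name) {
                return Err(format!("invalid name: {:?}", name));
            }
            // Names must also be unique across all three lists.
            if !seen.insert(*name) {
                return Err(format!("duplicate name across lists: {:?}", name));
            }
        }
    }
    Ok(())
}
```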
diff --git a/tvix/castore/docs/why-not-git-trees.md b/tvix/castore/docs/why-not-git-trees.md
deleted file mode 100644
index 4a12b4ef5554..000000000000
--- a/tvix/castore/docs/why-not-git-trees.md
+++ /dev/null
@@ -1,57 +0,0 @@

## Why not git tree objects?

We've been experimenting with (some variations of) the git tree and object format, and ultimately decided against using it as an internal format, instead adapting the one documented in the other documents here.

While the tvix-store API protocol shares some similarities with the format used in git for trees and objects, the git one has shown some significant disadvantages:

### The binary encoding itself

#### trees
The git tree object format is a very binary, error-prone and "made-to-be-read-and-written-from-C" format.

Tree objects are a combination of null-terminated strings and fields of known length. References to other tree objects use the literal sha1 hash of another tree object in this encoding. Extensions of / changes to the format are very hard to do right, because parsers are not aware they might be parsing something different.

The tvix-store protocol uses a canonical protobuf serialization, and uses the [blake3][blake3] hash of that serialization to point to other `Directory` messages. It's both compact and has a wide range of libraries for encoders and decoders in many programming languages. The choice of protobuf makes it easy to add new fields, and lets old clients detect when unknown fields are present [^adding-fields].

#### blob
On disk, git blob objects start with a "blob" prefix, then the size of the payload, and then the data itself. The hash of a blob is the literal sha1sum over all of this - which makes it something very git-specific to request.

tvix-store simply uses the [blake3][blake3] hash of the literal contents when referring to a file/blob, which makes it very easy to ask other data sources for the same data, as no git-specific payload is included in the hash. This also plays very well together with things like [iroh][iroh-discussion], which plans to provide a way to substitute (large) blobs by their blake3 hash over the IPFS network.
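To illustrate the difference, the sketch below computes a git blob identifier (sha1 over a `blob <size>\0` header plus the contents, which is the standard git object encoding) next to the blake3 digest of the raw contents. The `sha1` and `hex` crates are assumed purely for illustration.

```rust
// Sketch: the same file contents, addressed the git way vs. the castore way.
// Uses the `sha1`, `hex` and `blake3` crates; for illustration only.

use sha1::{Digest, Sha1};

/// Git-style blob identifier: sha1 over "blob <len>\0" + contents.
fn git_blob_id(contents: &[u8]) -> String {
    let mut hasher = Sha1::new();
    hasher.update(format!("blob {}\0", contents.len()).as_bytes());
    hasher.update(contents);
    hex::encode(hasher.finalize())
}

fn main() {
    let contents = b"hello world\n";

    // The git identifier bakes the "blob <len>\0" header into the hash, so
    // only git-speaking peers can be asked for it.
    println!("git blob id: {}", git_blob_id(contents));

    // The castore identifier is just the digest of the raw bytes, so any
    // source that can serve data by blake3 digest can substitute it.
    println!("blake3:      {}", blake3::hash(contents));
}
```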
In addition to that, [blake3][blake3] makes it possible to do [verified streaming][bao], as already described in other parts of the documentation.

The git tree object format uses sha1 both for references to other trees and for hashes of blobs, which isn't really a hash function to fundamentally base everything on in 2023. The [migration to sha256][git-sha256] has also been dead for some years now, and it's unclear what a "blake3" version of this would even look like.

[bao]: https://github.com/oconnor663/bao
[blake3]: https://github.com/BLAKE3-team/BLAKE3
[git-sha256]: https://git-scm.com/docs/hash-function-transition/
[iroh-discussion]: https://github.com/n0-computer/iroh/discussions/707#discussioncomment-5070197

[^adding-fields]: Obviously, adding new fields will change hashes, but it's something that's easy to detect.