From 6947dc4349fa85cb702f46acfe3255c907096b12 Mon Sep 17 00:00:00 2001 From: Florian Klink Date: Thu, 13 Jun 2024 22:04:32 +0300 Subject: chore(tvix/docs): move [ca]store docs to tvix/docs Change-Id: Idd78ffae34b6ea7b93d13de73b98c61a348869fb Reviewed-on: https://cl.tvl.fyi/c/depot/+/11808 Tested-by: BuildkiteCI Reviewed-by: tazjin Autosubmit: flokli --- tvix/castore/docs/blobstore-chunking.md | 147 -------------- tvix/castore/docs/blobstore-protocol.md | 104 ---------- tvix/castore/docs/data-model.md | 50 ----- tvix/castore/docs/why-not-git-trees.md | 57 ------ tvix/docs/src/SUMMARY.md | 7 + tvix/docs/src/castore/blobstore-chunking.md | 147 ++++++++++++++ tvix/docs/src/castore/blobstore-protocol.md | 104 ++++++++++ tvix/docs/src/castore/data-model.md | 50 +++++ tvix/docs/src/castore/why-not-git-trees.md | 57 ++++++ tvix/docs/src/store/api.md | 288 ++++++++++++++++++++++++++++ tvix/store/default.nix | 2 +- tvix/store/docs/api.md | 288 ---------------------------- 12 files changed, 654 insertions(+), 647 deletions(-) delete mode 100644 tvix/castore/docs/blobstore-chunking.md delete mode 100644 tvix/castore/docs/blobstore-protocol.md delete mode 100644 tvix/castore/docs/data-model.md delete mode 100644 tvix/castore/docs/why-not-git-trees.md create mode 100644 tvix/docs/src/castore/blobstore-chunking.md create mode 100644 tvix/docs/src/castore/blobstore-protocol.md create mode 100644 tvix/docs/src/castore/data-model.md create mode 100644 tvix/docs/src/castore/why-not-git-trees.md create mode 100644 tvix/docs/src/store/api.md delete mode 100644 tvix/store/docs/api.md diff --git a/tvix/castore/docs/blobstore-chunking.md b/tvix/castore/docs/blobstore-chunking.md deleted file mode 100644 index df3c29680257..000000000000 --- a/tvix/castore/docs/blobstore-chunking.md +++ /dev/null @@ -1,147 +0,0 @@ -# BlobStore: Chunking & Verified Streaming - -`tvix-castore`'s BlobStore is a content-addressed storage system, using [blake3] -as hash function. - -Returned data is fetched by using the digest as lookup key, and can be verified -to be correct by feeding the received data through the hash function and -ensuring it matches the digest initially used for the lookup. - -This means, data can be downloaded by any untrusted third-party as well, as the -received data is validated to match the digest it was originally requested with. - -However, for larger blobs of data, having to download the entire blob at once is -wasteful, if we only care about a part of the blob. Think about mounting a -seekable data structure, like loop-mounting an .iso file, or doing partial reads -in a large Parquet file, a column-oriented data format. - -> We want to have the possibility to *seek* into a larger file. - -This however shouldn't compromise on data integrity properties - we should not -need to trust a peer we're downloading from to be "honest" about the partial -data we're reading. We should be able to verify smaller reads. - -Especially when substituting from an untrusted third-party, we want to be able -to detect quickly if that third-party is sending us wrong data, and terminate -the connection early. - -## Chunking -In content-addressed systems, this problem has historically been solved by -breaking larger blobs into smaller chunks, which can be fetched individually, -and making a hash of *this listing* the blob digest/identifier. - - - BitTorrent for example breaks files up into smaller chunks, and maintains - a list of sha1 digests for each of these chunks. 
Magnet links contain a - digest over this listing as an identifier. (See [bittorrent-v2][here for - more details]). - With the identifier, a client can fetch the entire list, and then recursively - "unpack the graph" of nodes, until it ends up with a list of individual small - chunks, which can be fetched individually. - - Similarly, IPFS with its IPLD model builds up a Merkle DAG, and uses the - digest of the root node as an identitier. - -These approaches solve the problem of being able to fetch smaller chunks in a -trusted fashion. They can also do some deduplication, in case there's the same -leaf nodes same leaf nodes in multiple places. - -However, they also have a big disadvantage. The chunking parameters, and the -"topology" of the graph structure itself "bleed" into the root hash of the -entire data structure itself. - -Depending on the chunking parameters used, there's different representations for -the same data, causing less data sharing/reuse in the overall system, in terms of how -many chunks need to be downloaded vs. are already available locally, as well as -how compact data is stored on-disk. - -This can be workarounded by agreeing on only a single way of chunking, but it's -not pretty and misses a lot of deduplication potential. - -### Chunking in Tvix' Blobstore -tvix-castore's BlobStore uses a hybrid approach to eliminate some of the -disadvantages, while still being content-addressed internally, with the -highlighted benefits. - -It uses [blake3] as hash function, and the blake3 digest of **the raw data -itself** as an identifier (rather than some application-specific Merkle DAG that -also embeds some chunking information). - -BLAKE3 is a tree hash where all left nodes fully populated, contrary to -conventional serial hash functions. To be able to validate the hash of a node, -one only needs the hash of the (2) children [^1], if any. - -This means one only needs to the root digest to validate a constructions, and these -constructions can be sent [separately][bao-spec]. - -This relieves us from the need of having to encode more granular chunking into -our data model / identifier upfront, but can make this mostly a transport/ -storage concern. - -For some more description on the (remote) protocol, check -`./blobstore-protocol.md`. - -#### Logical vs. physical chunking - -Due to the properties of the BLAKE3 hash function, we have logical blocks of -1KiB, but this doesn't necessarily imply we need to restrict ourselves to these -chunk sizes w.r.t. what "physical chunks" are sent over the wire between peers, -or are stored on-disk. - -The only thing we need to be able to read and verify an arbitrary byte range is -having the covering range of aligned 1K blocks, and a construction from the root -digest to the 1K block. - -Note the intermediate hash tree can be further trimmed, [omitting][bao-tree] -lower parts of the tree while still providing verified streaming - at the cost -of having to fetch larger covering ranges of aligned blocks. - -Let's pick an example. We identify each KiB by a number here for illustrational -purposes. - -Assuming we omit the last two layers of the hash tree, we end up with logical -4KiB leaf chunks (`bao_shift` of `2`). 
- -For a blob of 14 KiB total size, we could fetch logical blocks `[0..=3]`, -`[4..=7]`, `[8..=11]` and `[12..=13]` in an authenticated fashion: - -`[ 0 1 2 3 ] [ 4 5 6 7 ] [ 8 9 10 11 ] [ 12 13 ]` - -Assuming the server now informs us about the following physical chunking: - -``` -[ 0 1 ] [ 2 3 4 5 ] [ 6 ] [ 7 8 ] [ 9 10 11 12 13 14 15 ]` -``` - -If our application now wants to arbitrarily read from 0 until 4 (inclusive): - -``` -[ 0 1 ] [ 2 3 4 5 ] [ 6 ] [ 7 8 ] [ 9 10 11 12 13 14 15 ] - |-------------| - -``` - -…we need to fetch physical chunks `[ 0 1 ]`, `[ 2 3 4 5 ]` and `[ 6 ] [ 7 8 ]`. - - -`[ 0 1 ]` and `[ 2 3 4 5 ]` are obvious, they contain the data we're -interested in. - -We however also need to fetch the physical chunks `[ 6 ]` and `[ 7 8 ]`, so we -can assemble `[ 4 5 6 7 ]` to verify both logical chunks: - -``` -[ 0 1 ] [ 2 3 4 5 ] [ 6 ] [ 7 8 ] [ 9 10 11 12 13 14 15 ] -^ ^ ^ ^ -|----4KiB----|------4KiB-----| -``` - -Each physical chunk fetched can be validated to have the blake3 digest that was -communicated upfront, and can be stored in a client-side cache/storage, so -subsequent / other requests for the same data will be fast(er). - ---- - -[^1]: and the surrounding context, aka position inside the whole blob, which is available while verifying the tree -[bittorrent-v2]: https://blog.libtorrent.org/2020/09/bittorrent-v2/ -[blake3]: https://github.com/BLAKE3-team/BLAKE3 -[bao-spec]: https://github.com/oconnor663/bao/blob/master/docs/spec.md -[bao-tree]: https://github.com/n0-computer/bao-tree diff --git a/tvix/castore/docs/blobstore-protocol.md b/tvix/castore/docs/blobstore-protocol.md deleted file mode 100644 index 048cafc3d877..000000000000 --- a/tvix/castore/docs/blobstore-protocol.md +++ /dev/null @@ -1,104 +0,0 @@ -# BlobStore: Protocol / Composition - -This documents describes the protocol that BlobStore uses to substitute blobs -other ("remote") BlobStores. - -How to come up with the blake3 digest of the blob to fetch is left to another -layer in the stack. - -To put this into the context of Tvix as a Nix alternative, a blob represents an -individual file inside a StorePath. -In the Tvix Data Model, this is accomplished by having a `FileNode` (either the -`root_node` in a `PathInfo` message, or a individual file inside a `Directory` -message) encode a BLAKE3 digest. - -However, the whole infrastructure can be applied for other usecases requiring -exchange/storage or access into data of which the blake3 digest is known. - -## Protocol and Interfaces -As an RPC protocol, BlobStore currently uses gRPC. - -On the Rust side of things, every blob service implements the -[`BlobService`](../src/blobservice/mod.rs) async trait, which isn't -gRPC-specific. - -This `BlobService` trait provides functionality to check for existence of Blobs, -read from blobs, and write new blobs. -It also provides a method to ask for more granular chunks if they are available. - -In addition to some in-memory, on-disk and (soon) object-storage-based -implementations, we also have a `BlobService` implementation that talks to a -gRPC server, as well as a gRPC server wrapper component, which provides a gRPC -service for anything implementing the `BlobService` trait. - -This makes it very easy to talk to a remote `BlobService`, which does not even -need to be written in the same language, as long it speaks the same gRPC -protocol. - -It also puts very little requirements on someone implementing a new -`BlobService`, and how its internal storage or chunking algorithm looks like. 
- -The gRPC protocol is documented in `../protos/rpc_blobstore.proto`. -Contrary to the `BlobService` trait, it does not have any options for seeking/ -ranging, as it's more desirable to provide this through chunking (see also -`./blobstore-chunking.md`). - -## Composition -Different `BlobStore` are supposed to be "composed"/"layered" to express -caching, multiple local and remote sources. - -The fronting interface can be the same, it'd just be multiple "tiers" that can -respond to requests, depending on where the data resides. [^1] - -This makes it very simple for consumers, as they don't need to be aware of the -entire substitutor config. - -The flexibility of this doesn't need to be exposed to the user in the default -case; in most cases we should be fine with some form of on-disk storage and a -bunch of substituters with different priorities. - -### gRPC Clients -Clients are encouraged to always read blobs in a chunked fashion (asking for a -list of chunks for a blob via `BlobService.Stat()`, then fetching chunks via -`BlobService.Read()` as needed), instead of directly reading the entire blob via -`BlobService.Read()`. - -In a composition setting, this provides opportunity for caching, and avoids -downloading some chunks if they're already present locally (for example, because -they were already downloaded by reading from a similar blob earlier). - -It also removes the need for seeking to be a part of the gRPC protocol -alltogether, as chunks are supposed to be "reasonably small" [^2]. - -There's some further optimization potential, a `BlobService.Stat()` request -could tell the server it's happy with very small blobs just being inlined in -an additional additional field in the response, which would allow clients to -populate their local chunk store in a single roundtrip. - -## Verified Streaming -As already described in `./docs/blobstore-chunking.md`, the physical chunk -information sent in a `BlobService.Stat()` response is still sufficient to fetch -in an authenticated fashion. - -The exact protocol and formats are still a bit in flux, but here's some notes: - - - `BlobService.Stat()` request gets a `send_bao` field (bool), signalling a - [BAO][bao-spec] should be sent. Could also be `bao_shift` integer, signalling - how detailed (down to the leaf chunks) it should go. - The exact format (and request fields) still need to be defined, edef has some - ideas around omitting some intermediate hash nodes over the wire and - recomputing them, reducing size by another ~50% over [bao-tree]. - - `BlobService.Stat()` response gets some bao-related fields (`bao_shift` - field, signalling the actual format/shift level the server replies with, the - actual bao, and maybe some format specifier). - It would be nice to also be compatible with the baos used by [iroh], so we - can provide an implementation using it too. - ---- - -[^1]: We might want to have some backchannel, so it becomes possible to provide - feedback to the user that something is downloaded. -[^2]: Something between 512K-4M, TBD. -[bao-spec]: https://github.com/oconnor663/bao/blob/master/docs/spec.md -[bao-tree]: https://github.com/n0-computer/bao-tree -[iroh]: https://github.com/n0-computer/iroh diff --git a/tvix/castore/docs/data-model.md b/tvix/castore/docs/data-model.md deleted file mode 100644 index 5e6220cc23fa..000000000000 --- a/tvix/castore/docs/data-model.md +++ /dev/null @@ -1,50 +0,0 @@ -# Data model - -This provides some more notes on the fields used in castore.proto. 
- -See `//tvix/store/docs/api.md` for the full context. - -## Directory message -`Directory` messages use the blake3 hash of their canonical protobuf -serialization as its identifier. - -A `Directory` message contains three lists, `directories`, `files` and -`symlinks`, holding `DirectoryNode`, `FileNode` and `SymlinkNode` messages -respectively. They describe all the direct child elements that are contained in -a directory. - -All three message types have a `name` field, specifying the (base)name of the -element (which MUST not contain slashes or null bytes, and MUST not be '.' or '..'). -For reproducibility reasons, the lists MUST be sorted by that name and the -name MUST be unique across all three lists. - -In addition to the `name` field, the various *Node messages have the following -fields: - -## DirectoryNode -A `DirectoryNode` message represents a child directory. - -It has a `digest` field, which points to the identifier of another `Directory` -message, making a `Directory` a merkle tree (or strictly speaking, a graph, as -two elements pointing to a child directory with the same contents would point -to the same `Directory` message). - -There's also a `size` field, containing the (total) number of all child -elements in the referenced `Directory`, which helps for inode calculation. - -## FileNode -A `FileNode` message represents a child (regular) file. - -Its `digest` field contains the blake3 hash of the file contents. It can be -looked up in the `BlobService`. - -The `size` field contains the size of the blob the `digest` field refers to. - -The `executable` field specifies whether the file should be marked as -executable or not. - -## SymlinkNode -A `SymlinkNode` message represents a child symlink. - -In addition to the `name` field, the only additional field is the `target`, -which is a string containing the target of the symlink. diff --git a/tvix/castore/docs/why-not-git-trees.md b/tvix/castore/docs/why-not-git-trees.md deleted file mode 100644 index 4a12b4ef5554..000000000000 --- a/tvix/castore/docs/why-not-git-trees.md +++ /dev/null @@ -1,57 +0,0 @@ -## Why not git tree objects? - -We've been experimenting with (some variations of) the git tree and object -format, and ultimately decided against using it as an internal format, and -instead adapted the one documented in the other documents here. - -While the tvix-store API protocol shares some similarities with the format used -in git for trees and objects, the git one has shown some significant -disadvantages: - -### The binary encoding itself - -#### trees -The git tree object format is a very binary, error-prone and -"made-to-be-read-and-written-from-C" format. - -Tree objects are a combination of null-terminated strings, and fields of known -length. References to other tree objects use the literal sha1 hash of another -tree object in this encoding. -Extensions of the format/changes are very hard to do right, because parsers are -not aware they might be parsing something different. - -The tvix-store protocol uses a canonical protobuf serialization, and uses -the [blake3][blake3] hash of that serialization to point to other `Directory` -messages. -It's both compact and with a wide range of libraries for encoders and decoders -in many programming languages. -The choice of protobuf makes it easy to add new fields, and make old clients -aware of some unknown fields being detected [^adding-fields]. - -#### blob -On disk, git blob objects start with a "blob" prefix, then the size of the -payload, and then the data itself. 
The hash of a blob is the literal sha1sum -over all of this - which makes it something very git specific to request for. - -tvix-store simply uses the [blake3][blake3] hash of the literal contents -when referring to a file/blob, which makes it very easy to ask other data -sources for the same data, as no git-specific payload is included in the hash. -This also plays very well together with things like [iroh][iroh-discussion], -which plans to provide a way to substitute (large)blobs by their blake3 hash -over the IPFS network. - -In addition to that, [blake3][blake3] makes it possible to do -[verified streaming][bao], as already described in other parts of the -documentation. - -The git tree object format uses sha1 both for references to other trees and -hashes of blobs, which isn't really a hash function to fundamentally base -everything on in 2023. -The [migration to sha256][git-sha256] also has been dead for some years now, -and it's unclear what a "blake3" version of this would even look like. - -[bao]: https://github.com/oconnor663/bao -[blake3]: https://github.com/BLAKE3-team/BLAKE3 -[git-sha256]: https://git-scm.com/docs/hash-function-transition/ -[iroh-discussion]: https://github.com/n0-computer/iroh/discussions/707#discussioncomment-5070197 -[^adding-fields]: Obviously, adding new fields will change hashes, but it's something that's easy to detect. \ No newline at end of file diff --git a/tvix/docs/src/SUMMARY.md b/tvix/docs/src/SUMMARY.md index 5ae1647e4125..7c25c55ee4d9 100644 --- a/tvix/docs/src/SUMMARY.md +++ b/tvix/docs/src/SUMMARY.md @@ -4,6 +4,13 @@ - [Architecture & data flow](./architecture.md) - [TODOs](./TODO.md) +# Store +- [Store API](./store/api.md) +- [BlobStore Chunking](./castore/blobstore-chunking.md) +- [BlobStore Protocol](./castore/blobstore-protocol.md) +- [Data Model](./castore/data-model.md) +- [Why not git trees?](./castore/why-not-git-trees.md) + # Nix - [Specification of the Nix Language](./language-spec.md) - [Nix language version history](./lang-version.md) diff --git a/tvix/docs/src/castore/blobstore-chunking.md b/tvix/docs/src/castore/blobstore-chunking.md new file mode 100644 index 000000000000..df3c29680257 --- /dev/null +++ b/tvix/docs/src/castore/blobstore-chunking.md @@ -0,0 +1,147 @@ +# BlobStore: Chunking & Verified Streaming + +`tvix-castore`'s BlobStore is a content-addressed storage system, using [blake3] +as hash function. + +Returned data is fetched by using the digest as lookup key, and can be verified +to be correct by feeding the received data through the hash function and +ensuring it matches the digest initially used for the lookup. + +This means, data can be downloaded by any untrusted third-party as well, as the +received data is validated to match the digest it was originally requested with. + +However, for larger blobs of data, having to download the entire blob at once is +wasteful, if we only care about a part of the blob. Think about mounting a +seekable data structure, like loop-mounting an .iso file, or doing partial reads +in a large Parquet file, a column-oriented data format. + +> We want to have the possibility to *seek* into a larger file. + +This however shouldn't compromise on data integrity properties - we should not +need to trust a peer we're downloading from to be "honest" about the partial +data we're reading. We should be able to verify smaller reads. 
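+
+For a whole blob, that verification is simply re-hashing the received bytes
+and comparing the result against the digest used for the lookup. A minimal
+sketch of this baseline, using the [blake3] Rust crate (illustrative only, not
+tvix code):
+
+```rust
+use blake3::Hash;
+
+/// Returns the data only if it matches the digest it was requested under.
+fn verify_blob(expected: &Hash, data: Vec<u8>) -> Result<Vec<u8>, &'static str> {
+    // Re-hash the full blob; blake3::Hash comparison is constant-time.
+    if blake3::hash(&data) == *expected {
+        Ok(data)
+    } else {
+        Err("blob does not match the requested blake3 digest")
+    }
+}
+
+fn main() {
+    let digest = blake3::hash(b"hello world");
+    assert!(verify_blob(&digest, b"hello world".to_vec()).is_ok());
+    assert!(verify_blob(&digest, b"tampered".to_vec()).is_err());
+}
+```
+
+The rest of this document is about keeping this property for *partial* reads.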
+
+Especially when substituting from an untrusted third-party, we want to be able
+to detect quickly if that third-party is sending us wrong data, and terminate
+the connection early.
+
+## Chunking
+In content-addressed systems, this problem has historically been solved by
+breaking larger blobs into smaller chunks, which can be fetched individually,
+and making a hash of *this listing* the blob digest/identifier.
+
+ - BitTorrent for example breaks files up into smaller chunks, and maintains
+   a list of sha1 digests for each of these chunks. Magnet links contain a
+   digest over this listing as an identifier (see [here for more
+   details][bittorrent-v2]).
+   With the identifier, a client can fetch the entire list, and then recursively
+   "unpack the graph" of nodes, until it ends up with a list of individual small
+   chunks, which can be fetched individually.
+ - Similarly, IPFS with its IPLD model builds up a Merkle DAG, and uses the
+   digest of the root node as an identifier.
+
+These approaches solve the problem of being able to fetch smaller chunks in a
+trusted fashion. They can also do some deduplication, in case the same leaf
+nodes appear in multiple places.
+
+However, they also have a big disadvantage: the chunking parameters, and the
+"topology" of the graph structure itself, "bleed" into the root hash of the
+entire data structure.
+
+Depending on the chunking parameters used, there are different representations
+of the same data, causing less data sharing/reuse in the overall system, both
+in terms of how many chunks need to be downloaded vs. are already available
+locally, and in terms of how compactly data is stored on-disk.
+
+This can be worked around by agreeing on a single way of chunking, but it's
+not pretty and misses a lot of deduplication potential.
+
+### Chunking in Tvix' Blobstore
+tvix-castore's BlobStore uses a hybrid approach to eliminate some of these
+disadvantages, while still being content-addressed internally, retaining the
+benefits highlighted above.
+
+It uses [blake3] as its hash function, and the blake3 digest of **the raw data
+itself** as an identifier (rather than some application-specific Merkle DAG
+that also embeds some chunking information).
+
+BLAKE3 is a tree hash in which all left nodes are fully populated, contrary to
+conventional serial hash functions. To be able to validate the hash of a node,
+one only needs the hashes of its (two) children [^1], if any.
+
+This means one only needs the root digest to validate a construction, and
+these constructions can be sent [separately][bao-spec].
+
+This relieves us from having to encode more granular chunking into our data
+model / identifier upfront, and makes it mostly a transport/storage concern.
+
+For some more description on the (remote) protocol, check
+`./blobstore-protocol.md`.
+
+#### Logical vs. physical chunking
+
+Due to the properties of the BLAKE3 hash function, we have logical blocks of
+1KiB, but this doesn't necessarily imply we need to restrict ourselves to these
+chunk sizes w.r.t. what "physical chunks" are sent over the wire between peers,
+or are stored on-disk.
+
+The only thing we need to be able to read and verify an arbitrary byte range is
+the covering range of aligned 1K blocks, and a construction from the root
+digest to those blocks.
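+
+The arithmetic involved is simple. A small illustration (not tvix code) of how
+a byte range maps to covering aligned 1 KiB blocks, and how a `bao_shift` maps
+to the resulting logical leaf chunk size:
+
+```rust
+/// BLAKE3 operates on logical blocks of 1 KiB.
+const BLOCK_SIZE: u64 = 1024;
+
+/// The range of aligned 1 KiB blocks covering the byte range [start, end).
+fn covering_blocks(start: u64, end: u64) -> std::ops::Range<u64> {
+    let first = start / BLOCK_SIZE; // round down
+    let last = (end + BLOCK_SIZE - 1) / BLOCK_SIZE; // round up (exclusive)
+    first..last
+}
+
+/// Leaf chunk size for a given `bao_shift`: 1 KiB << shift.
+fn leaf_chunk_size(bao_shift: u8) -> u64 {
+    BLOCK_SIZE << bao_shift
+}
+
+fn main() {
+    // Reading bytes 3000..5000 requires the aligned 1 KiB blocks 2, 3 and 4.
+    assert_eq!(covering_blocks(3000, 5000), 2..5);
+    // A bao_shift of 2 yields 4 KiB logical leaf chunks, as in the example below.
+    assert_eq!(leaf_chunk_size(2), 4096);
+}
+```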
+ +Note the intermediate hash tree can be further trimmed, [omitting][bao-tree] +lower parts of the tree while still providing verified streaming - at the cost +of having to fetch larger covering ranges of aligned blocks. + +Let's pick an example. We identify each KiB by a number here for illustrational +purposes. + +Assuming we omit the last two layers of the hash tree, we end up with logical +4KiB leaf chunks (`bao_shift` of `2`). + +For a blob of 14 KiB total size, we could fetch logical blocks `[0..=3]`, +`[4..=7]`, `[8..=11]` and `[12..=13]` in an authenticated fashion: + +`[ 0 1 2 3 ] [ 4 5 6 7 ] [ 8 9 10 11 ] [ 12 13 ]` + +Assuming the server now informs us about the following physical chunking: + +``` +[ 0 1 ] [ 2 3 4 5 ] [ 6 ] [ 7 8 ] [ 9 10 11 12 13 14 15 ]` +``` + +If our application now wants to arbitrarily read from 0 until 4 (inclusive): + +``` +[ 0 1 ] [ 2 3 4 5 ] [ 6 ] [ 7 8 ] [ 9 10 11 12 13 14 15 ] + |-------------| + +``` + +…we need to fetch physical chunks `[ 0 1 ]`, `[ 2 3 4 5 ]` and `[ 6 ] [ 7 8 ]`. + + +`[ 0 1 ]` and `[ 2 3 4 5 ]` are obvious, they contain the data we're +interested in. + +We however also need to fetch the physical chunks `[ 6 ]` and `[ 7 8 ]`, so we +can assemble `[ 4 5 6 7 ]` to verify both logical chunks: + +``` +[ 0 1 ] [ 2 3 4 5 ] [ 6 ] [ 7 8 ] [ 9 10 11 12 13 14 15 ] +^ ^ ^ ^ +|----4KiB----|------4KiB-----| +``` + +Each physical chunk fetched can be validated to have the blake3 digest that was +communicated upfront, and can be stored in a client-side cache/storage, so +subsequent / other requests for the same data will be fast(er). + +--- + +[^1]: and the surrounding context, aka position inside the whole blob, which is available while verifying the tree +[bittorrent-v2]: https://blog.libtorrent.org/2020/09/bittorrent-v2/ +[blake3]: https://github.com/BLAKE3-team/BLAKE3 +[bao-spec]: https://github.com/oconnor663/bao/blob/master/docs/spec.md +[bao-tree]: https://github.com/n0-computer/bao-tree diff --git a/tvix/docs/src/castore/blobstore-protocol.md b/tvix/docs/src/castore/blobstore-protocol.md new file mode 100644 index 000000000000..048cafc3d877 --- /dev/null +++ b/tvix/docs/src/castore/blobstore-protocol.md @@ -0,0 +1,104 @@ +# BlobStore: Protocol / Composition + +This documents describes the protocol that BlobStore uses to substitute blobs +other ("remote") BlobStores. + +How to come up with the blake3 digest of the blob to fetch is left to another +layer in the stack. + +To put this into the context of Tvix as a Nix alternative, a blob represents an +individual file inside a StorePath. +In the Tvix Data Model, this is accomplished by having a `FileNode` (either the +`root_node` in a `PathInfo` message, or a individual file inside a `Directory` +message) encode a BLAKE3 digest. + +However, the whole infrastructure can be applied for other usecases requiring +exchange/storage or access into data of which the blake3 digest is known. + +## Protocol and Interfaces +As an RPC protocol, BlobStore currently uses gRPC. + +On the Rust side of things, every blob service implements the +[`BlobService`](../src/blobservice/mod.rs) async trait, which isn't +gRPC-specific. + +This `BlobService` trait provides functionality to check for existence of Blobs, +read from blobs, and write new blobs. +It also provides a method to ask for more granular chunks if they are available. 
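+
+For illustration, a rough sketch of the shape of such a trait; method names
+and signatures here are simplified stand-ins, not the actual definitions in
+`../src/blobservice/mod.rs`:
+
+```rust
+use std::io;
+
+use async_trait::async_trait;
+use tokio::io::{AsyncRead, AsyncWrite};
+
+/// Illustrative chunk metadata: a chunk's blake3 digest and its size.
+pub struct ChunkMeta {
+    pub digest: [u8; 32],
+    pub size: u64,
+}
+
+/// Simplified sketch of a blob service covering the same four concerns:
+/// existence checks, reading, writing, and asking for more granular chunks.
+#[async_trait]
+pub trait BlobService: Send + Sync {
+    /// Check whether a blob with the given blake3 digest exists.
+    async fn has(&self, digest: &[u8; 32]) -> io::Result<bool>;
+
+    /// Open a reader over the blob contents, if the blob is known.
+    async fn open_read(
+        &self,
+        digest: &[u8; 32],
+    ) -> io::Result<Option<Box<dyn AsyncRead + Send + Unpin>>>;
+
+    /// Open a writer for a new blob; its digest is only known once all data
+    /// has been written and the writer has been finalized.
+    async fn open_write(&self) -> Box<dyn AsyncWrite + Send + Unpin>;
+
+    /// Ask for more granular chunks of a blob, if the implementation has them.
+    async fn chunks(&self, digest: &[u8; 32]) -> io::Result<Option<Vec<ChunkMeta>>>;
+}
+```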
+
+In addition to some in-memory, on-disk and (soon) object-storage-based
+implementations, we also have a `BlobService` implementation that talks to a
+gRPC server, as well as a gRPC server wrapper component, which provides a gRPC
+service for anything implementing the `BlobService` trait.
+
+This makes it very easy to talk to a remote `BlobService`, which does not even
+need to be written in the same language, as long as it speaks the same gRPC
+protocol.
+
+It also places very few requirements on someone implementing a new
+`BlobService`, or on what its internal storage or chunking algorithm looks like.
+
+The gRPC protocol is documented in `../protos/rpc_blobstore.proto`.
+Contrary to the `BlobService` trait, it does not have any options for seeking/
+ranging, as it's more desirable to provide this through chunking (see also
+`./blobstore-chunking.md`).
+
+## Composition
+Different `BlobStore`s are supposed to be "composed"/"layered" to express
+caching, and multiple local and remote sources.
+
+The fronting interface can stay the same; there'd just be multiple "tiers" that
+can respond to requests, depending on where the data resides. [^1]
+
+This makes it very simple for consumers, as they don't need to be aware of the
+entire substitutor config.
+
+The flexibility of this doesn't need to be exposed to the user in the default
+case; in most cases we should be fine with some form of on-disk storage and a
+bunch of substituters with different priorities.
+
+### gRPC Clients
+Clients are encouraged to always read blobs in a chunked fashion (asking for a
+list of chunks for a blob via `BlobService.Stat()`, then fetching chunks via
+`BlobService.Read()` as needed), instead of directly reading the entire blob via
+`BlobService.Read()`.
+
+In a composition setting, this provides an opportunity for caching, and avoids
+downloading some chunks if they're already present locally (for example, because
+they were already downloaded by reading from a similar blob earlier).
+
+It also removes the need for seeking to be a part of the gRPC protocol
+altogether, as chunks are supposed to be "reasonably small" [^2].
+
+There's some further optimization potential: a `BlobService.Stat()` request
+could tell the server it's happy with very small blobs just being inlined in
+an additional field in the response, which would allow clients to populate
+their local chunk store in a single roundtrip.
+
+## Verified Streaming
+As already described in `./blobstore-chunking.md`, the physical chunk
+information sent in a `BlobService.Stat()` response is still sufficient to
+fetch in an authenticated fashion.
+
+The exact protocol and formats are still a bit in flux, but here are some notes:
+
+ - `BlobService.Stat()` request gets a `send_bao` field (bool), signalling that
+   a [BAO][bao-spec] should be sent. This could also be a `bao_shift` integer,
+   signalling how detailed (down to the leaf chunks) it should go.
+   The exact format (and request fields) still need to be defined; edef has some
+   ideas around omitting some intermediate hash nodes over the wire and
+   recomputing them, reducing size by another ~50% over [bao-tree].
+ - `BlobService.Stat()` response gets some bao-related fields (a `bao_shift`
+   field, signalling the actual format/shift level the server replies with, the
+   bao itself, and maybe some format specifier).
+   It would be nice to also be compatible with the baos used by [iroh], so we
+   can provide an implementation using it too.
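+
+Purely to illustrate the proposal above (none of this is a defined protocol
+yet), these fields could look roughly like the following Rust types; all names
+and semantics here are assumptions:
+
+```rust
+/// Hypothetical bao-related extensions to a `BlobService.Stat()` request.
+pub struct StatBlobRequest {
+    /// blake3 digest of the blob being asked about.
+    pub digest: [u8; 32],
+    /// Ask the server to also send a BAO.
+    pub send_bao: bool,
+    /// Requested granularity: leaf chunk size would be 1 KiB << bao_shift.
+    pub bao_shift: u8,
+}
+
+/// Hypothetical response: the chunk list as today, plus the shift level the
+/// server actually used and the (possibly trimmed) bao data itself.
+pub struct StatBlobResponse {
+    /// Physical chunks: (blake3 digest, size) pairs.
+    pub chunks: Vec<([u8; 32], u64)>,
+    /// Shift level the server replied with.
+    pub bao_shift: u8,
+    /// Encoded bao; the exact format is still to be specified.
+    pub bao: Vec<u8>,
+}
+```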
+ +--- + +[^1]: We might want to have some backchannel, so it becomes possible to provide + feedback to the user that something is downloaded. +[^2]: Something between 512K-4M, TBD. +[bao-spec]: https://github.com/oconnor663/bao/blob/master/docs/spec.md +[bao-tree]: https://github.com/n0-computer/bao-tree +[iroh]: https://github.com/n0-computer/iroh diff --git a/tvix/docs/src/castore/data-model.md b/tvix/docs/src/castore/data-model.md new file mode 100644 index 000000000000..5e6220cc23fa --- /dev/null +++ b/tvix/docs/src/castore/data-model.md @@ -0,0 +1,50 @@ +# Data model + +This provides some more notes on the fields used in castore.proto. + +See `//tvix/store/docs/api.md` for the full context. + +## Directory message +`Directory` messages use the blake3 hash of their canonical protobuf +serialization as its identifier. + +A `Directory` message contains three lists, `directories`, `files` and +`symlinks`, holding `DirectoryNode`, `FileNode` and `SymlinkNode` messages +respectively. They describe all the direct child elements that are contained in +a directory. + +All three message types have a `name` field, specifying the (base)name of the +element (which MUST not contain slashes or null bytes, and MUST not be '.' or '..'). +For reproducibility reasons, the lists MUST be sorted by that name and the +name MUST be unique across all three lists. + +In addition to the `name` field, the various *Node messages have the following +fields: + +## DirectoryNode +A `DirectoryNode` message represents a child directory. + +It has a `digest` field, which points to the identifier of another `Directory` +message, making a `Directory` a merkle tree (or strictly speaking, a graph, as +two elements pointing to a child directory with the same contents would point +to the same `Directory` message). + +There's also a `size` field, containing the (total) number of all child +elements in the referenced `Directory`, which helps for inode calculation. + +## FileNode +A `FileNode` message represents a child (regular) file. + +Its `digest` field contains the blake3 hash of the file contents. It can be +looked up in the `BlobService`. + +The `size` field contains the size of the blob the `digest` field refers to. + +The `executable` field specifies whether the file should be marked as +executable or not. + +## SymlinkNode +A `SymlinkNode` message represents a child symlink. + +In addition to the `name` field, the only additional field is the `target`, +which is a string containing the target of the symlink. diff --git a/tvix/docs/src/castore/why-not-git-trees.md b/tvix/docs/src/castore/why-not-git-trees.md new file mode 100644 index 000000000000..4a12b4ef5554 --- /dev/null +++ b/tvix/docs/src/castore/why-not-git-trees.md @@ -0,0 +1,57 @@ +## Why not git tree objects? + +We've been experimenting with (some variations of) the git tree and object +format, and ultimately decided against using it as an internal format, and +instead adapted the one documented in the other documents here. + +While the tvix-store API protocol shares some similarities with the format used +in git for trees and objects, the git one has shown some significant +disadvantages: + +### The binary encoding itself + +#### trees +The git tree object format is a very binary, error-prone and +"made-to-be-read-and-written-from-C" format. + +Tree objects are a combination of null-terminated strings, and fields of known +length. References to other tree objects use the literal sha1 hash of another +tree object in this encoding. 
+Extensions of the format/changes are very hard to do right, because parsers are +not aware they might be parsing something different. + +The tvix-store protocol uses a canonical protobuf serialization, and uses +the [blake3][blake3] hash of that serialization to point to other `Directory` +messages. +It's both compact and with a wide range of libraries for encoders and decoders +in many programming languages. +The choice of protobuf makes it easy to add new fields, and make old clients +aware of some unknown fields being detected [^adding-fields]. + +#### blob +On disk, git blob objects start with a "blob" prefix, then the size of the +payload, and then the data itself. The hash of a blob is the literal sha1sum +over all of this - which makes it something very git specific to request for. + +tvix-store simply uses the [blake3][blake3] hash of the literal contents +when referring to a file/blob, which makes it very easy to ask other data +sources for the same data, as no git-specific payload is included in the hash. +This also plays very well together with things like [iroh][iroh-discussion], +which plans to provide a way to substitute (large)blobs by their blake3 hash +over the IPFS network. + +In addition to that, [blake3][blake3] makes it possible to do +[verified streaming][bao], as already described in other parts of the +documentation. + +The git tree object format uses sha1 both for references to other trees and +hashes of blobs, which isn't really a hash function to fundamentally base +everything on in 2023. +The [migration to sha256][git-sha256] also has been dead for some years now, +and it's unclear what a "blake3" version of this would even look like. + +[bao]: https://github.com/oconnor663/bao +[blake3]: https://github.com/BLAKE3-team/BLAKE3 +[git-sha256]: https://git-scm.com/docs/hash-function-transition/ +[iroh-discussion]: https://github.com/n0-computer/iroh/discussions/707#discussioncomment-5070197 +[^adding-fields]: Obviously, adding new fields will change hashes, but it's something that's easy to detect. \ No newline at end of file diff --git a/tvix/docs/src/store/api.md b/tvix/docs/src/store/api.md new file mode 100644 index 000000000000..c5a5c477aa17 --- /dev/null +++ b/tvix/docs/src/store/api.md @@ -0,0 +1,288 @@ +tvix-[ca]store API +============== + +This document outlines the design of the API exposed by tvix-castore and tvix- +store, as well as other implementations of this store protocol. + +This document is meant to be read side-by-side with +[castore.md](../../castore/docs/data-model.md) which describes the data model +in more detail. + +The store API has four main consumers: + +1. The evaluator (or more correctly, the CLI/coordinator, in the Tvix + case) communicates with the store to: + + * Upload files and directories (e.g. from `builtins.path`, or `src = ./path` + Nix expressions). + * Read files from the store where necessary (e.g. when `nixpkgs` is + located in the store, or for IFD). + +2. The builder communicates with the store to: + + * Upload files and directories after a build, to persist build artifacts in + the store. + +3. Tvix clients (such as users that have Tvix installed, or, depending + on perspective, builder environments) expect the store to + "materialise" on disk to provide a directory layout with store + paths. + +4. Stores may communicate with other stores, to substitute already built store + paths, i.e. a store acts as a binary cache for other stores. 
+ +The store API attempts to reuse parts of its API between these three +consumers by making similarities explicit in the protocol. This leads +to a protocol that is slightly more complex than a simple "file +upload/download"-system, but at significantly greater efficiency, both in terms +of deduplication opportunities as well as granularity. + +## The Store model + +Contents inside a tvix-store can be grouped into three different message types: + + * Blobs + * Directories + * PathInfo (see further down) + +(check `castore.md` for more detailed field descriptions) + +### Blobs +A blob object contains the literal file contents of regular (or executable) +files. + +### Directory +A directory object describes the direct children of a directory. + +It contains: + - name of child (regular or executable) files, and their [blake3][blake3] hash. + - name of child symlinks, and their target (as string) + - name of child directories, and their [blake3][blake3] hash (forming a Merkle DAG) + +### Content-addressed Store Model +For example, lets consider a directory layout like this, with some +imaginary hashes of file contents: + +``` +. +├── file-1.txt hash: 5891b5b522d5df086d0ff0b110fb +└── nested + └── file-2.txt hash: abc6fd595fc079d3114d4b71a4d8 +``` + +A hash for the *directory* `nested` can be created by creating the `Directory` +object: + +```json +{ + "directories": [], + "files": [{ + "name": "file-2.txt", + "digest": "abc6fd595fc079d3114d4b71a4d8", + "size": 123, + }], + "symlink": [], +} +``` + +And then hashing a serialised form of that data structure. We use the blake3 +hash of the canonical protobuf representation. Let's assume the hash was +`ff0029485729bcde993720749232`. + +To create the directory object one layer up, we now refer to our `nested` +directory object in `directories`, and to `file-1.txt` in `files`: + +```json +{ + "directories": [{ + "name": "nested", + "digest": "ff0029485729bcde993720749232", + "size": 1, + }], + "files": [{ + "name": "file-1.txt", + "digest": "5891b5b522d5df086d0ff0b110fb", + "size": 124, + }] +} +``` + +This Merkle DAG of Directory objects, and flat store of blobs can be used to +describe any file/directory/symlink inside a store path. Due to its content- +addressed nature, it'll automatically deduplicate (re-)used (sub)directories, +and allow substitution from any (untrusted) source. + +The thing that's now only missing is the metadata to map/"mount" from the +content-addressed world to a physical path. + +### PathInfo +As most paths in the Nix store currently are input-addressed [^input-addressed], +and the `tvix-castore` data model is also not intrinsically using NAR hashes, +we need something mapping from an input-addressed "output path hash" (or a Nix- +specific content-addressed path) to the contents in the `tvix-castore` world. + +That's what `PathInfo` provides. It embeds the root node (Directory, File or +Symlink) at a given store path. + +The root nodes' `name` field is populated with the (base)name inside +`/nix/store`, so `xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx-pname-1.2.3`. + +The `PathInfo` message also stores references to other store paths, and some +more NARInfo-specific metadata (signatures, narhash, narsize). + + +## API overview + +There's three different services: + +### BlobService +`BlobService` can be used to store and retrieve blobs of data, used to host +regular file contents. + +It is content-addressed, using [blake3][blake3] +as a hashing function. 
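+
+`Directory` messages (see the Store model section above) are addressed the
+same way, by the blake3 hash of their canonical protobuf serialization. A
+rough sketch of that computation; the hand-written message types below are
+only stand-ins for the generated castore.proto types, and their field tags are
+illustrative:
+
+```rust
+use prost::Message;
+
+/// Stand-in for the generated castore.proto `FileNode` message.
+#[derive(Clone, PartialEq, prost::Message)]
+pub struct FileNode {
+    #[prost(bytes = "vec", tag = "1")]
+    pub name: Vec<u8>,
+    #[prost(bytes = "vec", tag = "2")]
+    pub digest: Vec<u8>,
+    #[prost(uint64, tag = "3")]
+    pub size: u64,
+    #[prost(bool, tag = "4")]
+    pub executable: bool,
+}
+
+/// Stand-in for the generated `Directory` message (directory and symlink
+/// children omitted for brevity).
+#[derive(Clone, PartialEq, prost::Message)]
+pub struct Directory {
+    #[prost(message, repeated, tag = "2")]
+    pub files: Vec<FileNode>,
+}
+
+/// The identifier of a Directory: blake3 over its protobuf serialization.
+/// (The real implementation takes care that this serialization is canonical.)
+fn directory_digest(d: &Directory) -> blake3::Hash {
+    blake3::hash(&d.encode_to_vec())
+}
+
+fn main() {
+    // Mirrors the `nested` directory example above, with a placeholder digest.
+    let d = Directory {
+        files: vec![FileNode {
+            name: b"file-2.txt".to_vec(),
+            digest: vec![0x12; 32],
+            size: 123,
+            executable: false,
+        }],
+    };
+    println!("directory digest: {}", directory_digest(&d));
+}
+```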
+ +As blake3 is a tree hash, there's an opportunity to do +[verified streaming][bao] of parts of the file, +which doesn't need to trust any more information than the root hash itself. +Future extensions of the `BlobService` protocol will enable this. + +### DirectoryService +`DirectoryService` allows lookups (and uploads) of `Directory` messages, and +whole reference graphs of them. + + +### PathInfoService +The PathInfo service provides lookups from a store path hash to a `PathInfo` +message. + +## Example flows + +Below there are some common use cases of tvix-store, and how the different +services are used. + +### Upload files and directories +This is needed for `builtins.path` or `src = ./path` in Nix expressions (A), as +well as for uploading build artifacts to a store (B). + +The path specified needs to be (recursively, BFS-style) traversed. + * All file contents need to be hashed with blake3, and submitted to the + *BlobService* if not already present. + A reference to them needs to be added to the parent Directory object that's + constructed. + * All symlinks need to be added to the parent directory they reside in. + * Whenever a Directory has been fully traversed, it needs to be uploaded to + the *DirectoryService* and a reference to it needs to be added to the parent + Directory object. + +Most of the hashing / directory traversal/uploading can happen in parallel, +as long as Directory objects only refer to Directory objects and Blobs that +have already been uploaded. + +When reaching the root, a `PathInfo` object needs to be constructed. + + * In the case of content-addressed paths (A), the name of the root node is + based on the NAR representation of the contents. + It might make sense to be able to offload the NAR calculation to the store, + which can cache it. + * In the case of build artifacts (B), the output path is input-addressed and + known upfront. + +Contrary to Nix, this has the advantage of not having to upload a lot of things +to the store that didn't change. + +### Reading files from the store from the evaluator +This is the case when `nixpkgs` is located in the store, or IFD in general. + +The store client asks the `PathInfoService` for the `PathInfo` of the output +path in the request, and looks at the root node. + +If something other than the root of the store path is requested, like for +example `maintainers/maintainer-list.nix`, the root_node Directory is inspected +and potentially a chain of `Directory` objects requested from +*DirectoryService*. [^n+1query]. + +When the desired file is reached, the *BlobService* can be used to read the +contents of this file, and return it back to the evaluator. + +FUTUREWORK: define how importing from symlinks should/does work. + +Contrary to Nix, this has the advantage of not having to copy all of the +contents of a store path to the evaluating machine, but really only fetching +the files the evaluator currently cares about. + +### Materializing store paths on disk +This is useful for people running a Tvix-only system, or running builds on a +"Tvix remote builder" in its own mount namespace. + +In a system with Nix installed, we can't simply manually "extract" things to +`/nix/store`, as Nix assumes to own all writes to this location. +In these use cases, we're probably better off exposing a tvix-store as a local +binary cache (that's what `//tvix/nar-bridge-go` does). 
+ +Assuming we are in an environment where we control `/nix/store` exclusively, a +"realize to disk" would either "extract" things from the `tvix-store` to a +filesystem, or expose a `FUSE`/`virtio-fs` filesystem. + +The latter is already implemented, and particularly interesting for (remote) +build workloads, as build inputs can be realized on-demand, which saves copying +around a lot of never- accessed files. + +In both cases, the API interactions are similar. + * The *PathInfoService* is asked for the `PathInfo` of the requested store path. + * If everything should be "extracted", the *DirectoryService* is asked for all + `Directory` objects in the closure, the file structure is created, all Blobs + are downloaded and placed in their corresponding location and all symlinks + are created accordingly. + * If this is a FUSE filesystem, we can decide to only request a subset, + similar to the "Reading files from the store from the evaluator" use case, + even though it might make sense to keep all Directory objects around. + (See the caveat in "Trust model" though!) + +### Stores communicating with other stores +The gRPC API exposed by the tvix-store allows composing multiple stores, and +implementing some caching strategies, that store clients don't need to be aware +of. + + * For example, a caching strategy could have a fast local tvix-store, that's + asked first and filled with data from a slower remote tvix-store. + + * Multiple stores could be asked for the same data, and whatever store returns + the right data first wins. + + +## Trust model / Distribution +As already described above, the only non-content-addressed service is the +`PathInfo` service. + +This means, all other messages (such as `Blob` and `Directory` messages) can be +substituted from many different, untrusted sources/mirrors, which will make +plugging in additional substitution strategies like IPFS, local network +neighbors super simple. That's also why it's living in the `tvix-castore` crate. + +As for `PathInfo`, we don't specify an additional signature mechanism yet, but +carry the NAR-based signatures from Nix along. + +This means, if we don't trust a remote `PathInfo` object, we currently need to +"stream" the NAR representation to validate these signatures. + +However, the slow part is downloading of NAR files, and considering we have +more granularity available, we might only need to download some small blobs, +rather than a whole NAR file. + +A future signature mechanism, that is only signing (parts of) the `PathInfo` +message, which only points to content-addressed data will enable verified +partial access into a store path, opening up opportunities for lazy filesystem +access etc. + + + +[blake3]: https://github.com/BLAKE3-team/BLAKE3 +[bao]: https://github.com/oconnor663/bao +[^input-addressed]: Nix hashes the A-Term representation of a .drv, after doing + some replacements on refered Input Derivations to calculate + output paths. +[^n+1query]: This would expose an N+1 query problem. 
However it's not a problem + in practice, as there's usually always a "local" caching store in + the loop, and *DirectoryService* supports a recursive lookup for + all `Directory` children of a `Directory` diff --git a/tvix/store/default.nix b/tvix/store/default.nix index 78b499114cae..3fe47fe60b11 100644 --- a/tvix/store/default.nix +++ b/tvix/store/default.nix @@ -33,7 +33,7 @@ in })).overrideAttrs (old: rec { meta.ci = { targets = [ "integration-tests" ] ++ lib.filter (x: lib.hasPrefix "with-features" x || x == "no-features") (lib.attrNames passthru); - extraSteps.import-docs = (mkImportCheck "tvix/store/docs" ./docs); + extraSteps.import-docs = (mkImportCheck "tvix/docs/src/store" ../docs/src/store); }; passthru = (depot.tvix.utils.mkFeaturePowerset { inherit (old) crateName; diff --git a/tvix/store/docs/api.md b/tvix/store/docs/api.md deleted file mode 100644 index c5a5c477aa17..000000000000 --- a/tvix/store/docs/api.md +++ /dev/null @@ -1,288 +0,0 @@ -tvix-[ca]store API -============== - -This document outlines the design of the API exposed by tvix-castore and tvix- -store, as well as other implementations of this store protocol. - -This document is meant to be read side-by-side with -[castore.md](../../castore/docs/data-model.md) which describes the data model -in more detail. - -The store API has four main consumers: - -1. The evaluator (or more correctly, the CLI/coordinator, in the Tvix - case) communicates with the store to: - - * Upload files and directories (e.g. from `builtins.path`, or `src = ./path` - Nix expressions). - * Read files from the store where necessary (e.g. when `nixpkgs` is - located in the store, or for IFD). - -2. The builder communicates with the store to: - - * Upload files and directories after a build, to persist build artifacts in - the store. - -3. Tvix clients (such as users that have Tvix installed, or, depending - on perspective, builder environments) expect the store to - "materialise" on disk to provide a directory layout with store - paths. - -4. Stores may communicate with other stores, to substitute already built store - paths, i.e. a store acts as a binary cache for other stores. - -The store API attempts to reuse parts of its API between these three -consumers by making similarities explicit in the protocol. This leads -to a protocol that is slightly more complex than a simple "file -upload/download"-system, but at significantly greater efficiency, both in terms -of deduplication opportunities as well as granularity. - -## The Store model - -Contents inside a tvix-store can be grouped into three different message types: - - * Blobs - * Directories - * PathInfo (see further down) - -(check `castore.md` for more detailed field descriptions) - -### Blobs -A blob object contains the literal file contents of regular (or executable) -files. - -### Directory -A directory object describes the direct children of a directory. - -It contains: - - name of child (regular or executable) files, and their [blake3][blake3] hash. - - name of child symlinks, and their target (as string) - - name of child directories, and their [blake3][blake3] hash (forming a Merkle DAG) - -### Content-addressed Store Model -For example, lets consider a directory layout like this, with some -imaginary hashes of file contents: - -``` -. 
-├── file-1.txt hash: 5891b5b522d5df086d0ff0b110fb -└── nested - └── file-2.txt hash: abc6fd595fc079d3114d4b71a4d8 -``` - -A hash for the *directory* `nested` can be created by creating the `Directory` -object: - -```json -{ - "directories": [], - "files": [{ - "name": "file-2.txt", - "digest": "abc6fd595fc079d3114d4b71a4d8", - "size": 123, - }], - "symlink": [], -} -``` - -And then hashing a serialised form of that data structure. We use the blake3 -hash of the canonical protobuf representation. Let's assume the hash was -`ff0029485729bcde993720749232`. - -To create the directory object one layer up, we now refer to our `nested` -directory object in `directories`, and to `file-1.txt` in `files`: - -```json -{ - "directories": [{ - "name": "nested", - "digest": "ff0029485729bcde993720749232", - "size": 1, - }], - "files": [{ - "name": "file-1.txt", - "digest": "5891b5b522d5df086d0ff0b110fb", - "size": 124, - }] -} -``` - -This Merkle DAG of Directory objects, and flat store of blobs can be used to -describe any file/directory/symlink inside a store path. Due to its content- -addressed nature, it'll automatically deduplicate (re-)used (sub)directories, -and allow substitution from any (untrusted) source. - -The thing that's now only missing is the metadata to map/"mount" from the -content-addressed world to a physical path. - -### PathInfo -As most paths in the Nix store currently are input-addressed [^input-addressed], -and the `tvix-castore` data model is also not intrinsically using NAR hashes, -we need something mapping from an input-addressed "output path hash" (or a Nix- -specific content-addressed path) to the contents in the `tvix-castore` world. - -That's what `PathInfo` provides. It embeds the root node (Directory, File or -Symlink) at a given store path. - -The root nodes' `name` field is populated with the (base)name inside -`/nix/store`, so `xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx-pname-1.2.3`. - -The `PathInfo` message also stores references to other store paths, and some -more NARInfo-specific metadata (signatures, narhash, narsize). - - -## API overview - -There's three different services: - -### BlobService -`BlobService` can be used to store and retrieve blobs of data, used to host -regular file contents. - -It is content-addressed, using [blake3][blake3] -as a hashing function. - -As blake3 is a tree hash, there's an opportunity to do -[verified streaming][bao] of parts of the file, -which doesn't need to trust any more information than the root hash itself. -Future extensions of the `BlobService` protocol will enable this. - -### DirectoryService -`DirectoryService` allows lookups (and uploads) of `Directory` messages, and -whole reference graphs of them. - - -### PathInfoService -The PathInfo service provides lookups from a store path hash to a `PathInfo` -message. - -## Example flows - -Below there are some common use cases of tvix-store, and how the different -services are used. - -### Upload files and directories -This is needed for `builtins.path` or `src = ./path` in Nix expressions (A), as -well as for uploading build artifacts to a store (B). - -The path specified needs to be (recursively, BFS-style) traversed. - * All file contents need to be hashed with blake3, and submitted to the - *BlobService* if not already present. - A reference to them needs to be added to the parent Directory object that's - constructed. - * All symlinks need to be added to the parent directory they reside in. 
- * Whenever a Directory has been fully traversed, it needs to be uploaded to - the *DirectoryService* and a reference to it needs to be added to the parent - Directory object. - -Most of the hashing / directory traversal/uploading can happen in parallel, -as long as Directory objects only refer to Directory objects and Blobs that -have already been uploaded. - -When reaching the root, a `PathInfo` object needs to be constructed. - - * In the case of content-addressed paths (A), the name of the root node is - based on the NAR representation of the contents. - It might make sense to be able to offload the NAR calculation to the store, - which can cache it. - * In the case of build artifacts (B), the output path is input-addressed and - known upfront. - -Contrary to Nix, this has the advantage of not having to upload a lot of things -to the store that didn't change. - -### Reading files from the store from the evaluator -This is the case when `nixpkgs` is located in the store, or IFD in general. - -The store client asks the `PathInfoService` for the `PathInfo` of the output -path in the request, and looks at the root node. - -If something other than the root of the store path is requested, like for -example `maintainers/maintainer-list.nix`, the root_node Directory is inspected -and potentially a chain of `Directory` objects requested from -*DirectoryService*. [^n+1query]. - -When the desired file is reached, the *BlobService* can be used to read the -contents of this file, and return it back to the evaluator. - -FUTUREWORK: define how importing from symlinks should/does work. - -Contrary to Nix, this has the advantage of not having to copy all of the -contents of a store path to the evaluating machine, but really only fetching -the files the evaluator currently cares about. - -### Materializing store paths on disk -This is useful for people running a Tvix-only system, or running builds on a -"Tvix remote builder" in its own mount namespace. - -In a system with Nix installed, we can't simply manually "extract" things to -`/nix/store`, as Nix assumes to own all writes to this location. -In these use cases, we're probably better off exposing a tvix-store as a local -binary cache (that's what `//tvix/nar-bridge-go` does). - -Assuming we are in an environment where we control `/nix/store` exclusively, a -"realize to disk" would either "extract" things from the `tvix-store` to a -filesystem, or expose a `FUSE`/`virtio-fs` filesystem. - -The latter is already implemented, and particularly interesting for (remote) -build workloads, as build inputs can be realized on-demand, which saves copying -around a lot of never- accessed files. - -In both cases, the API interactions are similar. - * The *PathInfoService* is asked for the `PathInfo` of the requested store path. - * If everything should be "extracted", the *DirectoryService* is asked for all - `Directory` objects in the closure, the file structure is created, all Blobs - are downloaded and placed in their corresponding location and all symlinks - are created accordingly. - * If this is a FUSE filesystem, we can decide to only request a subset, - similar to the "Reading files from the store from the evaluator" use case, - even though it might make sense to keep all Directory objects around. - (See the caveat in "Trust model" though!) - -### Stores communicating with other stores -The gRPC API exposed by the tvix-store allows composing multiple stores, and -implementing some caching strategies, that store clients don't need to be aware -of. 
- - * For example, a caching strategy could have a fast local tvix-store, that's - asked first and filled with data from a slower remote tvix-store. - - * Multiple stores could be asked for the same data, and whatever store returns - the right data first wins. - - -## Trust model / Distribution -As already described above, the only non-content-addressed service is the -`PathInfo` service. - -This means, all other messages (such as `Blob` and `Directory` messages) can be -substituted from many different, untrusted sources/mirrors, which will make -plugging in additional substitution strategies like IPFS, local network -neighbors super simple. That's also why it's living in the `tvix-castore` crate. - -As for `PathInfo`, we don't specify an additional signature mechanism yet, but -carry the NAR-based signatures from Nix along. - -This means, if we don't trust a remote `PathInfo` object, we currently need to -"stream" the NAR representation to validate these signatures. - -However, the slow part is downloading of NAR files, and considering we have -more granularity available, we might only need to download some small blobs, -rather than a whole NAR file. - -A future signature mechanism, that is only signing (parts of) the `PathInfo` -message, which only points to content-addressed data will enable verified -partial access into a store path, opening up opportunities for lazy filesystem -access etc. - - - -[blake3]: https://github.com/BLAKE3-team/BLAKE3 -[bao]: https://github.com/oconnor663/bao -[^input-addressed]: Nix hashes the A-Term representation of a .drv, after doing - some replacements on refered Input Derivations to calculate - output paths. -[^n+1query]: This would expose an N+1 query problem. However it's not a problem - in practice, as there's usually always a "local" caching store in - the loop, and *DirectoryService* supports a recursive lookup for - all `Directory` children of a `Directory` -- cgit 1.4.1