From e05e8bdd036ea0bd185a2234921f2fcfc146cd55 Mon Sep 17 00:00:00 2001
From: Florian Klink
Date: Sat, 26 Nov 2022 21:16:31 +0000
Subject: docs(tvix/store): add README, document services and store model

These are intended to help digest the protocol definitions for tvix-store,
and how they tie into the whole concept.

Co-Authored-By: Vincent Ambo
Change-Id: Ic1ba3ba41ef599209453f15d0ac2e07a6144bcca
Reviewed-on: https://cl.tvl.fyi/c/depot/+/7439
Tested-by: BuildkiteCI
Reviewed-by: tazjin
---
 tvix/store/README.md            |  59 +++++++++
 tvix/store/docs/api.md          | 279 ++++++++++++++++++++++++++++++++++++++++
 tvix/store/docs/castore.md      |  50 +++++++
 tvix/store/protos/castore.proto |   5 +-
 4 files changed, 391 insertions(+), 2 deletions(-)
 create mode 100644 tvix/store/README.md
 create mode 100644 tvix/store/docs/api.md
 create mode 100644 tvix/store/docs/castore.md

diff --git a/tvix/store/README.md b/tvix/store/README.md
new file mode 100644
index 0000000000..7844264ca1
--- /dev/null
+++ b/tvix/store/README.md
@@ -0,0 +1,59 @@
+# //tvix/store
+
+This contains the code hosting the tvix-store.
+
+For the local store, Nix realizes files on the filesystem in `/nix/store` (and
+maintains some metadata in a SQLite database). For "remote stores", it
+communicates this metadata in NAR (Nix ARchive) and NARInfo format.
+
+Compared to the Nix model, `tvix-store` stores data at a much more granular
+level, which provides more deduplication possibilities and more granular
+copying.
+
+However, enough information is preserved to still be able to render NAR and
+NARInfo (handled by `//tvix/nar-bridge`).
+
+## More Information
+Check the `protos/` subfolder for the definition of the exact RPC methods and
+messages.
+
+
+## Interacting with the gRPC service manually
+The shell environment in `//tvix` provides `evans`, which is an interactive
+REPL-based gRPC client.
+
+You can use it to connect to a `tvix-store` and call the various RPC methods.
+
+```shell
+$ cargo run &
+$ evans --host localhost --port 8000 -r repl
+  ______
+ |  ____|
+ | |__    __   __   __ _   _ __    ___
+ |  __|   \ \ / /  / _. | | '_ \  / __|
+ | |____   \ V /  | (_| | | | | | \__ \
+ |______|   \_/    \__,_| |_| |_| |___/
+
+ more expressive universal gRPC client
+
+
+tvix.store.v1@localhost:8000> service BlobService
+
+tvix.store.v1.BlobService@localhost:8000> call Put --bytes-from-file
+data (TYPE_BYTES) => /run/current-system/system
+{
+  "digest": "KOM3/IHEx7YfInAnlJpAElYezq0Sxn9fRz7xuClwNfA="
+}
+
+tvix.store.v1.BlobService@localhost:8000> call Get --bytes-as-base64
+digest (TYPE_BYTES) => KOM3/IHEx7YfInAnlJpAElYezq0Sxn9fRz7xuClwNfA=
+{
+  "data": "eDg2XzY0LWxpbnV4"
+}
+
+$ echo eDg2XzY0LWxpbnV4 | base64 -d
+x86_64-linux
+```
+
+Thanks to `tvix-store` providing gRPC Server Reflection (with the `reflection`
+feature), you don't need to point `evans` to the `.proto` files.
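+
+The `digest` returned by `Put` is the [blake3](https://github.com/BLAKE3-team/BLAKE3)
+hash of the literal blob contents, so it can also be computed locally. A
+minimal Rust sketch (assuming the `blake3` and `base64` crates; `evans`
+displays `bytes` fields base64-encoded):
+
+```rust
+use base64::{engine::general_purpose::STANDARD, Engine as _};
+
+fn main() -> std::io::Result<()> {
+    // Hash the same file that was uploaded via `call Put` above.
+    let contents = std::fs::read("/run/current-system/system")?;
+    let digest = blake3::hash(&contents);
+
+    // Print it base64-encoded, like the `digest` field shown by evans.
+    println!("{}", STANDARD.encode(digest.as_bytes()));
+    Ok(())
+}
+```
+
+For identical file contents, this should print the same digest that the `Put`
+call above returned.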
diff --git a/tvix/store/docs/api.md b/tvix/store/docs/api.md
new file mode 100644
index 0000000000..9d2cefa142
--- /dev/null
+++ b/tvix/store/docs/api.md
@@ -0,0 +1,279 @@
+tvix-store API
+==============
+
+This document outlines the design of the API exposed by tvix-store, as
+well as other implementations of this store protocol.
+
+The store API has four main consumers:
+
+1. The evaluator (or more correctly, the CLI/coordinator, in the Tvix
+   case) communicates with the store to:
+
+   * Upload files and directories (e.g. from `builtins.path`, or `src = ./path`
+     Nix expressions).
+   * Read files from the store where necessary (e.g. when `nixpkgs` is
+     located in the store, or for IFD).
+
+2. The builder communicates with the store to:
+
+   * Upload files and directories after a build, to persist build artifacts in
+     the store.
+
+3. Tvix clients (such as users that have Tvix installed, or, depending
+   on perspective, builder environments) expect the store to
+   "materialise" on disk to provide a directory layout with store
+   paths.
+
+4. Stores may communicate with other stores, to substitute already built store
+   paths, i.e. a store acts as a binary cache for other stores.
+
+The store API attempts to reuse parts of its API between these four
+consumers by making similarities explicit in the protocol. This leads
+to a protocol that is slightly more complex than a simple "file
+upload/download" system, but at significantly greater efficiency, both in terms
+of deduplication opportunities as well as granularity.
+
+## The Store model
+
+Contents inside a tvix-store can be grouped into three different message types:
+
+ * Blobs
+ * Directories
+ * PathInfo (see further down)
+
+(check `castore.md` for more detailed field descriptions)
+
+### Blobs
+A blob object contains the literal file contents of regular (or executable)
+files.
+
+### Directory
+A directory object describes the direct children of a directory.
+
+It contains:
+ - the name of child regular (or executable) files, and their [blake3][blake3] hash
+ - the name of child symlinks, and their target (as a string)
+ - the name of child directories, and their [blake3][blake3] hash (forming a Merkle DAG)
+
+### Content-addressed Store Model
+For example, let's consider a directory layout like this, with some
+imaginary hashes of file contents:
+
+```
+.
+├── file-1.txt        hash: 5891b5b522d5df086d0ff0b110fb
+└── nested
+    └── file-2.txt    hash: abc6fd595fc079d3114d4b71a4d8
+```
+
+A hash for the *directory* `nested` can be created by creating the `Directory`
+object:
+
+```json
+{
+  "directories": [],
+  "files": [{
+    "name": "file-2.txt",
+    "digest": "abc6fd595fc079d3114d4b71a4d8",
+    "size": 123
+  }],
+  "symlinks": []
+}
+```
+
+And then hashing a serialised form of that data structure. We use the blake3
+hash of the canonical protobuf representation. Let's assume the hash was
+`ff0029485729bcde993720749232`.
+
+To create the directory object one layer up, we now refer to our `nested`
+directory object in `directories`, and to `file-1.txt` in `files`:
+
+```json
+{
+  "directories": [{
+    "name": "nested",
+    "digest": "ff0029485729bcde993720749232",
+    "size": 1
+  }],
+  "files": [{
+    "name": "file-1.txt",
+    "digest": "5891b5b522d5df086d0ff0b110fb",
+    "size": 124
+  }]
+}
+```
+
+This Merkle DAG of Directory objects, and flat store of blobs, can be used to
+describe any file/directory/symlink inside a store path. Due to its content-
+addressed nature, it'll automatically deduplicate (re-)used (sub)directories,
+and allow substitution from any (untrusted) source.
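+
+To make the hashing step concrete, below is a rough Rust sketch of how such a
+digest could be computed, using the `prost` and `blake3` crates. The message
+definitions in the sketch are hand-written stand-ins with assumed field
+numbers; the authoritative definitions live in `castore.proto`, and the exact
+(canonical) encoding has to match for digests to agree with a real tvix-store:
+
+```rust
+use prost::Message;
+
+// Stand-ins for the castore.proto messages; field numbers are assumptions.
+#[derive(Clone, PartialEq, Message)]
+struct Directory {
+    #[prost(message, repeated, tag = "1")]
+    directories: Vec<DirectoryNode>,
+    #[prost(message, repeated, tag = "2")]
+    files: Vec<FileNode>,
+    #[prost(message, repeated, tag = "3")]
+    symlinks: Vec<SymlinkNode>,
+}
+
+#[derive(Clone, PartialEq, Message)]
+struct DirectoryNode {
+    #[prost(string, tag = "1")]
+    name: String,
+    #[prost(bytes = "vec", tag = "2")]
+    digest: Vec<u8>,
+    #[prost(uint32, tag = "3")]
+    size: u32,
+}
+
+#[derive(Clone, PartialEq, Message)]
+struct FileNode {
+    #[prost(string, tag = "1")]
+    name: String,
+    #[prost(bytes = "vec", tag = "2")]
+    digest: Vec<u8>,
+    #[prost(uint32, tag = "3")]
+    size: u32,
+    #[prost(bool, tag = "4")]
+    executable: bool,
+}
+
+#[derive(Clone, PartialEq, Message)]
+struct SymlinkNode {
+    #[prost(string, tag = "1")]
+    name: String,
+    #[prost(string, tag = "2")]
+    target: String,
+}
+
+/// The identifier of a Directory is the blake3 hash of its protobuf
+/// serialization (assuming the encoder emits the expected canonical form).
+fn directory_digest(d: &Directory) -> blake3::Hash {
+    blake3::hash(&d.encode_to_vec())
+}
+
+fn main() {
+    // The `nested` directory from the example above.
+    let nested = Directory {
+        directories: vec![],
+        files: vec![FileNode {
+            name: "file-2.txt".to_string(),
+            digest: vec![0x42; 32], // placeholder for the blob's blake3 digest
+            size: 123,
+            executable: false,
+        }],
+        symlinks: vec![],
+    };
+
+    // The parent Directory then refers to `nested` by this digest in its
+    // `directories` list, which is what forms the Merkle DAG.
+    println!("{}", directory_digest(&nested).to_hex());
+}
+```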
+
+The only thing still missing is the metadata to map/"mount" from the
+content-addressed world to a physical path.
+
+### PathInfo
+As most paths in the Nix store currently are input-addressed [^input-addressed],
+we need something mapping from an input-addressed "output path hash" to the
+contents in the content-addressed world.
+
+That's what `PathInfo` provides. It embeds the root node (Directory, File or
+Symlink) at a given store path.
+
+The root node's `name` field is populated with the (base)name inside
+`/nix/store`, so `xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx-pname-1.2.3`.
+
+The `PathInfo` message also stores references to other store paths, and some
+more NARInfo-specific metadata (signatures, narhash, narsize).
+
+
+## API overview
+
+There are three different services:
+
+### BlobService
+`BlobService` can be used to store and retrieve blobs of data, used to host
+regular file contents.
+
+It is content-addressed, using [blake3](https://github.com/BLAKE3-team/BLAKE3)
+as a hashing function.
+
+As blake3 is a tree hash, there's an opportunity to do
+[verified streaming](https://github.com/oconnor663/bao) of parts of the file,
+which doesn't need to trust any more information than the root hash itself.
+Future extensions of the `BlobService` protocol will enable this.
+
+### DirectoryService
+`DirectoryService` allows lookups (and uploads) of `Directory` messages, and
+whole reference graphs of them.
+
+### PathInfoService
+The `PathInfoService` provides lookups from an output path hash to a `PathInfo`
+message.
+
+## Example flows
+
+Below are some common use cases of tvix-store, and how the different
+services are used.
+
+### Upload files and directories
+This is needed for `builtins.path` or `src = ./path` in Nix expressions (A), as
+well as for uploading build artifacts to a store (B).
+
+The path specified needs to be (recursively, BFS-style) traversed:
+ * All file contents need to be hashed with blake3, and submitted to the
+   *BlobService* if not already present.
+   A reference to them needs to be added to the parent Directory object that's
+   being constructed.
+ * All symlinks need to be added to the parent directory they reside in.
+ * Whenever a Directory has been fully traversed, it needs to be uploaded to
+   the *DirectoryService*, and a reference to it needs to be added to the
+   parent Directory object.
+
+Most of the hashing, directory traversal and uploading can happen in parallel,
+as long as Directory objects only refer to Directory objects and Blobs that
+have already been uploaded (see the sketch at the end of this section).
+
+When reaching the root, a `PathInfo` object needs to be constructed.
+
+ * In the case of content-addressed paths (A), the name of the root node is
+   based on the NAR representation of the contents.
+   It might make sense to be able to offload the NAR calculation to the store,
+   which can cache it.
+ * In the case of build artifacts (B), the output path is input-addressed and
+   known upfront.
+
+Contrary to Nix, this has the advantage of not having to upload a lot of things
+to the store that didn't change.
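+
+As a rough illustration of the traversal described above (a depth-first
+variant; any order works as long as children are uploaded before their
+parents), the Rust sketch below walks a local path and uploads blobs and
+`Directory` messages through two hypothetical client traits. The types are
+simplified stand-ins for the real protobuf messages and gRPC stubs:
+
+```rust
+use std::io;
+use std::path::Path;
+
+type Digest = [u8; 32];
+
+// Simplified stand-in for the castore.proto Directory message.
+#[derive(Default)]
+struct Directory {
+    directories: Vec<(String, Digest)>, // (name, digest of child Directory)
+    files: Vec<(String, Digest)>,       // (name, blob digest); the real
+                                        // FileNode also carries size and
+                                        // the executable bit
+    symlinks: Vec<(String, String)>,    // (name, target)
+}
+
+// Hypothetical clients, standing in for BlobService/DirectoryService stubs.
+trait BlobService {
+    fn put(&mut self, data: Vec<u8>) -> Digest;
+}
+trait DirectoryService {
+    fn put(&mut self, directory: Directory) -> Digest;
+}
+
+/// Recursively ingests `path`, uploading leaves first, and returns the digest
+/// of the Directory message describing `path` itself.
+fn ingest(
+    path: &Path,
+    blobs: &mut impl BlobService,
+    dirs: &mut impl DirectoryService,
+) -> io::Result<Digest> {
+    let mut directory = Directory::default();
+
+    for entry in std::fs::read_dir(path)? {
+        let entry = entry?;
+        let name = entry.file_name().to_string_lossy().into_owned();
+        let file_type = entry.file_type()?;
+
+        if file_type.is_dir() {
+            // Child directories are fully traversed (and uploaded) first, so
+            // this Directory only refers to already-uploaded children.
+            let digest = ingest(&entry.path(), blobs, dirs)?;
+            directory.directories.push((name, digest));
+        } else if file_type.is_symlink() {
+            let target = std::fs::read_link(entry.path())?;
+            directory
+                .symlinks
+                .push((name, target.to_string_lossy().into_owned()));
+        } else {
+            // Regular file: upload the literal contents as a blob.
+            let digest = blobs.put(std::fs::read(entry.path())?);
+            directory.files.push((name, digest));
+        }
+    }
+
+    // Sort all three lists by name for reproducibility, then upload the
+    // Directory message itself.
+    directory.directories.sort_by(|a, b| a.0.cmp(&b.0));
+    directory.files.sort_by(|a, b| a.0.cmp(&b.0));
+    directory.symlinks.sort_by(|a, b| a.0.cmp(&b.0));
+    Ok(dirs.put(directory))
+}
+```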
+
+### Reading files from the store from the evaluator
+This is the case when `nixpkgs` is located in the store, or for IFD in general.
+
+The store client asks the `PathInfoService` for the `PathInfo` of the output
+path in the request, and looks at the root node.
+
+If something other than the root path is requested, the root_node Directory is
+inspected and potentially a chain of `Directory` objects is requested from
+*DirectoryService*. [^n+1query]
+
+When the desired file is reached, the *BlobService* can be used to read the
+contents of this file, and return it to the evaluator.
+
+FUTUREWORK: define how importing from symlinks should/does work.
+
+Contrary to Nix, this has the advantage of not having to copy all of the
+contents of a store path to the evaluating machine, but really only fetching
+the files the evaluator currently cares about.
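+
+Sketched as (hypothetical) Rust code, with simplified node and message types
+and client traits standing in for the real gRPC stubs, resolving a file below
+a store path looks roughly like this:
+
+```rust
+use std::collections::HashMap;
+
+type Digest = [u8; 32];
+
+// Simplified stand-ins for the real messages.
+enum Node {
+    Directory { digest: Digest },
+    File { digest: Digest },
+    Symlink { target: String },
+}
+
+struct Directory {
+    // Child name -> node; the real Directory message uses three sorted lists.
+    nodes: HashMap<String, Node>,
+}
+
+// Hypothetical clients for the three services.
+trait PathInfoService {
+    /// Looks up a PathInfo by output path hash and returns its root node.
+    fn root_node(&self, output_path_hash: &str) -> Option<Node>;
+}
+trait DirectoryService {
+    fn get(&self, digest: &Digest) -> Option<Directory>;
+}
+trait BlobService {
+    fn get(&self, digest: &Digest) -> Option<Vec<u8>>;
+}
+
+/// Reads `relative_path` (e.g. "pkgs/top-level/all-packages.nix") below the
+/// store path identified by `output_path_hash`.
+fn read_file(
+    output_path_hash: &str,
+    relative_path: &str,
+    path_infos: &impl PathInfoService,
+    dirs: &impl DirectoryService,
+    blobs: &impl BlobService,
+) -> Option<Vec<u8>> {
+    let mut node = path_infos.root_node(output_path_hash)?;
+
+    // Walk down one Directory message per path component.
+    for component in relative_path.split('/').filter(|c| !c.is_empty()) {
+        let digest = match node {
+            Node::Directory { digest } => digest,
+            _ => return None, // tried to descend into a file or symlink
+        };
+        let mut directory = dirs.get(&digest)?;
+        node = directory.nodes.remove(component)?;
+    }
+
+    match node {
+        Node::File { digest } => blobs.get(&digest),
+        _ => None, // symlink handling is FUTUREWORK; directories aren't files
+    }
+}
+```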
+
+### Materializing store paths on disk
+This is useful for people running a Tvix-only system, or running builds on a
+"Tvix remote builder" in its own mount namespace.
+
+In a system with Nix installed, we can't simply manually "extract" things to
+`/nix/store`, as Nix assumes it owns all writes to this location.
+In these use cases, we're probably better off exposing a tvix-store as a local
+binary cache (that's what nar-bridge does).
+
+Assuming we are in an environment where we control `/nix/store` exclusively, a
+"realize to disk" would either "extract" things from the tvix-store to a
+filesystem, or expose a FUSE filesystem. The latter would be particularly
+interesting for remote build workloads, as build inputs can be realized on-
+demand, which saves copying around a lot of never-accessed files.
+
+In both cases, the API interactions are similar:
+ * The *PathInfoService* is asked for the `PathInfo` of the requested store path.
+ * If everything should be "extracted", the *DirectoryService* is asked for all
+   `Directory` objects in the closure, the file structure is created, all Blobs
+   are downloaded and placed in their corresponding location, and all symlinks
+   are created accordingly.
+ * If this is a FUSE filesystem, we can decide to only request a subset,
+   similar to the "Reading files from the store from the evaluator" use case,
+   even though it might make sense to keep all Directory objects around.
+   (See the caveat in "Trust model" though!)
+
+### Stores communicating with other stores
+The gRPC API exposed by the tvix-store allows composing multiple stores, and
+implementing caching strategies that store clients don't need to be aware of:
+
+ * For example, a caching strategy could have a fast local tvix-store that's
+   asked first and is filled with data from a slower remote tvix-store.
+
+ * Multiple stores could be asked for the same data, and whichever store
+   returns the right data first wins.
+
+
+## Trust model / Distribution
+As already described above, the only non-content-addressed service is the
+`PathInfo` service.
+
+This means all other messages (such as `Blob` and `Directory` messages) can be
+substituted from many different, untrusted sources/mirrors, which makes
+plugging in additional substitution strategies, like IPFS or local network
+neighbors, very simple.
+
+As for `PathInfo`, we don't specify an additional signature mechanism yet, but
+carry the NAR-based signatures from Nix along.
+
+This means that if we don't trust a remote `PathInfo` object, we currently need
+to "stream" the NAR representation to validate these signatures.
+
+However, the slow part is downloading the NAR files, and considering we have
+more granularity available, we might only need to download some small blobs,
+rather than a whole NAR file.
+
+A future signature mechanism that only signs (parts of) the `PathInfo` message,
+which itself only points to content-addressed data, will enable verified
+partial access into a store path, opening up opportunities for lazy filesystem
+access, which is very useful in remote builder scenarios.
+
+
+
+[blake3]: https://github.com/BLAKE3-team/BLAKE3
+[^input-addressed]: Nix hashes the A-Term representation of a .drv, after doing
+                    some replacements on referenced Input Derivations, to
+                    calculate output paths.
+[^n+1query]: This would expose an N+1 query problem. However, it's not a
+             problem in practice, as there's usually a "local" caching store in
+             the loop, and *DirectoryService* supports a recursive lookup for
+             all `Directory` children of a `Directory`.
\ No newline at end of file
diff --git a/tvix/store/docs/castore.md b/tvix/store/docs/castore.md
new file mode 100644
index 0000000000..f555ba5a86
--- /dev/null
+++ b/tvix/store/docs/castore.md
@@ -0,0 +1,50 @@
+# //tvix/store/docs/castore.md
+
+This provides some more notes on the fields used in castore.proto.
+
+It's meant to supplement `//tvix/store/docs/api.md`.
+
+## Directory message
+`Directory` messages use the blake3 hash of their canonical protobuf
+serialization as their identifier.
+
+A `Directory` message contains three lists, `directories`, `files` and
+`symlinks`, holding `DirectoryNode`, `FileNode` and `SymlinkNode` messages
+respectively. They describe all the direct child elements that are contained in
+a directory.
+
+All three message types have a `name` field, specifying the (base)name of the
+element (which MUST not contain slashes or null bytes, and MUST not be '.' or '..').
+For reproducibility reasons, the lists MUST be sorted by that name, and the
+names MUST be unique across all three lists.
+
+In addition to the `name` field, the various *Node messages have the following
+fields:
+
+## DirectoryNode
+A `DirectoryNode` message represents a child directory.
+
+It has a `digest` field, which points to the identifier of another `Directory`
+message, making a `Directory` a Merkle tree (or strictly speaking, a graph, as
+two elements pointing to a child directory with the same contents would point
+to the same `Directory` message).
+
+There's also a `size` field, containing the (total) number of all child
+elements in the referenced `Directory`, which helps with inode calculation.
+
+## FileNode
+A `FileNode` message represents a child (regular) file.
+
+Its `digest` field contains the blake3 hash of the file contents. It can be
+looked up in the `BlobService`.
+
+The `size` field contains the size of the blob the `digest` field refers to.
+
+The `executable` field specifies whether the file should be marked as
+executable or not.
+
+## SymlinkNode
+A `SymlinkNode` message represents a child symlink.
+
+In addition to the `name` field, its only other field is `target`, a string
+containing the target of the symlink.
diff --git a/tvix/store/protos/castore.proto b/tvix/store/protos/castore.proto
index e73160bf9d..3d380c63da 100644
--- a/tvix/store/protos/castore.proto
+++ b/tvix/store/protos/castore.proto
@@ -9,8 +9,9 @@ package tvix.store.v1;
 // Each of these nodes have a name attribute, which is the basename in that directory
 // and node type specific attributes.
 // The name attribute:
-// - may not contain slashes or null bytes
-// - needs to be unique across all three lists
+// - MUST not contain slashes or null bytes
+// - MUST not be '.' or '..'
+// - MUST be unique across all three lists
 // Elements in each list need to be lexicographically ordered by the name
 // attribute.
 message Directory {
-- 
cgit 1.4.1