author    Florian Klink <flokli@flokli.de>  2024-06-13T19·04+0300
committer clbot <clbot@tvl.fyi>             2024-06-14T08·00+0000
commit    6947dc4349fa85cb702f46acfe3255c907096b12 (patch)
tree      50d50d214a20901b29944e20b722489e0e1c3253 /tvix/docs
parent    adc7353bd16d07e13efb7f6a84b9f93601a07705 (diff)
chore(tvix/docs): move [ca]store docs to tvix/docs r/8269
Change-Id: Idd78ffae34b6ea7b93d13de73b98c61a348869fb
Reviewed-on: https://cl.tvl.fyi/c/depot/+/11808
Tested-by: BuildkiteCI
Reviewed-by: tazjin <tazjin@tvl.su>
Autosubmit: flokli <flokli@flokli.de>
Diffstat (limited to 'tvix/docs')
-rw-r--r-- tvix/docs/src/SUMMARY.md                          7
-rw-r--r-- tvix/docs/src/castore/blobstore-chunking.md     147
-rw-r--r-- tvix/docs/src/castore/blobstore-protocol.md     104
-rw-r--r-- tvix/docs/src/castore/data-model.md              50
-rw-r--r-- tvix/docs/src/castore/why-not-git-trees.md       57
-rw-r--r-- tvix/docs/src/store/api.md                      288
6 files changed, 653 insertions, 0 deletions
diff --git a/tvix/docs/src/SUMMARY.md b/tvix/docs/src/SUMMARY.md
index 5ae1647e4125..7c25c55ee4d9 100644
--- a/tvix/docs/src/SUMMARY.md
+++ b/tvix/docs/src/SUMMARY.md
@@ -4,6 +4,13 @@
 - [Architecture & data flow](./architecture.md)
 - [TODOs](./TODO.md)
 
+# Store
+- [Store API](./store/api.md)
+- [BlobStore Chunking](./castore/blobstore-chunking.md)
+- [BlobStore Protocol](./castore/blobstore-protocol.md)
+- [Data Model](./castore/data-model.md)
+- [Why not git trees?](./castore/why-not-git-trees.md)
+
 # Nix
 - [Specification of the Nix Language](./language-spec.md)
 - [Nix language version history](./lang-version.md)
diff --git a/tvix/docs/src/castore/blobstore-chunking.md b/tvix/docs/src/castore/blobstore-chunking.md
new file mode 100644
index 000000000000..df3c29680257
--- /dev/null
+++ b/tvix/docs/src/castore/blobstore-chunking.md
@@ -0,0 +1,147 @@
+# BlobStore: Chunking & Verified Streaming
+
+`tvix-castore`'s BlobStore is a content-addressed storage system, using [blake3]
+as its hash function.
+
+Returned data is fetched by using the digest as lookup key, and can be verified
+to be correct by feeding the received data through the hash function and
+ensuring it matches the digest initially used for the lookup.
+
+This means data can also be downloaded from any untrusted third party, as the
+received data is validated to match the digest it was originally requested with.
+
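+As a minimal illustration of that verification step, here's a hedged Rust
+sketch using the [blake3] crate; the fetching/transport side is out of scope
+and the function name is made up.
+
+```rust
+// Sketch: accept data from an untrusted peer only if it hashes back to the
+// digest we originally asked for.
+fn verify_blob(expected: &blake3::Hash, received: &[u8]) -> bool {
+    blake3::hash(received) == *expected
+}
+```
+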
+However, for larger blobs of data, having to download the entire blob at once
+is wasteful if we only care about a part of the blob. Think about mounting a
+seekable data structure, like loop-mounting an .iso file, or doing partial reads
+in a large Parquet file, a column-oriented data format.
+
+> We want to have the possibility to *seek* into a larger file.
+
+This however shouldn't compromise on data integrity properties - we should not
+need to trust a peer we're downloading from to be "honest" about the partial
+data we're reading. We should be able to verify smaller reads.
+
+Especially when substituting from an untrusted third-party, we want to be able
+to detect quickly if that third-party is sending us wrong data, and terminate
+the connection early.
+
+## Chunking
+In content-addressed systems, this problem has historically been solved by
+breaking larger blobs into smaller chunks, which can be fetched individually,
+and making a hash of *this listing* the blob digest/identifier.
+
+ - BitTorrent for example breaks files up into smaller chunks, and maintains
+   a list of sha1 digests for each of these chunks. Magnet links contain a
+   digest over this listing as an identifier. (See [here for more
+   details][bittorrent-v2].)
+   With the identifier, a client can fetch the entire list, and then recursively
+   "unpack the graph" of nodes, until it ends up with a list of individual small
+   chunks, which can be fetched individually.
+ - Similarly, IPFS with its IPLD model builds up a Merkle DAG, and uses the
+   digest of the root node as an identifier.
+
+These approaches solve the problem of being able to fetch smaller chunks in a
+trusted fashion. They can also do some deduplication, in case the same leaf
+nodes appear in multiple places.
+
+However, they also have a big disadvantage. The chunking parameters, and the
+"topology" of the graph structure, "bleed" into the root hash of the entire
+data structure.
+
+Depending on the chunking parameters used, there are different representations
+of the same data, causing less data sharing/reuse in the overall system, both in
+terms of how many chunks need to be downloaded vs. how many are already
+available locally, and in terms of how compactly data is stored on disk.
+
+This can be worked around by agreeing on a single way of chunking, but it's
+not pretty and misses a lot of deduplication potential.
+
+### Chunking in Tvix' Blobstore
+tvix-castore's BlobStore uses a hybrid approach to eliminate some of these
+disadvantages, while still being content-addressed internally, with the
+benefits highlighted above.
+
+It uses [blake3] as hash function, and the blake3 digest of **the raw data
+itself** as an identifier (rather than some application-specific Merkle DAG that
+also embeds some chunking information).
+
+BLAKE3 is a tree hash in which all left nodes are fully populated, contrary to
+conventional serial hash functions. To be able to validate the hash of a node,
+one only needs the hashes of its (two) children [^1], if any.
+
+This means one only needs the root digest to validate a construction, and these
+constructions can be sent [separately][bao-spec].
+
+This relieves us from having to encode more granular chunking into our data
+model / identifier upfront, and makes it mostly a transport/storage concern.
+
+For some more description on the (remote) protocol, check
+`./blobstore-protocol.md`.
+
+#### Logical vs. physical chunking
+
+Due to the properties of the BLAKE3 hash function, we have logical blocks of
+1KiB, but this doesn't necessarily imply we need to restrict ourselves to these
+chunk sizes w.r.t. what "physical chunks" are sent over the wire between peers,
+or are stored on-disk.
+
+The only thing we need in order to read and verify an arbitrary byte range is
+the covering range of aligned 1K blocks, and a construction from the root
+digest down to those 1K blocks.
+
+Note the intermediate hash tree can be further trimmed, [omitting][bao-tree]
+lower parts of the tree while still providing verified streaming - at the cost
+of having to fetch larger covering ranges of aligned blocks.
+
+Let's pick an example. We identify each KiB by a number here for illustration
+purposes.
+
+Assuming we omit the last two layers of the hash tree, we end up with logical
+4KiB leaf chunks (`bao_shift` of `2`).
+
+For a blob of 14 KiB total size, we could fetch logical blocks `[0..=3]`,
+`[4..=7]`, `[8..=11]` and `[12..=13]` in an authenticated fashion:
+
+`[ 0 1 2 3 ] [ 4 5 6 7 ] [ 8 9 10 11 ] [ 12 13 ]`
+
+Assuming the server now informs us about the following physical chunking:
+
+```
+[ 0 1 ] [ 2 3 4 5 ] [ 6 ] [ 7 8 ] [ 9 10 11 12 13 14 15 ]
+```
+
+If our application now wants to do an arbitrary read from block 0 through 4 (inclusive):
+
+```
+[ 0 1 ] [ 2 3 4 5 ] [ 6 ] [ 7 8 ] [ 9 10 11 12 13 14 15 ]
+ |-------------|
+
+```
+
+…we need to fetch the physical chunks `[ 0 1 ]`, `[ 2 3 4 5 ]`, `[ 6 ]` and
+`[ 7 8 ]`.
+
+
+`[ 0 1 ]` and `[ 2 3 4 5 ]` are obvious: they contain the data we're
+interested in.
+
+However, we also need to fetch the physical chunks `[ 6 ]` and `[ 7 8 ]`, so we
+can assemble `[ 4 5 6 7 ]` to verify both logical chunks:
+
+```
+[ 0 1 ] [ 2 3 4 5 ] [ 6 ] [ 7 8 ] [ 9 10 11 12 13 14 15 ]
+^       ^           ^     ^
+|----4KiB----|------4KiB-----|
+```
+
+Each physical chunk fetched can be validated to have the blake3 digest that was
+communicated upfront, and can be stored in a client-side cache/storage, so
+subsequent / other requests for the same data will be fast(er).
+
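+The arithmetic behind this walk-through can be written down compactly. The
+following Rust sketch (names made up for illustration) computes which logical
+leaf chunks must be fetched for a read range expressed in 1 KiB block indices,
+given a `bao_shift`:
+
+```rust
+/// Which logical leaf chunks (of size 1 KiB << bao_shift) cover a read range
+/// given in 1 KiB block indices? For bao_shift = 2 and blocks 0..=4, this
+/// yields leaves 0..=1, i.e. the logical chunks `[0..=3]` and `[4..=7]`.
+fn covering_leaves(first_block: u64, last_block: u64, bao_shift: u32) -> std::ops::RangeInclusive<u64> {
+    (first_block >> bao_shift)..=(last_block >> bao_shift)
+}
+```
+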
+---
+
+[^1]: and the surrounding context, aka position inside the whole blob, which is available while verifying the tree
+[bittorrent-v2]: https://blog.libtorrent.org/2020/09/bittorrent-v2/
+[blake3]: https://github.com/BLAKE3-team/BLAKE3
+[bao-spec]: https://github.com/oconnor663/bao/blob/master/docs/spec.md
+[bao-tree]: https://github.com/n0-computer/bao-tree
diff --git a/tvix/docs/src/castore/blobstore-protocol.md b/tvix/docs/src/castore/blobstore-protocol.md
new file mode 100644
index 000000000000..048cafc3d877
--- /dev/null
+++ b/tvix/docs/src/castore/blobstore-protocol.md
@@ -0,0 +1,104 @@
+# BlobStore: Protocol / Composition
+
+This document describes the protocol that BlobStore uses to substitute blobs
+from other ("remote") BlobStores.
+
+How to come up with the blake3 digest of the blob to fetch is left to another
+layer in the stack.
+
+To put this into the context of Tvix as a Nix alternative, a blob represents an
+individual file inside a StorePath.
+In the Tvix Data Model, this is accomplished by having a `FileNode` (either the
+`root_node` in a `PathInfo` message, or an individual file inside a `Directory`
+message) encode a BLAKE3 digest.
+
+However, the whole infrastructure can be applied to other use cases requiring
+exchange, storage of, or access to data whose blake3 digest is known.
+
+## Protocol and Interfaces
+As an RPC protocol, BlobStore currently uses gRPC.
+
+On the Rust side of things, every blob service implements the
+[`BlobService`](../src/blobservice/mod.rs) async trait, which isn't
+gRPC-specific.
+
+This `BlobService` trait provides functionality to check for the existence of
+blobs, read from blobs, and write new blobs.
+It also provides a method to ask for more granular chunks if they are available.
+
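+To give an idea of its shape, here's an illustrative Rust sketch. It is *not*
+the actual definition in `../src/blobservice/mod.rs`; names and signatures
+there differ.
+
+```rust
+use tokio::io::{AsyncRead, AsyncWrite};
+
+/// Stand-in for per-chunk metadata (digest and size of one chunk).
+pub struct ChunkMeta {
+    pub digest: [u8; 32],
+    pub size: u64,
+}
+
+#[async_trait::async_trait]
+pub trait BlobService: Send + Sync {
+    /// Check whether a blob with this blake3 digest exists.
+    async fn has(&self, digest: &[u8; 32]) -> std::io::Result<bool>;
+    /// Open a reader over the blob contents, if the blob is present.
+    async fn open_read(&self, digest: &[u8; 32]) -> std::io::Result<Option<Box<dyn AsyncRead + Send + Unpin>>>;
+    /// Open a writer for a new blob; finalizing it yields the blob's digest.
+    async fn open_write(&self) -> Box<dyn AsyncWrite + Send + Unpin>;
+    /// Ask for more granular chunks of a blob, if the implementation has them.
+    async fn chunks(&self, digest: &[u8; 32]) -> std::io::Result<Option<Vec<ChunkMeta>>>;
+}
+```
+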
+In addition to some in-memory, on-disk and (soon) object-storage-based
+implementations, we also have a `BlobService` implementation that talks to a
+gRPC server, as well as a gRPC server wrapper component, which provides a gRPC
+service for anything implementing the `BlobService` trait.
+
+This makes it very easy to talk to a remote `BlobService`, which does not even
+need to be written in the same language, as long as it speaks the same gRPC
+protocol.
+
+It also puts very few requirements on someone implementing a new
+`BlobService`, or on what its internal storage or chunking algorithm looks like.
+
+The gRPC protocol is documented in `../protos/rpc_blobstore.proto`.
+Contrary to the `BlobService` trait, it does not have any options for seeking/
+ranging, as it's more desirable to provide this through chunking (see also
+`./blobstore-chunking.md`).
+
+## Composition
+Different `BlobStore`s are supposed to be "composed"/"layered" to express
+caching, or to combine multiple local and remote sources.
+
+The fronting interface can stay the same; it'd just be multiple "tiers" that can
+respond to requests, depending on where the data resides. [^1]
+
+This makes it very simple for consumers, as they don't need to be aware of the
+entire substitutor config.
+
+The flexibility of this doesn't need to be exposed to the user in the default
+case; in most cases we should be fine with some form of on-disk storage and a
+bunch of substituters with different priorities.
+
+### gRPC Clients
+Clients are encouraged to always read blobs in a chunked fashion (asking for a
+list of chunks for a blob via `BlobService.Stat()`, then fetching chunks via
+`BlobService.Read()` as needed), instead of directly reading the entire blob via
+`BlobService.Read()`.
+
+In a composition setting, this provides opportunity for caching, and avoids
+downloading some chunks if they're already present locally (for example, because
+they were already downloaded by reading from a similar blob earlier).
+
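+A hedged sketch of that client behaviour, written against the illustrative
+`BlobService` shape sketched earlier (not against the generated gRPC client
+types): fetch the chunk list once, then read each chunk, preferring a local
+tier before going to the remote one. Populating the local tier with freshly
+fetched chunks is omitted for brevity.
+
+```rust
+use tokio::io::AsyncReadExt;
+
+async fn read_blob_chunked(
+    remote: &dyn BlobService,
+    local: &dyn BlobService,
+    blob_digest: [u8; 32],
+) -> std::io::Result<Vec<u8>> {
+    // Ask the remote once for the more granular chunk list (BlobService.Stat()).
+    let chunks = remote.chunks(&blob_digest).await?.unwrap_or_default();
+    let mut out = Vec::new();
+    for chunk in chunks {
+        // Prefer a chunk that's already present locally, otherwise fetch it remotely.
+        let mut reader = match local.open_read(&chunk.digest).await? {
+            Some(r) => r,
+            None => remote
+                .open_read(&chunk.digest)
+                .await?
+                .expect("remote advertised this chunk"),
+        };
+        let mut buf = Vec::with_capacity(chunk.size as usize);
+        reader.read_to_end(&mut buf).await?;
+        // Chunks are content-addressed themselves, so each one is verifiable on its own.
+        assert_eq!(blake3::hash(&buf).as_bytes(), &chunk.digest);
+        out.extend_from_slice(&buf);
+    }
+    Ok(out)
+}
+```
+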
+It also removes the need for seeking to be a part of the gRPC protocol
+altogether, as chunks are supposed to be "reasonably small" [^2].
+
+There's some further optimization potential: a `BlobService.Stat()` request
+could tell the server it's happy with very small blobs just being inlined in
+an additional field in the response, which would allow clients to populate
+their local chunk store in a single roundtrip.
+
+## Verified Streaming
+As already described in `./blobstore-chunking.md`, the physical chunk
+information sent in a `BlobService.Stat()` response is still sufficient to fetch
+in an authenticated fashion.
+
+The exact protocol and formats are still a bit in flux, but here are some notes:
+
+ - `BlobService.Stat()` request gets a `send_bao` field (bool), signalling a
+   [BAO][bao-spec] should be sent. Could also be `bao_shift` integer, signalling
+   how detailed (down to the leaf chunks) it should go.
+   The exact format (and request fields) still need to be defined; edef has some
+   ideas around omitting some intermediate hash nodes over the wire and
+   recomputing them, reducing size by another ~50% over [bao-tree].
+ - `BlobService.Stat()` response gets some bao-related fields (`bao_shift`
+   field, signalling the actual format/shift level the server replies with, the
+   actual bao, and maybe some format specifier).
+   It would be nice to also be compatible with the baos used by [iroh], so we
+   can provide an implementation using it too.
+
+---
+
+[^1]: We might want to have some backchannel, so it becomes possible to provide
+      feedback to the user that something is being downloaded.
+[^2]: Something between 512K-4M, TBD.
+[bao-spec]: https://github.com/oconnor663/bao/blob/master/docs/spec.md
+[bao-tree]: https://github.com/n0-computer/bao-tree
+[iroh]: https://github.com/n0-computer/iroh
diff --git a/tvix/docs/src/castore/data-model.md b/tvix/docs/src/castore/data-model.md
new file mode 100644
index 000000000000..5e6220cc23fa
--- /dev/null
+++ b/tvix/docs/src/castore/data-model.md
@@ -0,0 +1,50 @@
+# Data model
+
+This provides some more notes on the fields used in castore.proto.
+
+See `../store/api.md` for the full context.
+
+## Directory message
+`Directory` messages use the blake3 hash of their canonical protobuf
+serialization as their identifier.
+
+A `Directory` message contains three lists, `directories`, `files` and
+`symlinks`, holding `DirectoryNode`, `FileNode` and `SymlinkNode` messages
+respectively. They describe all the direct child elements that are contained in
+a directory.
+
+All three message types have a `name` field, specifying the (base)name of the
+element (which MUST NOT contain slashes or null bytes, and MUST NOT be '.' or '..').
+For reproducibility reasons, the lists MUST be sorted by that name and the
+names MUST be unique across all three lists.
+
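+As a tiny illustration, the following Rust sketch (helper name made up) checks
+what the per-list requirement boils down to: names must be strictly increasing
+byte-wise, i.e. sorted and free of duplicates. Uniqueness *across* the three
+lists additionally requires that no name shows up in more than one of them.
+
+```rust
+/// Strictly increasing byte-wise order implies both "sorted" and "no duplicates".
+fn sorted_and_unique(names: &[&[u8]]) -> bool {
+    names.windows(2).all(|w| w[0] < w[1])
+}
+
+// sorted_and_unique(&[b"bar", b"baz", b"foo"]) == true
+// sorted_and_unique(&[b"foo", b"bar"]) == false
+```
+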
+In addition to the `name` field, the various *Node messages have the following
+fields:
+
+## DirectoryNode
+A `DirectoryNode` message represents a child directory.
+
+It has a `digest` field, which points to the identifier of another `Directory`
+message, making a `Directory` a merkle tree (or strictly speaking, a graph, as
+two elements pointing to a child directory with the same contents would point
+to the same `Directory` message).
+
+There's also a `size` field, containing the (total) number of all child
+elements in the referenced `Directory`, which helps with inode calculation.
+
+## FileNode
+A `FileNode` message represents a child (regular) file.
+
+Its `digest` field contains the blake3 hash of the file contents. It can be
+looked up in the `BlobService`.
+
+The `size` field contains the size of the blob the `digest` field refers to.
+
+The `executable` field specifies whether the file should be marked as
+executable or not.
+
+## SymlinkNode
+A `SymlinkNode` message represents a child symlink.
+
+In addition to the `name` field, the only other field is `target`, a string
+containing the target of the symlink.
diff --git a/tvix/docs/src/castore/why-not-git-trees.md b/tvix/docs/src/castore/why-not-git-trees.md
new file mode 100644
index 000000000000..4a12b4ef5554
--- /dev/null
+++ b/tvix/docs/src/castore/why-not-git-trees.md
@@ -0,0 +1,57 @@
+## Why not git tree objects?
+
+We've been experimenting with (some variations of) the git tree and object
+format, and ultimately decided against using it as an internal format, and
+instead adapted the one documented in the other documents here.
+
+While the tvix-store API protocol shares some similarities with the format used
+in git for trees and objects, the git one has shown some significant
+disadvantages:
+
+### The binary encoding itself
+
+#### trees
+The git tree object format is a very binary, error-prone and
+"made-to-be-read-and-written-from-C" format.
+
+Tree objects are a combination of null-terminated strings, and fields of known
+length. References to other tree objects use the literal sha1 hash of another
+tree object in this encoding.
+Extensions of the format/changes are very hard to do right, because parsers are
+not aware they might be parsing something different.
+
+The tvix-store protocol uses a canonical protobuf serialization, and uses
+the [blake3][blake3] hash of that serialization to point to other `Directory`
+messages.
+It's both compact and has a wide range of encoder and decoder libraries
+available in many programming languages.
+The choice of protobuf makes it easy to add new fields, and lets old clients
+detect the presence of unknown fields [^adding-fields].
+
+#### blob
+On disk, git blob objects start with a "blob" prefix, then the size of the
+payload, and then the data itself. The hash of a blob is the literal sha1sum
+over all of this, which makes it something very git-specific to request.
+
+tvix-store simply uses the [blake3][blake3] hash of the literal contents
+when referring to a file/blob, which makes it very easy to ask other data
+sources for the same data, as no git-specific payload is included in the hash.
+This also plays very well together with things like [iroh][iroh-discussion],
+which plans to provide a way to substitute (large) blobs by their blake3 hash
+over the IPFS network.
+
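+To make the difference concrete, here's a hedged sketch of the two identifiers
+for the same contents, using the `sha1` and `blake3` Rust crates:
+
+```rust
+use sha1::{Digest, Sha1};
+
+/// git's blob identifier: sha1 over a "blob <size>\0" header plus the contents.
+fn git_blob_sha1(contents: &[u8]) -> Vec<u8> {
+    let mut hasher = Sha1::new();
+    hasher.update(format!("blob {}\0", contents.len()).as_bytes());
+    hasher.update(contents);
+    hasher.finalize().to_vec()
+}
+
+/// tvix-castore's blob identifier: blake3 over the raw contents, nothing else.
+fn castore_blob_digest(contents: &[u8]) -> blake3::Hash {
+    blake3::hash(contents)
+}
+```
+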
+In addition to that, [blake3][blake3] makes it possible to do
+[verified streaming][bao], as already described in other parts of the
+documentation.
+
+The git tree object format uses sha1 both for references to other trees and
+hashes of blobs, which isn't really a hash function to fundamentally base
+everything on in 2023.
+The [migration to sha256][git-sha256] has also been dead for some years now,
+and it's unclear what a "blake3" version of this would even look like.
+
+[bao]: https://github.com/oconnor663/bao
+[blake3]: https://github.com/BLAKE3-team/BLAKE3
+[git-sha256]: https://git-scm.com/docs/hash-function-transition/
+[iroh-discussion]: https://github.com/n0-computer/iroh/discussions/707#discussioncomment-5070197
+[^adding-fields]: Obviously, adding new fields will change hashes, but it's something that's easy to detect.
\ No newline at end of file
diff --git a/tvix/docs/src/store/api.md b/tvix/docs/src/store/api.md
new file mode 100644
index 000000000000..c5a5c477aa17
--- /dev/null
+++ b/tvix/docs/src/store/api.md
@@ -0,0 +1,288 @@
+tvix-[ca]store API
+==============
+
+This document outlines the design of the API exposed by tvix-castore and
+tvix-store, as well as other implementations of this store protocol.
+
+This document is meant to be read side-by-side with
+[the castore data model](../castore/data-model.md), which describes the data
+model in more detail.
+
+The store API has four main consumers:
+
+1. The evaluator (or more correctly, the CLI/coordinator, in the Tvix
+   case) communicates with the store to:
+
+   * Upload files and directories (e.g. from `builtins.path`, or `src = ./path`
+     Nix expressions).
+   * Read files from the store where necessary (e.g. when `nixpkgs` is
+     located in the store, or for IFD).
+
+2. The builder communicates with the store to:
+
+   * Upload files and directories after a build, to persist build artifacts in
+     the store.
+
+3. Tvix clients (such as users that have Tvix installed, or, depending
+   on perspective, builder environments) expect the store to
+   "materialise" on disk to provide a directory layout with store
+   paths.
+
+4. Stores may communicate with other stores, to substitute already built store
+   paths, i.e. a store acts as a binary cache for other stores.
+
+The store API attempts to reuse parts of its API between these four
+consumers by making similarities explicit in the protocol. This leads
+to a protocol that is slightly more complex than a simple "file
+upload/download" system, but significantly more efficient, both in terms
+of deduplication opportunities as well as granularity.
+
+## The Store model
+
+Contents inside a tvix-store can be grouped into three different message types:
+
+ * Blobs
+ * Directories
+ * PathInfo (see further down)
+
+(check `../castore/data-model.md` for more detailed field descriptions)
+
+### Blobs
+A blob object contains the literal file contents of regular (or executable)
+files.
+
+### Directory
+A directory object describes the direct children of a directory.
+
+It contains:
+ - name of child (regular or executable) files, and their [blake3][blake3] hash.
+ - name of child symlinks, and their target (as string)
+ - name of child directories, and their [blake3][blake3] hash (forming a Merkle DAG)
+
+### Content-addressed Store Model
+For example, let's consider a directory layout like this, with some
+imaginary hashes of file contents:
+
+```
+.
+├── file-1.txt        hash: 5891b5b522d5df086d0ff0b110fb
+└── nested
+    └── file-2.txt    hash: abc6fd595fc079d3114d4b71a4d8
+```
+
+A hash for the *directory* `nested` can be created by creating the `Directory`
+object:
+
+```json
+{
+  "directories": [],
+  "files": [{
+    "name": "file-2.txt",
+    "digest": "abc6fd595fc079d3114d4b71a4d8",
+    "size": 123,
+  }],
+  "symlink": [],
+}
+```
+
+And then hashing a serialised form of that data structure. We use the blake3
+hash of the canonical protobuf representation. Let's assume the hash was
+`ff0029485729bcde993720749232`.
+
+To create the directory object one layer up, we now refer to our `nested`
+directory object in `directories`, and to `file-1.txt` in `files`:
+
+```json
+{
+  "directories": [{
+    "name": "nested",
+    "digest": "ff0029485729bcde993720749232",
+    "size": 1,
+  }],
+  "files": [{
+    "name": "file-1.txt",
+    "digest": "5891b5b522d5df086d0ff0b110fb",
+    "size": 124,
+  }]
+}
+```
+
+This Merkle DAG of Directory objects, and flat store of blobs can be used to
+describe any file/directory/symlink inside a store path. Due to its
+content-addressed nature, it'll automatically deduplicate (re-)used (sub)directories,
+and allow substitution from any (untrusted) source.
+
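+The identifier computation used in the example above boils down to one line of
+Rust. This sketch assumes prost-generated message types and glosses over what
+exactly "canonical" means; the hex digests in the example are shortened,
+made-up values.
+
+```rust
+use prost::Message;
+
+/// The identifier of a Directory message: the blake3 hash of its (canonical)
+/// protobuf serialization.
+fn directory_digest<M: Message>(directory: &M) -> blake3::Hash {
+    blake3::hash(&directory.encode_to_vec())
+}
+```
+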
+The only thing still missing is the metadata to map/"mount" from the
+content-addressed world to a physical path.
+
+### PathInfo
+As most paths in the Nix store currently are input-addressed [^input-addressed],
+and the `tvix-castore` data model is also not intrinsically using NAR hashes,
+we need something mapping from an input-addressed "output path hash" (or a
+Nix-specific content-addressed path) to the contents in the `tvix-castore` world.
+
+That's what `PathInfo` provides. It embeds the root node (Directory, File or
+Symlink) at a given store path.
+
+The root node's `name` field is populated with the (base)name inside
+`/nix/store`, so `xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx-pname-1.2.3`.
+
+The `PathInfo` message also stores references to other store paths, and some
+more NARInfo-specific metadata (signatures, narhash, narsize).
+
+
+## API overview
+
+There are three different services:
+
+### BlobService
+`BlobService` can be used to store and retrieve blobs of data, used to host
+regular file contents.
+
+It is content-addressed, using [blake3][blake3]
+as a hashing function.
+
+As blake3 is a tree hash, there's an opportunity to do
+[verified streaming][bao] of parts of the file,
+which doesn't need to trust any more information than the root hash itself.
+Future extensions of the `BlobService` protocol will enable this.
+
+### DirectoryService
+`DirectoryService` allows lookups (and uploads) of `Directory` messages, and
+whole reference graphs of them.
+
+
+### PathInfoService
+The PathInfo service provides lookups from a store path hash to a `PathInfo`
+message.
+
+## Example flows
+
+Below are some common use cases of tvix-store, and how the different
+services are used.
+
+### Upload files and directories
+This is needed for `builtins.path` or `src = ./path` in Nix expressions (A), as
+well as for uploading build artifacts to a store (B).
+
+The path specified needs to be (recursively, BFS-style) traversed.
+ * All file contents need to be hashed with blake3, and submitted to the
+   *BlobService* if not already present.
+   A reference to them needs to be added to the parent Directory object that's
+   constructed.
+ * All symlinks need to be added to the parent directory they reside in.
+ * Whenever a Directory has been fully traversed, it needs to be uploaded to
+   the *DirectoryService* and a reference to it needs to be added to the parent
+   Directory object.
+
+Most of the hashing / directory traversal/uploading can happen in parallel,
+as long as Directory objects only refer to Directory objects and Blobs that
+have already been uploaded.
+
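+A much-simplified Rust sketch of this ingestion flow follows (depth-first
+rather than BFS, sequential, and with made-up stand-ins for the castore
+message types and the BlobService/DirectoryService calls):
+
+```rust
+use std::{fs, io, path::Path};
+
+/// Simplified stand-in for the castore `Directory` message.
+#[derive(Default)]
+struct Directory {
+    directories: Vec<(String, [u8; 32])>,      // (name, digest of child Directory)
+    files: Vec<(String, [u8; 32], u64, bool)>, // (name, blob digest, size, executable)
+    symlinks: Vec<(String, String)>,           // (name, target)
+}
+
+/// Hypothetical stand-ins for BlobService/DirectoryService uploads; a real
+/// DirectoryService would return the digest of the canonically-serialized message.
+fn upload_blob(_data: &[u8]) {}
+fn upload_directory(_dir: &Directory) -> [u8; 32] { [0; 32] }
+
+/// Post-order ingestion: children are uploaded before their parent Directory.
+fn ingest(path: &Path) -> io::Result<[u8; 32]> {
+    let mut dir = Directory::default();
+    for entry in fs::read_dir(path)? {
+        let entry = entry?;
+        let name = entry.file_name().to_string_lossy().into_owned();
+        let meta = entry.metadata()?; // does not follow symlinks
+        if meta.is_symlink() {
+            let target = fs::read_link(entry.path())?.to_string_lossy().into_owned();
+            dir.symlinks.push((name, target));
+        } else if meta.is_dir() {
+            let digest = ingest(&entry.path())?;
+            dir.directories.push((name, digest));
+        } else {
+            let data = fs::read(entry.path())?;
+            upload_blob(&data);
+            dir.files.push((name, *blake3::hash(&data).as_bytes(), data.len() as u64, false));
+        }
+    }
+    // The real data model additionally requires each list to be sorted by name,
+    // and tracks the executable bit; both are glossed over here.
+    Ok(upload_directory(&dir))
+}
+```
+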
+When reaching the root, a `PathInfo` object needs to be constructed.
+
+ * In the case of content-addressed paths (A), the name of the root node is
+   based on the NAR representation of the contents.
+   It might make sense to be able to offload the NAR calculation to the store,
+   which can cache it.
+ * In the case of build artifacts (B), the output path is input-addressed and
+   known upfront.
+
+Contrary to Nix, this has the advantage of not having to upload a lot of things
+to the store that didn't change.
+
+### Reading files from the store from the evaluator
+This is the case when `nixpkgs` is located in the store, or IFD in general.
+
+The store client asks the `PathInfoService` for the `PathInfo` of the output
+path in the request, and looks at the root node.
+
+If something other than the root of the store path is requested, like for
+example `maintainers/maintainer-list.nix`, the root_node Directory is inspected
+and potentially a chain of `Directory` objects requested from
+*DirectoryService* [^n+1query].
+
+When the desired file is reached, the *BlobService* can be used to read the
+contents of this file, and return it back to the evaluator.
+
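+A sketch of that lookup, reusing the simplified `Directory` stand-in from the
+ingestion sketch above and hypothetical fetch helpers in place of the
+DirectoryService/BlobService calls (symlinks are ignored, see the FUTUREWORK
+note below):
+
+```rust
+/// Hypothetical stand-ins for DirectoryService/BlobService lookups.
+fn fetch_directory(_digest: [u8; 32]) -> Option<Directory> { None }
+fn fetch_blob(_digest: [u8; 32]) -> Option<Vec<u8>> { None }
+
+/// Resolve e.g. "maintainers/maintainer-list.nix" against the root Directory
+/// (taken from the PathInfo), walking one Directory per path component, then
+/// read the blob behind the final FileNode.
+fn resolve(root_directory_digest: [u8; 32], path: &str) -> Option<Vec<u8>> {
+    let mut current = fetch_directory(root_directory_digest)?;
+    let mut components = path.split('/').peekable();
+    while let Some(component) = components.next() {
+        if components.peek().is_some() {
+            // Not the last component: descend into the child directory.
+            let (_, digest) = current.directories.iter().find(|(n, _)| n.as_str() == component)?;
+            current = fetch_directory(*digest)?;
+        } else {
+            // Last component: fetch the blob behind the FileNode.
+            let (_, digest, ..) = current.files.iter().find(|(n, ..)| n.as_str() == component)?;
+            return fetch_blob(*digest);
+        }
+    }
+    None
+}
+```
+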
+FUTUREWORK: define how importing from symlinks should/does work.
+
+Contrary to Nix, this has the advantage of not having to copy all of the
+contents of a store path to the evaluating machine, but really only fetching
+the files the evaluator currently cares about.
+
+### Materializing store paths on disk
+This is useful for people running a Tvix-only system, or running builds on a
+"Tvix remote builder" in its own mount namespace.
+
+In a system with Nix installed, we can't simply manually "extract" things to
+`/nix/store`, as Nix assumes it owns all writes to this location.
+In these use cases, we're probably better off exposing a tvix-store as a local
+binary cache (that's what `//tvix/nar-bridge-go` does).
+
+Assuming we are in an environment where we control `/nix/store` exclusively, a
+"realize to disk" would either "extract" things from the `tvix-store` to a
+filesystem, or expose a `FUSE`/`virtio-fs` filesystem.
+
+The latter is already implemented, and particularly interesting for (remote)
+build workloads, as build inputs can be realized on-demand, which saves copying
+around a lot of never-accessed files.
+
+In both cases, the API interactions are similar.
+ * The *PathInfoService* is asked for the `PathInfo` of the requested store path.
+ * If everything should be "extracted", the *DirectoryService* is asked for all
+   `Directory` objects in the closure, the file structure is created, all Blobs
+   are downloaded and placed in their corresponding location and all symlinks
+   are created accordingly.
+ * If this is a FUSE filesystem, we can decide to only request a subset,
+   similar to the "Reading files from the store from the evaluator" use case,
+   even though it might make sense to keep all Directory objects around.
+   (See the caveat in "Trust model" though!)
+
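+A sketch of the "extract everything" case referenced above (Unix-only, and
+again reusing the `Directory` stand-in and hypothetical fetch helpers from the
+earlier sketches):
+
+```rust
+use std::{fs, io, os::unix::fs::symlink, path::Path};
+
+/// Recursively recreate the Directory closure below `target` on disk.
+fn materialize(directory_digest: [u8; 32], target: &Path) -> io::Result<()> {
+    fs::create_dir_all(target)?;
+    let dir = fetch_directory(directory_digest).ok_or(io::ErrorKind::NotFound)?;
+    for (name, digest) in &dir.directories {
+        materialize(*digest, &target.join(name))?;
+    }
+    for (name, digest, _size, _executable) in &dir.files {
+        let data = fetch_blob(*digest).ok_or(io::ErrorKind::NotFound)?;
+        fs::write(target.join(name), data)?; // setting the executable bit is omitted
+    }
+    for (name, link_target) in &dir.symlinks {
+        symlink(link_target, target.join(name))?;
+    }
+    Ok(())
+}
+```
+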
+### Stores communicating with other stores
+The gRPC API exposed by the tvix-store allows composing multiple stores, and
+implementing caching strategies that store clients don't need to be aware
+of.
+
+ * For example, a caching strategy could have a fast local tvix-store, that's
+   asked first and filled with data from a slower remote tvix-store.
+
+ * Multiple stores could be asked for the same data, and whatever store returns
+   the right data first wins.
+
+
+## Trust model / Distribution
+As already described above, the only non-content-addressed service is the
+`PathInfo` service.
+
+This means all other messages (such as `Blob` and `Directory` messages) can be
+substituted from many different, untrusted sources/mirrors, which will make
+plugging in additional substitution strategies like IPFS or local network
+neighbors very simple. That's also why it lives in the `tvix-castore` crate.
+
+As for `PathInfo`, we don't specify an additional signature mechanism yet, but
+carry the NAR-based signatures from Nix along.
+
+This means that if we don't trust a remote `PathInfo` object, we currently need
+to "stream" the NAR representation to validate these signatures.
+
+However, the slow part is downloading NAR files, and considering we have
+more granularity available, we might only need to download some small blobs,
+rather than a whole NAR file.
+
+A future signature mechanism that only signs (parts of) the `PathInfo`
+message, which in turn only points to content-addressed data, will enable
+verified partial access into a store path, opening up opportunities for lazy
+filesystem access etc.
+
+
+
+[blake3]: https://github.com/BLAKE3-team/BLAKE3
+[bao]: https://github.com/oconnor663/bao
+[^input-addressed]: Nix hashes the ATerm representation of a .drv, after doing
+                    some replacements on referenced Input Derivations, to
+                    calculate output paths.
+[^n+1query]: This would expose an N+1 query problem. However, it's not a problem
+             in practice, as there's usually a "local" caching store in
+             the loop, and *DirectoryService* supports a recursive lookup for
+             all `Directory` children of a `Directory`.