author    Florian Klink <flokli@flokli.de>    2024-06-13T19·04+0300
committer clbot <clbot@tvl.fyi>               2024-06-14T08·00+0000
commit    6947dc4349fa85cb702f46acfe3255c907096b12 (patch)
tree      50d50d214a20901b29944e20b722489e0e1c3253 /tvix/castore
parent    adc7353bd16d07e13efb7f6a84b9f93601a07705 (diff)
chore(tvix/docs): move [ca]store docs to tvix/docs r/8269
Change-Id: Idd78ffae34b6ea7b93d13de73b98c61a348869fb
Reviewed-on: https://cl.tvl.fyi/c/depot/+/11808
Tested-by: BuildkiteCI
Reviewed-by: tazjin <tazjin@tvl.su>
Autosubmit: flokli <flokli@flokli.de>
Diffstat (limited to 'tvix/castore')
-rw-r--r--  tvix/castore/docs/blobstore-chunking.md  | 147
-rw-r--r--  tvix/castore/docs/blobstore-protocol.md  | 104
-rw-r--r--  tvix/castore/docs/data-model.md          |  50
-rw-r--r--  tvix/castore/docs/why-not-git-trees.md   |  57
4 files changed, 0 insertions, 358 deletions
diff --git a/tvix/castore/docs/blobstore-chunking.md b/tvix/castore/docs/blobstore-chunking.md
deleted file mode 100644
index df3c29680257..000000000000
--- a/tvix/castore/docs/blobstore-chunking.md
+++ /dev/null
@@ -1,147 +0,0 @@
-# BlobStore: Chunking & Verified Streaming
-
-`tvix-castore`'s BlobStore is a content-addressed storage system, using [blake3]
-as its hash function.
-
-Returned data is fetched by using the digest as lookup key, and can be verified
-to be correct by feeding the received data through the hash function and
-ensuring it matches the digest initially used for the lookup.
-
-This means data can also be downloaded from any untrusted third party, as the
-received data is validated to match the digest it was originally requested with.
-
-However, for larger blobs of data, having to download the entire blob at once is
-wasteful if we only care about part of the blob. Think about mounting a
-seekable data structure, like loop-mounting an .iso file, or doing partial reads
-in a large Parquet file, a column-oriented data format.
-
-> We want to have the possibility to *seek* into a larger file.
-
-This however shouldn't compromise on data integrity properties - we should not
-need to trust a peer we're downloading from to be "honest" about the partial
-data we're reading. We should be able to verify smaller reads.
-
-Especially when substituting from an untrusted third-party, we want to be able
-to detect quickly if that third-party is sending us wrong data, and terminate
-the connection early.
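
The lookup-and-verify flow above can be sketched in a few lines of Python. sha256 stands in for blake3 (which is not in the Python standard library), and `verify_blob` is a hypothetical name for illustration:

```python
import hashlib

def verify_blob(digest: bytes, data: bytes) -> bytes:
    """Return data only if it matches the digest it was requested under.

    The real system uses blake3; sha256 stands in here, since blake3
    is not in the Python standard library.
    """
    if hashlib.sha256(data).digest() != digest:
        raise ValueError("untrusted peer sent data not matching the digest")
    return data

# Any peer can serve the blob: the client verifies it locally.
blob = b"some file contents"
digest = hashlib.sha256(blob).digest()
assert verify_blob(digest, blob) == blob
```

A peer sending tampered data fails this check, and the connection can be terminated immediately.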
-
-## Chunking
-In content-addressed systems, this problem has historically been solved by
-breaking larger blobs into smaller chunks, which can be fetched individually,
-and making a hash of *this listing* the blob digest/identifier.
-
- - BitTorrent for example breaks files up into smaller chunks, and maintains
-   a list of sha1 digests for each of these chunks. Magnet links contain a
-   digest over this listing as an identifier (see [here for more
-   details][bittorrent-v2]).
-   With the identifier, a client can fetch the entire list, and then recursively
-   "unpack the graph" of nodes, until it ends up with a list of individual small
-   chunks, which can be fetched individually.
- - Similarly, IPFS with its IPLD model builds up a Merkle DAG, and uses the
-   digest of the root node as an identifier.
-
-These approaches solve the problem of being able to fetch smaller chunks in a
-trusted fashion. They can also do some deduplication, in case the same leaf
-nodes appear in multiple places.
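
A minimal sketch of such a chunk-listing scheme, with sha256 standing in for the hash functions these systems actually use (a real implementation would recurse for large listings):

```python
import hashlib

def chunk_digests(data: bytes, chunk_size: int) -> list[bytes]:
    """Split a blob into fixed-size chunks and hash each one."""
    return [hashlib.sha256(data[i:i + chunk_size]).digest()
            for i in range(0, len(data), chunk_size)]

def listing_identifier(digests: list[bytes]) -> bytes:
    """The blob identifier is a hash over the *listing* of chunk digests."""
    return hashlib.sha256(b"".join(digests)).digest()

blob = b"x" * 10_000
digests = chunk_digests(blob, 4096)   # 3 chunks: 4096 + 4096 + 1808 bytes
root = listing_identifier(digests)
# A client holding `root` can verify the listing, then each chunk individually.
```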
-
-However, they also have a big disadvantage: the chunking parameters, and the
-"topology" of the graph structure itself, "bleed" into the root hash of the
-entire data structure.
-
-Depending on the chunking parameters used, there are different representations
-for the same data, causing less data sharing/reuse in the overall system, in
-terms of how many chunks need to be downloaded vs. are already available
-locally, as well as how compactly data is stored on-disk.
-
-This can be worked around by agreeing on a single way of chunking, but it's
-not pretty and misses a lot of deduplication potential.
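
This is easy to demonstrate: under a naive chunk-listing scheme, identical data chunked with different parameters gets different identifiers (sha256 as a stand-in again):

```python
import hashlib

def root_digest(data: bytes, chunk_size: int) -> bytes:
    """Identifier of a blob under a naive chunk-listing scheme."""
    digests = [hashlib.sha256(data[i:i + chunk_size]).digest()
               for i in range(0, len(data), chunk_size)]
    return hashlib.sha256(b"".join(digests)).digest()

blob = b"the same data" * 1000
# Same data, different chunking parameters -> different identifiers,
# so peers using different parameters cannot share this blob.
assert root_digest(blob, 1024) != root_digest(blob, 4096)
```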
-
-### Chunking in Tvix' Blobstore
-tvix-castore's BlobStore uses a hybrid approach to eliminate some of the
-disadvantages, while still being content-addressed internally, with the
-highlighted benefits.
-
-It uses [blake3] as its hash function, and the blake3 digest of **the raw data
-itself** as an identifier (rather than some application-specific Merkle DAG that
-also embeds some chunking information).
-
-BLAKE3 is a tree hash where all left nodes are fully populated, contrary to
-conventional serial hash functions. To validate the hash of a node, one only
-needs the hashes of its (up to 2) children [^1], if any.
-
-This means one only needs the root digest to validate a construction, and these
-constructions can be sent [separately][bao-spec].
-
-This relieves us from having to encode more granular chunking into our data
-model / identifier upfront, and makes this mostly a transport/storage concern.
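
The following is a deliberately simplified Merkle-tree sketch, not BLAKE3's actual construction (which has its own chunk/parent derivation and keying), with sha256 as a stand-in; it shows how a "construction" of sibling hashes lets a single 1 KiB block be verified against nothing but the root digest:

```python
import hashlib

BLOCK = 1024  # 1 KiB logical blocks, as in BLAKE3

def hash_leaf(block: bytes) -> bytes:
    return hashlib.sha256(b"leaf" + block).digest()

def hash_parent(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(b"node" + left + right).digest()

def hash_levels(data: bytes) -> list[list[bytes]]:
    """All levels of the hash tree, from the leaves up to the root."""
    level = [hash_leaf(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]
    result = [level]
    while len(level) > 1:
        # Pair up nodes; an unpaired last node is carried up unchanged.
        level = [hash_parent(level[i], level[i + 1]) if i + 1 < len(level)
                 else level[i]
                 for i in range(0, len(level), 2)]
        result.append(level)
    return result

def construction(data: bytes, index: int) -> list[tuple[bool, bytes]]:
    """The sibling hashes needed to check block `index` against the root."""
    proof = []
    for level in hash_levels(data)[:-1]:
        sibling = index ^ 1
        if sibling < len(level):
            proof.append((sibling < index, level[sibling]))
        index //= 2
    return proof

def verify_block(root: bytes, block: bytes,
                 proof: list[tuple[bool, bytes]]) -> bool:
    """Verify one 1 KiB block using only the root digest and the proof."""
    h = hash_leaf(block)
    for sibling_is_left, sibling in proof:
        h = hash_parent(sibling, h) if sibling_is_left else hash_parent(h, sibling)
    return h == root
```

With the root digest pinned, any peer can serve blocks together with their constructions, and the client detects tampering immediately.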
-
-For some more description on the (remote) protocol, check
-`./blobstore-protocol.md`.
-
-#### Logical vs. physical chunking
-
-Due to the properties of the BLAKE3 hash function, we have logical blocks of
-1KiB, but this doesn't necessarily imply we need to restrict ourselves to these
-chunk sizes w.r.t. what "physical chunks" are sent over the wire between peers,
-or are stored on-disk.
-
-The only thing we need to read and verify an arbitrary byte range is the
-covering range of aligned 1K blocks, and a construction from the root digest
-down to those 1K blocks.
-
-Note the intermediate hash tree can be further trimmed, [omitting][bao-tree]
-lower parts of the tree while still providing verified streaming - at the cost
-of having to fetch larger covering ranges of aligned blocks.
-
-Let's pick an example. We identify each KiB by a number here for illustration
-purposes.
-
-Assuming we omit the last two layers of the hash tree, we end up with logical
-4KiB leaf chunks (`bao_shift` of `2`).
-
-For a blob of 14 KiB total size, we could fetch logical blocks `[0..=3]`,
-`[4..=7]`, `[8..=11]` and `[12..=13]` in an authenticated fashion:
-
-`[ 0 1 2 3 ] [ 4 5 6 7 ] [ 8 9 10 11 ] [ 12 13 ]`
-
-Assuming the server now informs us about the following physical chunking:
-
-```
-[ 0 1 ] [ 2 3 4 5 ] [ 6 ] [ 7 8 ] [ 9 10 11 12 13 ]
-```
-
-If our application now wants to arbitrarily read from 0 until 4 (inclusive):
-
-```
-[ 0 1 ] [ 2 3 4 5 ] [ 6 ] [ 7 8 ] [ 9 10 11 12 13 ]
- |-------------|
-
-```
-
-…we need to fetch physical chunks `[ 0 1 ]`, `[ 2 3 4 5 ]`, `[ 6 ]` and `[ 7 8 ]`.
-
-
-`[ 0 1 ]` and `[ 2 3 4 5 ]` are obvious: they contain the data we're
-interested in.
-
-We however also need to fetch the physical chunks `[ 6 ]` and `[ 7 8 ]`, so we
-can assemble `[ 4 5 6 7 ]` to verify both logical chunks:
-
-```
-[ 0 1 ] [ 2 3 4 5 ] [ 6 ] [ 7 8 ] [ 9 10 11 12 13 ]
-^       ^           ^     ^
-|----4KiB----|------4KiB-----|
-```
-
-Each physical chunk fetched can be validated to have the blake3 digest that was
-communicated upfront, and can be stored in a client-side cache/storage, so
-subsequent / other requests for the same data will be fast(er).
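
The chunk selection in this example can be expressed as a small helper (block ranges are inclusive, mirroring the notation above; the function name is made up for illustration):

```python
def covering_physical_chunks(read_blocks, logical_size, physical_chunks):
    """Which physical chunks must be fetched to verify a read?

    read_blocks: (first, last) 1 KiB block indices of the requested read.
    logical_size: blocks per logical (verifiable) chunk, e.g. 4 for a
                  bao_shift of 2.
    physical_chunks: list of (first, last) block ranges, as announced by
                     the server.
    """
    first, last = read_blocks
    # Round out to logical chunk boundaries: verification needs whole
    # logical chunks, not arbitrary byte ranges.
    lo = (first // logical_size) * logical_size
    hi = (last // logical_size + 1) * logical_size - 1
    # Fetch every physical chunk overlapping the logical covering range.
    return [c for c in physical_chunks if c[0] <= hi and c[1] >= lo]

physical = [(0, 1), (2, 5), (6, 6), (7, 8), (9, 13)]
# Reading blocks 0..=4 needs logical chunks [0..=3] and [4..=7], hence
# physical chunks [ 0 1 ], [ 2 3 4 5 ], [ 6 ] and [ 7 8 ]:
assert covering_physical_chunks((0, 4), 4, physical) == [(0, 1), (2, 5), (6, 6), (7, 8)]
```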
-
----
-
-[^1]: and the surrounding context, aka position inside the whole blob, which is available while verifying the tree
-[bittorrent-v2]: https://blog.libtorrent.org/2020/09/bittorrent-v2/
-[blake3]: https://github.com/BLAKE3-team/BLAKE3
-[bao-spec]: https://github.com/oconnor663/bao/blob/master/docs/spec.md
-[bao-tree]: https://github.com/n0-computer/bao-tree
diff --git a/tvix/castore/docs/blobstore-protocol.md b/tvix/castore/docs/blobstore-protocol.md
deleted file mode 100644
index 048cafc3d877..000000000000
--- a/tvix/castore/docs/blobstore-protocol.md
+++ /dev/null
@@ -1,104 +0,0 @@
-# BlobStore: Protocol / Composition
-
-This document describes the protocol that BlobStore uses to substitute blobs
-from other ("remote") BlobStores.
-
-How to come up with the blake3 digest of the blob to fetch is left to another
-layer in the stack.
-
-To put this into the context of Tvix as a Nix alternative, a blob represents an
-individual file inside a StorePath.
-In the Tvix Data Model, this is accomplished by having a `FileNode` (either the
-`root_node` in a `PathInfo` message, or an individual file inside a `Directory`
-message) encode a BLAKE3 digest.
-
-However, the whole infrastructure can be applied to other use cases requiring
-exchange/storage of, or access to, data whose blake3 digest is known.
-
-## Protocol and Interfaces
-As an RPC protocol, BlobStore currently uses gRPC.
-
-On the Rust side of things, every blob service implements the
-[`BlobService`](../src/blobservice/mod.rs) async trait, which isn't
-gRPC-specific.
-
-This `BlobService` trait provides functionality to check for existence of Blobs,
-read from blobs, and write new blobs.
-It also provides a method to ask for more granular chunks if they are available.
-
-In addition to some in-memory, on-disk and (soon) object-storage-based
-implementations, we also have a `BlobService` implementation that talks to a
-gRPC server, as well as a gRPC server wrapper component, which provides a gRPC
-service for anything implementing the `BlobService` trait.
-
-This makes it very easy to talk to a remote `BlobService`, which does not even
-need to be written in the same language, as long as it speaks the same gRPC
-protocol.
-
-It also puts very few requirements on someone implementing a new
-`BlobService`, or on what its internal storage or chunking algorithm looks like.
-
-The gRPC protocol is documented in `../protos/rpc_blobstore.proto`.
-Contrary to the `BlobService` trait, it does not have any options for seeking/
-ranging, as it's more desirable to provide this through chunking (see also
-`./blobstore-chunking.md`).
-
-## Composition
-Different `BlobStore`s are supposed to be "composed"/"layered" to express
-caching, or multiple local and remote sources.
-
-The fronting interface can be the same; it'd just be multiple "tiers" that can
-respond to requests, depending on where the data resides. [^1]
-
-This makes it very simple for consumers, as they don't need to be aware of the
-entire substitutor config.
-
-The flexibility of this doesn't need to be exposed to the user in the default
-case; in most cases we should be fine with some form of on-disk storage and a
-bunch of substituters with different priorities.
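
A sketch of such a layered composition, with dict-backed stand-in tiers (all names hypothetical):

```python
class MemTier:
    """Dict-backed stand-in for a single blob store tier."""
    def __init__(self, initial=None):
        self.blobs = dict(initial or {})
    def get(self, digest):
        return self.blobs.get(digest)
    def put(self, digest, data):
        self.blobs[digest] = data

class LayeredBlobService:
    """Try each tier in order; on a hit, populate the faster tiers in front."""
    def __init__(self, *tiers):
        self.tiers = tiers  # e.g. (in_memory, on_disk, remote_grpc)
    def get(self, digest):
        for i, tier in enumerate(self.tiers):
            data = tier.get(digest)
            if data is not None:
                for faster in self.tiers[:i]:
                    faster.put(digest, data)  # cache for subsequent reads
                return data
        return None

local = MemTier()
remote = MemTier({b"some-digest": b"blob contents"})
store = LayeredBlobService(local, remote)
# Consumers only see `store`; the tiering is invisible to them.
assert store.get(b"some-digest") == b"blob contents"
assert local.get(b"some-digest") == b"blob contents"  # now cached locally
```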
-
-### gRPC Clients
-Clients are encouraged to always read blobs in a chunked fashion (asking for a
-list of chunks for a blob via `BlobService.Stat()`, then fetching chunks via
-`BlobService.Read()` as needed), instead of directly reading the entire blob via
-`BlobService.Read()`.
-
-In a composition setting, this provides an opportunity for caching, and avoids
-downloading some chunks if they're already present locally (for example, because
-they were already downloaded by reading from a similar blob earlier).
-
-It also removes the need for seeking to be part of the gRPC protocol
-altogether, as chunks are supposed to be "reasonably small" [^2].
-
-There's some further optimization potential: a `BlobService.Stat()` request
-could tell the server it's happy with very small blobs just being inlined in
-an additional field in the response, which would allow clients to populate
-their local chunk store in a single roundtrip.
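
The chunked read pattern described in this section can be sketched as follows; `stat` and `read` stand in for the `BlobService.Stat()`/`BlobService.Read()` RPCs, and sha256 stands in for blake3:

```python
import hashlib

def read_blob(digest, stat, read, local_chunks):
    """Chunked client read, as encouraged above (sketch).

    stat(digest) returns a list of chunk digests, or None if the server
    knows no more granular chunking; read(digest) returns raw bytes.
    local_chunks is a digest -> bytes mapping acting as the local cache.
    """
    chunk_digests = stat(digest)
    if chunk_digests is None:
        return read(digest)  # no chunk list: read the whole blob at once
    parts = []
    for cd in chunk_digests:
        data = local_chunks.get(cd)
        if data is None:  # only fetch chunks not already present locally
            data = read(cd)
            if hashlib.sha256(data).digest() != cd:  # sha256 stands in for blake3
                raise ValueError("chunk does not match its digest")
            local_chunks[cd] = data
        parts.append(data)
    return b"".join(parts)
```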
-
-## Verified Streaming
-As already described in `./blobstore-chunking.md`, the physical chunk
-information sent in a `BlobService.Stat()` response is still sufficient to fetch
-in an authenticated fashion.
-
-The exact protocol and formats are still a bit in flux, but here are some notes:
-
- - The `BlobService.Stat()` request gets a `send_bao` field (bool), signalling
-   that a [BAO][bao-spec] should be sent. This could also be a `bao_shift`
-   integer, signalling how detailed (down to the leaf chunks) it should go.
-   The exact format (and request fields) still need to be defined, edef has some
-   ideas around omitting some intermediate hash nodes over the wire and
-   recomputing them, reducing size by another ~50% over [bao-tree].
- - `BlobService.Stat()` response gets some bao-related fields (`bao_shift`
-   field, signalling the actual format/shift level the server replies with, the
-   actual bao, and maybe some format specifier).
-   It would be nice to also be compatible with the baos used by [iroh], so we
-   can provide an implementation using it too.
-
----
-
-[^1]: We might want to have some backchannel, so it becomes possible to provide
-      feedback to the user that something is downloaded.
-[^2]: Something between 512K-4M, TBD.
-[bao-spec]: https://github.com/oconnor663/bao/blob/master/docs/spec.md
-[bao-tree]: https://github.com/n0-computer/bao-tree
-[iroh]: https://github.com/n0-computer/iroh
diff --git a/tvix/castore/docs/data-model.md b/tvix/castore/docs/data-model.md
deleted file mode 100644
index 5e6220cc23fa..000000000000
--- a/tvix/castore/docs/data-model.md
+++ /dev/null
@@ -1,50 +0,0 @@
-# Data model
-
-This provides some more notes on the fields used in castore.proto.
-
-See `//tvix/store/docs/api.md` for the full context.
-
-## Directory message
-`Directory` messages use the blake3 hash of their canonical protobuf
-serialization as their identifier.
-
-A `Directory` message contains three lists, `directories`, `files` and
-`symlinks`, holding `DirectoryNode`, `FileNode` and `SymlinkNode` messages
-respectively. They describe all the direct child elements that are contained in
-a directory.
-
-All three message types have a `name` field, specifying the (base)name of the
-element (which MUST NOT contain slashes or null bytes, and MUST NOT be '.' or '..').
-For reproducibility reasons, the lists MUST be sorted by that name and the
-names MUST be unique across all three lists.
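
These rules can be sketched as a small validation helper (names as bytes; the function name is made up for illustration):

```python
def validate_directory_names(directories, files, symlinks):
    """Check the Directory naming rules described above: no slashes or
    null bytes, not '.' or '..', each list sorted by name, and names
    unique across all three lists."""
    for name in [*directories, *files, *symlinks]:
        if b"/" in name or b"\x00" in name or name in (b".", b".."):
            raise ValueError(f"invalid name: {name!r}")
    for names in (directories, files, symlinks):
        if list(names) != sorted(names):
            raise ValueError("each list MUST be sorted by name")
    combined = [*directories, *files, *symlinks]
    if len(set(combined)) != len(combined):
        raise ValueError("names MUST be unique across all three lists")

# A valid Directory: sorted lists, unique names across all three.
validate_directory_names([b"bin", b"lib"], [b"README.md"], [b"current"])
```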
-
-In addition to the `name` field, the various *Node messages have the following
-fields:
-
-## DirectoryNode
-A `DirectoryNode` message represents a child directory.
-
-It has a `digest` field, which points to the identifier of another `Directory`
-message, making a `Directory` a Merkle tree (or strictly speaking, a DAG, as
-two elements pointing to a child directory with the same contents would point
-to the same `Directory` message).
-
-There's also a `size` field, containing the (total) number of all child
-elements in the referenced `Directory`, which helps for inode calculation.
-
-## FileNode
-A `FileNode` message represents a child (regular) file.
-
-Its `digest` field contains the blake3 hash of the file contents. It can be
-looked up in the `BlobService`.
-
-The `size` field contains the size of the blob the `digest` field refers to.
-
-The `executable` field specifies whether the file should be marked as
-executable or not.
-
-## SymlinkNode
-A `SymlinkNode` message represents a child symlink.
-
-In addition to the `name` field, the only other field is `target`, a string
-containing the target of the symlink.
diff --git a/tvix/castore/docs/why-not-git-trees.md b/tvix/castore/docs/why-not-git-trees.md
deleted file mode 100644
index 4a12b4ef5554..000000000000
--- a/tvix/castore/docs/why-not-git-trees.md
+++ /dev/null
@@ -1,57 +0,0 @@
-## Why not git tree objects?
-
-We've been experimenting with (some variations of) the git tree and object
-format, and ultimately decided against using it as an internal format, and
-instead adapted the one documented in the other documents here.
-
-While the tvix-store API protocol shares some similarities with the format used
-in git for trees and objects, the git one has shown some significant
-disadvantages:
-
-### The binary encoding itself
-
-#### trees
-The git tree object format is a very binary, error-prone and
-"made-to-be-read-and-written-from-C" format.
-
-Tree objects are a combination of null-terminated strings, and fields of known
-length. References to other tree objects use the literal sha1 hash of another
-tree object in this encoding.
-Extensions of the format/changes are very hard to do right, because parsers are
-not aware they might be parsing something different.
-
-The tvix-store protocol uses a canonical protobuf serialization, and uses
-the [blake3][blake3] hash of that serialization to point to other `Directory`
-messages.
-It's both compact and has a wide range of encoder and decoder libraries in
-many programming languages.
-The choice of protobuf makes it easy to add new fields, and lets old clients
-detect that unknown fields are present [^adding-fields].
-
-#### blob
-On disk, git blob objects start with a "blob" prefix, then the size of the
-payload, and then the data itself. The hash of a blob is the literal sha1sum
-over all of this - which makes it something very git-specific to request.
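
The git-specific framing is easy to see: a git blob's sha1 covers a `blob <size>\0` header, so it differs from a plain hash of the contents:

```python
import hashlib

def git_blob_sha1(data: bytes) -> bytes:
    """sha1 of a git blob object: the hash covers a git-specific
    "blob <size>\\0" header, not just the contents."""
    header = b"blob " + str(len(data)).encode() + b"\x00"
    return hashlib.sha1(header + data).digest()

data = b"hello world\n"
# The git identifier differs from a plain hash of the contents, so it
# can't be used to ask non-git systems for the same data.
assert git_blob_sha1(data) != hashlib.sha1(data).digest()
```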
-
-tvix-store simply uses the [blake3][blake3] hash of the literal contents
-when referring to a file/blob, which makes it very easy to ask other data
-sources for the same data, as no git-specific payload is included in the hash.
-This also plays very well together with things like [iroh][iroh-discussion],
-which plans to provide a way to substitute (large) blobs by their blake3 hash
-over the IPFS network.
-
-In addition to that, [blake3][blake3] makes it possible to do
-[verified streaming][bao], as already described in other parts of the
-documentation.
-
-The git tree object format uses sha1 both for references to other trees and
-hashes of blobs, which isn't really a hash function to fundamentally base
-everything on in 2023.
-The [migration to sha256][git-sha256] also has been dead for some years now,
-and it's unclear what a "blake3" version of this would even look like.
-
-[bao]: https://github.com/oconnor663/bao
-[blake3]: https://github.com/BLAKE3-team/BLAKE3
-[git-sha256]: https://git-scm.com/docs/hash-function-transition/
-[iroh-discussion]: https://github.com/n0-computer/iroh/discussions/707#discussioncomment-5070197
-[^adding-fields]: Obviously, adding new fields will change hashes, but it's something that's easy to detect.
\ No newline at end of file