TIP: This blog post was originally published as a design document for
[Nixery][] and is not written in the same style
as other blog posts.
Thanks to my colleagues at Google and various people from the Nix community for
reviewing this.
------
# Nixery: Improved Layering
**Authors**: tazjin@
**Reviewers**: so...@, en...@, pe...@
**Status**: Implemented
**Last Updated**: 2019-08-10
## Introduction
This document describes a design for an improved image layering method for use
in Nixery. The algorithm [currently used][grhmc] is designed for a slightly
different use-case and we can improve upon it by making use of more of the
available data.
## Background / Motivation
Nixery is a service that uses the [Nix package manager][nix] to build container
images (for runtimes such as Docker), that are served on-demand via the
container [registry protocols][]. A demo instance is available at
[nixery.dev][].
In practice this means users can simply issue a command such as `docker pull
nixery.dev/shell/git` and receive an image that was built ad-hoc containing a
shell environment and git.
One of the major advantages of building container images via Nix (as described
for `buildLayeredImage` in [this blog post][grhmc]) is that the
content-addressable nature of container image layers can be used to provide more
efficient caching characteristics (caching based on layer content) than what is
common with Dockerfiles and other image creation methods (caching based on layer
creation method).
However, this is constrained by the maximum number of layers supported in an
image (125). A naive approach such as putting each included package (any
library, binary, etc.) in its own layer quickly runs into this limitation due to
the large number of dependencies more complex systems tend to have. In addition,
users wanting to extend images created by Nixery (e.g. via `FROM nixery.dev/…`)
share this layer maximum with the created image - limiting extensibility if all
layers are used up by Nixery.
In theory the layering strategy of `buildLayeredImage` should already provide
good caching characteristics, but in practice we are seeing many images with
significantly more packages than the number of layers configured, leading to
more frequent cache-misses than desired.
The current implementation of `buildLayeredImage` inspects a graph of image
dependencies and determines the total number of references (direct & indirect)
to any node in the graph. It then sorts all dependencies by this popularity
metric and puts the first `n - 2` (for `n` being the maximum number of layers)
packages in their own layers, all remaining packages in one layer and the image
configuration in the final layer.
## Design / Proposal
## (Close-to) ideal layer-layout using more data
We start out by considering what a close to ideal layout of layers would look
like for a simple use-case.
![Ideal layout](/static/img/nixery/ideal_layout.webp)
In this example, counting the total number of references to each node in the
graph yields the following result:
| pkg | refs |
|-------|------|
| E | 3 |
| D | 2 |
| F | 2 |
| A,B,C | 1 |
Assuming we are constrained to 4 layers, the current algorithm would yield these layers:
```
L1: E
L2: D
L3: F
L4: A, B, C
```
The initial proposal for this design is that additional data should be
considered in addition to the total number of references, in particular a
distinction should be made between direct and indirect references. Packages that
are only referenced indirectly should be merged with their parents.
This yields the following table:
| pkg | direct | indirect |
|-------|--------|----------|
| E | 3 | 3 |
| D | 2 | 2 |
| F | *1* | 2 |
| A,B,C | 1 | 1 |
Despite having two indirect references, F is in fact only being referred to
once. Assuming that we have no other data available outside of this graph, we
have no reason to assume that F has any popularity outside of the scope of D.
This might yield the following layers:
```
L1: E
L2: D, F
L3: A
L4: B, C
```
D and F were grouped, while the top-level references (i.e. the packages
explicitly requested by the user) were split up.
An assumption is introduced here to justify this split: The top-level packages
is what the user is modifying directly, and those groupings are likely
unpredictable. Thus it is opportune to not group top-level packages in the same
layer.
This raises a new question: Can we make better decisions about where to split
the top-level?
## (Even closer to) ideal layering using (even) more data
So far when deciding layer layouts, only information immediately available in
the build graph of the image has been considered. We do however have much more
information available, as we have both the entire nixpkgs-tree and potentially
other information (such as download statistics).
We can calculate the total number of references to any derivation in nixpkgs and
use that to rank the popularity of each package. Packages within some percentile
can then be singled out as good candidates for a separate layer.
When faced with a splitting decision such as in the last section, this data can
aid the decision. Assume for example that package B in the above is actually
`openssl`, which is a very popular package. Taking this into account would
instead yield the following layers:
```
L1: E,
L2: D, F
L3: B,
L4: A, C
```
## Layer budgets and download size considerations
As described in the introduction, there is a finite amount of layers available
for each image (the “layer budget”). When calculating the layer distribution, we
might end up with the “ideal” list of layers that we would like to create. Using
our previous example:
```
L1: E,
L2: D, F
L3: A
L4: B
L5: C
```
If we only have a layer budget of 4 available, something needs to be merged into
the same layer. To make a decision here we could consider only the package
popularity, but there is in fact another piece of information that has not come
up yet: The actual size of the package.
Presumably a user would not mind downloading a library that is a few kilobytes
in size repeatedly, but they would if it was a 200 megabyte binary instead.
Conversely if a large binary was successfully cached, but an extremely popular
small library is not, the total download size might also grow to irritating
levels.
To avoid this we can calculate a merge rating:
merge_rating(pkg) = popularity_percentile(pkg) × size(pkg.subtree)
Packages with a low merge rating would be merged together before packages with
higher merge ratings.
## Implementation
There are two primary components of the implementation:
1. The layering component which, given an image specification, decides the image
layers.
2. The popularity component which, given the entire nixpkgs-tree, calculates the
popularity of packages.
## Layering component
It turns out that graph theory’s concept of [dominator trees][] maps reasonably
well onto the proposed idea of separating direct and indirect dependencies. This
becomes visible when creating the dominator tree of a simple example:
![Example without extra edges](/static/img/nixery/example_plain.webp)
Before calculating the dominator tree, we inspect each node and insert extra
edges from the root for packages that match a certain popularity or size
threshold. In this example, G is popular and an extra edge is inserted:
![Example with extra edges](/static/img/nixery/example_extra.webp)
Calculating the dominator tree of this graph now yields our ideal layer
distribution:
![Dominator tree of example](/static/img/nixery/dominator.webp)
The nodes immediately dominated by the root node can now be “harvested” as image
layers, and merging can be performed as described above until the result fits
into the layer budget.
To implement this, the layering component uses the [gonum/graph][] library which
supports calculating dominator trees. The program is fed with Nix’s
`exportReferencesGraph` (which contains the runtime dependency graph and runtime
closure size) as well as the popularity data and layer budget. It returns a list
of layers, each specifying the paths it should contain.
Nix invokes this program and uses the output to create a derivation for each
layer, which is then built and returned to Nixery as usual.
TIP: This is implemented in [`layers.go`][layers.go] in Nixery. The file starts
with an explanatory comment that talks through the process in detail.
## Popularity component
The primary issue in calculating the popularity of each package in the tree is
that we are interested in the runtime dependencies of a derivation, not its
build dependencies.
To access information about the runtime dependency, the derivation actually
needs to be built by Nix - it can not be inferred because Nix does not know
which store paths will still be referenced by the build output.
However for packages that are cached in the NixOS cache, we can simply inspect
the `narinfo`-files and use those to determine popularity.
Not every package in nixpkgs is cached, but we can expect all *popular* packages
to be cached. Relying on the cache should therefore be reasonable and avoids us
having to rebuild/download all packages.
The implementation will read the `narinfo` for each store path in the cache at a
given commit and create a JSON-file containing the total reference count per
package.
For the public Nixery instance, these popularity files will be distributed via a
GCS bucket.
TIP: This is implemented in [popcount][] in Nixery.
--------
Hopefully this detailed design review was useful to you. You can also watch [my
NixCon talk][talk] about Nixery for a review of some of this, and some demos.
[Nixery]: https://cs.tvl.fyi/depot/-/tree/tools/nixery
[grhmc]: https://grahamc.com/blog/nix-and-layered-docker-images
[Nix]: https://nixos.org/nix
[registry protocols]: https://github.com/opencontainers/distribution-spec/blob/master/spec.md
[nixery.dev]: https://nixery.dev
[dominator trees]: https://en.wikipedia.org/wiki/Dominator_(graph_theory)
[gonum/graph]: https://godoc.org/gonum.org/v1/gonum/graph
[layers.go]: https://cs.tvl.fyi/depot/-/blob/tools/nixery/layers/layers.go
[popcount]: https://cs.tvl.fyi/depot/-/tree/tools/nixery/popcount
[talk]: https://www.youtube.com/watch?v=pOI9H4oeXqA