Build references in derivations
===============================

This document describes how build references are calculated in Tvix. Build
references are used to determine which store paths should be available to a
builder during the execution of a build (i.e. the full build closure of a
derivation).

## String contexts in C++ Nix

In C++ Nix, each string value in the evaluator carries an optional so-called
"string context".

These contexts are themselves a list of strings that take one of the following
formats:

1. `!<output_name>!<drv_path>`

   This format describes a build reference to a specific output of a derivation.

2. `=<drv_path>`

   This format is used for a special case where a derivation attribute directly
   refers to a derivation path (e.g. by accessing `.drvPath` on a derivation).

   ```admonish note
   In C++ Nix this case is quite special and actually requires a store-database
   query during evaluation.
   ```

3. `<path>` - a nondescript store path input, usually a plain source file (e.g.
   from something like `src = ./.` or `src = ./foo.txt`).

   In the case of `unsafeDiscardOutputDependency` this is used to pass a raw
   derivation file, but *not* pull in its outputs.

Let's introduce names for these (in the same order) to make them easier to
reference below:

```rust
enum BuildReference {
    /// !<output_name>!<drv_path>
    SingleOutput(OutputName, DrvPath),

    /// =<drv_path>
    DrvClosure(DrvPath),

    /// <path>
    Path(StorePath),
}
```
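
As a rough illustration of how the three string formats map onto this enum, the
following sketch parses a single context string. It assumes that `OutputName`,
`DrvPath` and `StorePath` are plain strings, and `parse_context_element` is a
hypothetical helper name; neither C++ Nix nor Tvix necessarily structures this
logic in the same way.

```rust
// Hypothetical stand-ins for the real path types; plain strings keep the
// sketch self-contained.
type OutputName = String;
type DrvPath = String;
type StorePath = String;

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum BuildReference {
    /// !<output_name>!<drv_path>
    SingleOutput(OutputName, DrvPath),
    /// =<drv_path>
    DrvClosure(DrvPath),
    /// <path>
    Path(StorePath),
}

/// Parses one context string into a `BuildReference`.
/// Returns `None` if the string does not match any of the three formats.
fn parse_context_element(s: &str) -> Option<BuildReference> {
    if let Some(rest) = s.strip_prefix('!') {
        // `!<output_name>!<drv_path>`: the second `!` separates the fields.
        let (output, drv) = rest.split_once('!')?;
        if output.is_empty() || drv.is_empty() {
            return None;
        }
        Some(BuildReference::SingleOutput(output.to_string(), drv.to_string()))
    } else if let Some(drv) = s.strip_prefix('=') {
        // `=<drv_path>`
        if drv.is_empty() {
            return None;
        }
        Some(BuildReference::DrvClosure(drv.to_string()))
    } else if !s.is_empty() {
        // Anything else is treated as a plain `<path>`.
        Some(BuildReference::Path(s.to_string()))
    } else {
        None
    }
}
```

The sketch does not validate that the paths are well-formed store paths; it
only distinguishes the three formats by their prefix.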

String contexts are, broadly speaking, created whenever a string is the result
of a computation (e.g. string interpolation) that used a *computed* path or
derivation in any way.

Note: This explicitly does *not* include simply writing a literal string
containing a store path (whether valid or not). That is only permitted through
the `storePath` builtin.

## Derivation inputs

Based on the data above, the fields `inputDrvs` and `inputSrcs` of derivations
are populated in `builtins.derivationStrict`. This is the function wrapped by
`builtins.derivation`, which is not actually a builtin itself.

`inputDrvs` is represented by a map of derivation paths to the set of their
outputs that were referenced by the context.

TODO: What happens if the set is empty? Somebody claimed this means all outputs.

`inputSrcs` is represented by a set of paths.

These are populated by the above references as follows:

* `SingleOutput` entries are merged into `inputDrvs`
* `Path` entries are inserted into `inputSrcs`
* `DrvClosure` leads to a special store computation (`computeFSClosure`), which
  finds all paths referenced by the derivation and then inserts all of them into
  the fields as above (derivations with _all_ their outputs)

This is then serialised in the derivation and passed down the pipe.
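
The three rules above can be sketched roughly as follows. This is an
illustration under stated assumptions, not the actual implementation:
`DerivationInputs`, `ClosureEntry` and `populate_inputs` are hypothetical
names, paths are plain strings, and the store computation is passed in as a
closure instead of calling a real `computeFSClosure`.

```rust
use std::collections::{BTreeMap, BTreeSet};

#[derive(Debug, Clone, PartialEq)]
enum BuildReference {
    SingleOutput(String, String), // (output name, derivation path)
    DrvClosure(String),
    Path(String),
}

#[derive(Debug, Default, PartialEq)]
struct DerivationInputs {
    /// Derivation path -> set of referenced output names.
    input_drvs: BTreeMap<String, BTreeSet<String>>,
    /// Plain store paths.
    input_srcs: BTreeSet<String>,
}

/// One path returned by the (hypothetical) closure computation.
enum ClosureEntry {
    /// A derivation together with the names of all its outputs.
    Drv { path: String, outputs: Vec<String> },
    /// Any other store path.
    Src(String),
}

fn populate_inputs<F>(refs: &[BuildReference], compute_closure: F) -> DerivationInputs
where
    F: Fn(&str) -> Vec<ClosureEntry>,
{
    let mut inputs = DerivationInputs::default();
    for reference in refs {
        match reference {
            // Merged into `inputDrvs`, one output at a time.
            BuildReference::SingleOutput(output, drv) => {
                inputs.input_drvs.entry(drv.clone()).or_default().insert(output.clone());
            }
            // Inserted into `inputSrcs`.
            BuildReference::Path(path) => {
                inputs.input_srcs.insert(path.clone());
            }
            // Every path in the closure is inserted as above; derivations
            // are added with all of their outputs.
            BuildReference::DrvClosure(drv) => {
                for entry in compute_closure(drv) {
                    match entry {
                        ClosureEntry::Drv { path, outputs } => {
                            inputs.input_drvs.entry(path).or_default().extend(outputs);
                        }
                        ClosureEntry::Src(path) => {
                            inputs.input_srcs.insert(path);
                        }
                    }
                }
            }
        }
    }
    inputs
}
```

Ordered collections are used here so that the result is deterministic, which
matters once the inputs are serialised into the derivation.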

## Builtins interfacing with contexts

C++ Nix has several builtins that interface directly with string contexts:

* `unsafeDiscardStringContext`: throws away a string's string context (if
  present)
* `hasContext`: returns `true`/`false` depending on whether the string has
  context
* `unsafeDiscardOutputDependency`: drops dependencies on the *outputs* of a
  `.drv` in the context, passing only the literal `.drv` itself

  ```admonish note
  This is only used for special test-cases in nixpkgs, and deprecated Nix
  commands like `nix-push`.
  ```
* `getContext`: returns the string context in serialised form as a Nix attribute
  set
* `appendContext`: adds a given string context to the string in the same format
  as returned by `getContext`

Most of the string manipulation operations will propagate the context to the
result based on their parameters' contexts.
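
To make the effect of the simpler builtins concrete, here is a minimal sketch
that models a string context as a set of references. The function names mirror
the builtins but are hypothetical Rust helpers, and the modelling of
`unsafeDiscardOutputDependency` (turning a `=<drv_path>` entry into a plain
`<path>` entry) is an assumption based on the description of the context
formats in this document.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum BuildReference {
    SingleOutput(String, String), // (output name, derivation path)
    DrvClosure(String),
    Path(String),
}

/// Hypothetical alias: the context attached to one string.
type Context = HashSet<BuildReference>;

/// `hasContext`: true if the string carries any reference at all.
fn has_context(ctx: &Context) -> bool {
    !ctx.is_empty()
}

/// `unsafeDiscardStringContext`: the result carries no context.
fn unsafe_discard_string_context(_ctx: Context) -> Context {
    Context::new()
}

/// `unsafeDiscardOutputDependency`: keep the `.drv` file itself as a plain
/// path, but no longer pull in its closure. Other entries are unchanged.
fn unsafe_discard_output_dependency(ctx: Context) -> Context {
    ctx.into_iter()
        .map(|reference| match reference {
            BuildReference::DrvClosure(drv) => BuildReference::Path(drv),
            other => other,
        })
        .collect()
}
```

`getContext` and `appendContext` are left out, as they additionally require a
conversion to and from Nix attribute sets.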

## Placeholders

C++ Nix has `builtins.placeholder`, which given the name of an output (e.g.
`out`) creates a hashed string representation of that output name. If that
string is used anywhere in input attributes, the builder will replace it with
the actual name of the corresponding output of the current derivation.

C++ Nix does not use contexts for this. Instead, it blindly creates a rewrite
map from these placeholder strings to the names of all outputs, and runs the
output replacement logic on all environment variables it creates, attribute
files it passes, etc.
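
The rewrite step itself is plain string substitution. The sketch below assumes
the rewrite map has already been built, and does not attempt to reproduce how
C++ Nix hashes output names into placeholders; `apply_rewrites` and
`rewrite_environment` are hypothetical helper names.

```rust
use std::collections::HashMap;

/// Replaces every occurrence of each placeholder in `input`.
///
/// Assumes that no replacement value itself contains a placeholder, so the
/// (unspecified) iteration order of the map does not affect the result.
fn apply_rewrites(input: &str, rewrites: &HashMap<String, String>) -> String {
    let mut result = input.to_string();
    for (placeholder, replacement) in rewrites {
        result = result.replace(placeholder.as_str(), replacement);
    }
    result
}

/// Runs the replacement over all values of an environment map, without
/// looking at where a placeholder came from.
fn rewrite_environment(
    env: &HashMap<String, String>,
    rewrites: &HashMap<String, String>,
) -> HashMap<String, String> {
    env.iter()
        .map(|(key, value)| (key.clone(), apply_rewrites(value, rewrites)))
        .collect()
}
```

Because the substitution is applied to everything, no per-string bookkeeping
is needed, which is why this mechanism works without string contexts.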

## Tvix & string contexts

In the past, Tvix did not track string contexts in its evaluator at all. See
the historical section below for more information about that.

Tvix tracks string contexts in every `NixString` structure via a
`HashSet<BuildReference>` and offers an API to combine the references while
keeping the exact internal structure of that data private.
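
A much simplified picture of this idea is shown below. It is not the actual
Tvix `NixString` type: the struct layout and the method names
(`with_reference`, `concat`, `has_context`) are assumptions made for
illustration, and only show how an operation such as concatenation can union
the contexts of its operands without exposing the set itself.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum BuildReference {
    SingleOutput(String, String), // (output name, derivation path)
    DrvClosure(String),
    Path(String),
}

/// Simplified stand-in for a string value with an attached context.
#[derive(Debug, Clone, Default, PartialEq)]
struct NixString {
    contents: String,
    // Private to this type: callers combine contexts through methods only.
    context: HashSet<BuildReference>,
}

impl NixString {
    /// A string literal without any context.
    fn new(contents: &str) -> Self {
        NixString { contents: contents.to_string(), context: HashSet::new() }
    }

    /// A string that refers to exactly one build reference.
    fn with_reference(contents: &str, reference: BuildReference) -> Self {
        let mut context = HashSet::new();
        context.insert(reference);
        NixString { contents: contents.to_string(), context }
    }

    /// Concatenation propagates the union of both contexts to the result.
    fn concat(&self, other: &NixString) -> NixString {
        NixString {
            contents: format!("{}{}", self.contents, other.contents),
            context: self.context.union(&other.context).cloned().collect(),
        }
    }

    fn has_context(&self) -> bool {
        !self.context.is_empty()
    }
}
```

Using a set means that a reference which appears in both operands is only
recorded once in the result.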

## Historical attempt: Persistent reference tracking

We were investigating implementing a system which allows us to drop string
contexts in favour of reference scanning derivation attributes.

This means that instead of maintaining and passing around a string context data
structure in eval, we maintain a data structure of *known paths* from the same
evaluation elsewhere in Tvix, and scan each derivation attribute against this
set of known paths when instantiating derivations.
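
The scanning step can be sketched as follows. For brevity this uses a naive
substring search over every known path, where a real implementation would more
likely use a multi-pattern searcher; `scan_references` is a hypothetical
helper name.

```rust
use std::collections::BTreeSet;

/// Returns every known path that occurs somewhere in `attribute`.
///
/// Naive approach: one substring search per known path.
fn scan_references<'a>(attribute: &str, known_paths: &'a [String]) -> BTreeSet<&'a str> {
    known_paths
        .iter()
        .map(String::as_str)
        .filter(|path| attribute.contains(*path))
        .collect()
}
```

Note that this finds a path no matter how the string was produced, which is
exactly the property that makes string contexts unnecessary in this model.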

We believed we could take the stance that the system of string contexts as
implemented in C++ Nix is likely an implementation detail that should not be
leaking to the language surface as it does now.

### Tracking "known paths"

Every time a Tvix evaluation does something that causes a store interaction, a
"known path" is created. On the language surface, this is the result of one of:

1. Path literals (e.g. `src = ./.`).
2. Calls to `builtins.derivationStrict` yielding a derivation and its output
   paths.
3. Calls to `builtins.path`.

Whenever one of these occurs, some metadata that persists for the duration of
one evaluation should be created in Nix. This metadata needs to be available in
`builtins.derivationStrict`, and should be able to respond to these queries:

1. What is the set of all known paths? (used for e.g. instantiating an
   Aho-Corasick type string searcher)
2. What is the _type_ of a path? (derivation path, derivation output, source
   file)
3. What are the outputs of a derivation?
4. What is the derivation of an output?

These queries will need to be asked of the metadata when populating the
derivation fields.
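
One possible shape for this metadata, answering the four queries listed
above, is sketched here. The type and method names (`KnownPaths`, `PathType`,
`register_derivation` and so on) are hypothetical, and paths are represented
as plain strings.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
enum PathType {
    /// A derivation path, with the paths of its outputs.
    Derivation { outputs: Vec<String> },
    /// A derivation output, with its name and the derivation it belongs to.
    Output { name: String, derivation: String },
    /// A plain source file or directory.
    Source,
}

#[derive(Debug, Default)]
struct KnownPaths {
    paths: HashMap<String, PathType>,
}

impl KnownPaths {
    fn register_source(&mut self, path: &str) {
        self.paths.insert(path.to_string(), PathType::Source);
    }

    /// `outputs` is a list of (output name, output path) pairs.
    fn register_derivation(&mut self, drv_path: &str, outputs: &[(&str, &str)]) {
        for (name, out_path) in outputs {
            self.paths.insert(
                out_path.to_string(),
                PathType::Output { name: name.to_string(), derivation: drv_path.to_string() },
            );
        }
        let output_paths: Vec<String> = outputs.iter().map(|(_, p)| p.to_string()).collect();
        self.paths.insert(drv_path.to_string(), PathType::Derivation { outputs: output_paths });
    }

    /// Query 1: the set of all known paths.
    fn all_paths(&self) -> impl Iterator<Item = &str> + '_ {
        self.paths.keys().map(String::as_str)
    }

    /// Query 2: the type of a path.
    fn path_type(&self, path: &str) -> Option<&PathType> {
        self.paths.get(path)
    }

    /// Query 3: the outputs of a derivation.
    fn outputs_of(&self, drv_path: &str) -> Option<&[String]> {
        match self.paths.get(drv_path)? {
            PathType::Derivation { outputs } => Some(outputs.as_slice()),
            _ => None,
        }
    }

    /// Query 4: the derivation of an output.
    fn derivation_of(&self, output_path: &str) -> Option<&str> {
        match self.paths.get(output_path)? {
            PathType::Output { derivation, .. } => Some(derivation.as_str()),
            _ => None,
        }
    }
}
```

A single map keyed by path keeps all four queries cheap, at the cost of storing
the derivation path once per output.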

```admonish note
Depending on how we implement `builtins.placeholder`, it might be useful
to track created placeholders in this metadata, too.
```

### Context builtins

Context-reading builtins can be implemented in Tvix by adding `hasContext` and
`getContext` with the appropriate reference-scanning logic. However, we should
evaluate how these are used in nixpkgs and whether their uses can be removed.

Context-mutating builtins can be implemented by tracking their effects in the
value representation of Tvix, however we should consider not doing this at all.

`unsafeDiscardOutputDependency` should probably never be used and we should warn
or error on it.

`unsafeDiscardStringContext` is often used as a workaround for avoiding IFD in
inconvenient places (e.g. in the TVL depot pipeline generation). This is
unnecessary in Tvix. We should evaluate which other uses exist, and act on them
appropriately.

The initial danger with diverging here is that we might cause derivation hash
discrepancies between Tvix and C++ Nix, which can make initial comparisons of
derivations generated by the two systems difficult. If this occurs we need to
discuss how to approach it, but initially we will implement the mutating
builtins as no-ops.

### Why did this not work for us?

Nix has a feature to perform environmental checks of your derivation, e.g.
"these derivation outputs should not be referenced in this derivation". This was
introduced in Nix 2.2 by
https://github.com/NixOS/nix/commit/3cd15c5b1f5a8e6de87d5b7e8cc2f1326b420c88.

Unfortunately, this feature introduced a critical bug: any use of this feature
with contextful strings actually forces the derivation to depend, at least at
build time, on those specific paths. See
https://github.com/NixOS/nix/issues/4629.

For example, if you wanted to list a package in `disallowedReferences` and you
used a derivation as a path, you would actually register that derivation as an
input derivation of the derivation being checked.

This bug is still unfixed in Nix, and it seems that fixing it would require
introducing different ways to evaluate Nix derivations in order to preserve the
output path calculation for existing Nix expressions.

All of this would be fine if the bug behavior were uniform, in the sense that no
one tried to forcibly work around it. Since Nixpkgs 23.05, due to
https://github.com/NixOS/nixpkgs/pull/211783, this is no longer true.

Consider nixpkgs as the disjoint union of bootstrapping derivations $A$ and
`stdenv.mkDerivation`-built derivations $B$.

$A$ suffers from the bug, while $B$ does not, because `stdenv.mkDerivation`
forces the use of `unsafeDiscardStringContext` on those special checking fields.

This means that to build hash-compatible $A$ **and** $B$, we need to
distinguish $A$ from $B$. A lot of hacks could be imagined to solve this
problem.

Let's assume we have a solution to that problem. This means that we are able to
detect implicitly when a set of specific fields has been
`unsafeDiscardStringContext`-ed.

Thus, we could use that same trick to implement `unsafeDiscardStringContext`
for all fields.

Now, to implement `unsafeDiscardStringContext` in the persistent reference
tracking model, you would need to store a deny list of strings that should not
trigger a reference when scanning a derivation's parameters.

But assume you have something like:

```nix
derivation {
   buildInputs = [
     stdenv.cc
   ];

   disallowedReferences = [ stdenv.cc ];
}
```

If you naively unregister the `stdenv.cc` reference, this also hides the fact
that it is part of `buildInputs`. Nix would fail the derivation during the
environmental check, but Tvix would silently remove that reference.
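
The problem can be reproduced with a small sketch of such a naive scanner.
Everything here is hypothetical (`scan_with_discarded`, the attribute map and
the flat string values); the point is only that a list of discarded paths is
keyed by path, not by the attribute in which the path was discarded.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Scans all attribute values for known paths, ignoring every path that was
/// discarded anywhere in the derivation.
fn scan_with_discarded(
    attrs: &BTreeMap<String, String>,
    known_paths: &[String],
    discarded: &BTreeSet<String>,
) -> BTreeSet<String> {
    let mut references = BTreeSet::new();
    for value in attrs.values() {
        for path in known_paths {
            // The check cannot tell which attribute the path occurs in, so
            // discarding it for one attribute hides it in all of them.
            if value.contains(path.as_str()) && !discarded.contains(path) {
                references.insert(path.clone());
            }
        }
    }
    references
}
```

With a path present in both `buildInputs` and `disallowedReferences`, and that
path marked as discarded because of the latter, the scanner returns no
references at all, even though `buildInputs` still mentions it.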

Until proven otherwise, it seems highly difficult to obtain the fine-grained
information needed to prevent reference tracking for those specific fields.
This is not a failure of persistent reference tracking; it is an unresolved
critical bug in Nix that only nixpkgs has really worked around, and only for
`stdenv.mkDerivation`-based derivations.