docs(zseri): Add RFC document dbwospof.md r/3496

Title: distributed builds without single points-of-failure Change-Id: I54275ea75d7e29269162f6158743d64d17f79915 Reviewed-on: https://cl.tvl.fyi/c/depot/+/4550 Tested-by: BuildkiteCI Reviewed-by: zseri <zseri.devel@ytrizja.de> Autosubmit: zseri <zseri.devel@ytrizja.de>
author: zseri <zseri.devel@ytrizja.de> 2021-12-24T06·27+0100
committer: clbot <clbot@tvl.fyi> 2021-12-29T01·59+0000
commit: 393fbe81d68e1aa05c5ffdde740210f120a7f626 (patch)
tree: 7926448b325a8dce4bf9519463c2f9bab8f780a2
parent: b763f183f770a628fb9f338b8f52ba8185bccaa7 (diff)
1 files changed, 112 insertions, 0 deletions
diff --git a/users/zseri/dbwospof.md b/users/zseri/dbwospof.md
new file mode 100644
index 000000000000..f1d68cde069c
--- /dev/null
+++ b/users/zseri/dbwospof.md
@@ -0,0 +1,112 @@
+# distributed build without single points of failure
+
+## problem statement
+> If we want to distribute a build across several build nodes, and want to avoid
+> a "single point of failure", what needs to be considered?
+
+## motivation
+
+* distribute the build across several build nodes, because some packages take
+  extremely long to build
+  (e.g. `firefox`, `thunderbird`, `qtwebengine`, `webkitgtk`, ...)
+* avoid a centralised setup like e.g. with Hydra, because we want to keep using
+  an on-demand workflow as usual with Nix
+  (e.g. `nixos-rebuild` on each host when necessary).
+
+## list of abbreviations
+
+<dl>
+  <dt>CA</dt>		<dd>content-addressed</dd>
+  <dt>drv</dt>		<dd>derivation</dd>
+  <dt>FOD</dt>		<dd>fixed-output derivation</dd>
+  <dt>IHA</dt>		<dd>input-hash-addressed</dd>
+  <dt>inhash</dt>	<dd>input hash</dd>
+  <dt>outhash</dt>	<dd>output hash</dd>
+</dl>
+
+## build graph
+
+The build graph can't be easily distributed. It is instead left on the coordinator,
+and the build nodes just get individual build jobs, which just consist of
+derivations (and some information about how to get the inputs from some central
+or distributed store (e.g. Ceph), this may be transmitted "out of band").
+
+## inhash-exclusive
+
+It is necessary that each derivation build is exclusive in the sense that
+the same derivation is never build multiple times simultaneously, because
+this otherwise either wastes compute resources (obviously) and, in the case
+of non-deterministic builds, increases complexity
+(the store needs to decide which result to prefer, and the build nodes with
+"losing" build results need to pull the "winning" build results from the store,
+replacing the local version). Although this might be unnecessary in case
+of IHA drvs, enforcing it always reduces the amount of possible suprising
+results when mixing CA drvs and IHA drvs.
+
+## what can be meaningfully distributed
+
+The following is strongly opinionated, but I consider the following
+(based upon the original build graph implementation from yzix 12.2021):
+* We can't push the entire build graph to each build node, because they would
+  overlap 100%, and thus create extreme contention on the inhash-exclusive lock
+* We could try to split the build graph into multiple parts with independent
+  inputs (partitioning), but this can be really complex, and I'm not sure
+  if it is worth it... This also basically excludes the yzix node types
+  [ `Eval`, `AssertEqual` ] (should be done by the evaluator).
+  Implementing this option however would make an abort of a build graph
+  (the simple variant does not kill running tasks,
+  just stop new tasks from being scheduled) really hard, and complex to get right.
+* It does not make sense to distribute "node tasks" across build nodes which
+  almost exclusively interact with the store, and are not CPU-bound, but I/O bound.
+  This applies to most, if not all, useful FODs. It applies to the yzix node types
+  [ `Dump`, `UnDump`, `Fetch`, `Require` ] (should be performed by evaluator+store).
+* TODO: figure out how to do forced rebuilds (e.g. also prefer a node which is not
+  the build node of the previous realisation of that task)
+
+## coarse per-derivation workflow
+
+```
+    derivation
+    |        |
+    |        |
+   key     build
+    |        |
+    |        |
+    V        V
+  inhash  outhash
+    |     (either CA or IHA)
+     \      /
+      \    /
+       \  /
+    realisation
+```
+
+## build results
+
+Just for completeness, two build results are currently considered:
+
+* success: the build succeeded, and the result is uploaded to the central store
+* failure: the build failed (e.g. build process terminated via error exit code or was killed)
+* another case might be "partial": the build succeeded, but uploading to the
+  central store failed (the result is only available on the build node that built it).
+  This case is interesting, because we don't need to rerun the build, just the upload step
+  needs to be fixed/done semi-manually (e.g. maybe the central store ran out of storage,
+  or the network was unavailable)
+
+## build task queue
+
+It is naïve to think that something like a queue via `rabbitmq` (`AMQP`) or `MQTT`
+suffices, because some requirements are missing:
+
+1. some way to push build results to the clients, and these should be associated
+  to the build inputs (a hacky way might use multiple queues for that, e.g.
+  a `tasks` input queue and a `done` output queue).
+2. some way to lock "inhashes" (see section inhash-exclusive).
+
+The second point is somewhat easy to realise using `etcd`, and using the `watch`
+mechanism it can be used to simulate a queue, and the inhash-addressing of
+queued derivations can be seamlessly integrated.
+
+TODO: maybe we want to adjust the priorities of tasks in the queue, but Nix currently
+doesn't seem to do this, so consider this only when it starts to make sense as a
+performance or lag optimization.
author	zseri <zseri.devel@ytrizja.de>	2021-12-24T06·27+0100
committer	clbot <clbot@tvl.fyi>	2021-12-29T01·59+0000
commit	393fbe81d68e1aa05c5ffdde740210f120a7f626 (patch)
tree	7926448b325a8dce4bf9519463c2f9bab8f780a2
parent	b763f183f770a628fb9f338b8f52ba8185bccaa7 (diff)