From 466e6dc2651c3c8536e8570d74bee401c842adb1 Mon Sep 17 00:00:00 2001
From: Vincent Ambo <mail@tazj.in>
Date: Tue, 7 Mar 2023 19:54:57 +0300
Subject: docs(tvix/eval): document inner workings of the new VM loop

Change-Id: I8224bf039f739c401900b5a2ddc839810c87cf6e
Reviewed-on: https://cl.tvl.fyi/c/depot/+/8226
Tested-by: BuildkiteCI
Reviewed-by: Adam Joseph <adam@westernsemico.com>
---
 tvix/eval/docs/vm-loop.md | 329 +++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 284 insertions(+), 45 deletions(-)

(limited to 'tvix/eval')

diff --git a/tvix/eval/docs/vm-loop.md b/tvix/eval/docs/vm-loop.md
index 6bf16fc533..bbe02a0866 100644
--- a/tvix/eval/docs/vm-loop.md
+++ b/tvix/eval/docs/vm-loop.md
@@ -1,68 +1,295 @@
 tvix-eval VM loop
 =================
 
-Date: 2023-02-14
-Author: tazjin
+This document describes the new tvix-eval VM execution loop implemented in the
+chain focusing around cl/8104.
 
 ## Background
 
-The VM loop implemented in `src/vm.rs` currently has a couple of functions:
+The VM loop implemented in Tvix prior to cl/8104 had several functions:
 
-1. Advance the instruction pointer and execute instructions, in a loop
-   (unsurprisingly).
+1. Advancing the instruction pointer for a chunk of Tvix bytecode and
+   executing instructions in a loop until a result was yielded.
 
-2. Tracking Nix call frames as functions/thunks are entered/exited.
+2. Tracking Nix call frames as functions/thunks were entered/exited.
 
-3. Catch trampoline requests returned from instruction executions, and
-   resuming of executions afterwards.
+3. Catching trampoline requests returned from instructions to force suspended
+   thunks without increasing stack size *where possible*.
 
-4. Invoking the inner trampoline loop, handling actions and
-   continuations from the trampoline.
+4. Handling trampolines through an inner trampoline loop, switching between a
+   code execution mode and execution of subsequent trampolines.
 
-The current implementation of the trampoline logic was added on to the
-existing VM, which recursed for thunk forcing. As a result, it is
-currently a little difficult to understand what exactly is going on in
-the VM loop and how the trampoline logic works.
+This implementation of the trampoline logic was added on to the existing VM,
+which previously always recursed for thunk forcing. There are some cases (for
+example values that need to be forced *inside* of the execution of a builtin)
+where trampolines could not previously be used, and the VM recursed anyways.
 
-This has also led to several bugs, for example: b/251, b/246, b/245,
-and b/238.
+As a result of this trampoline logic being added "on top" of the existing VM
+loop the code became quite difficult to understand. This led to several bugs,
+for example: b/251, b/246, b/245, and b/238.
 
-These bugs are very tricky to deal with, as we have to try and make
-the VM do things that are somewhat difficult to fit into the current
-model. We could of course keep extending the trampoline logic to
-accomodate all sorts of concepts (such as finalisers), but that seems
-difficult.
+These bugs were tricky to deal with, as we had to try and make the VM do
+things that are somewhat difficult to fit into its model. We could of course
+keep extending the trampoline logic to accommodate all sorts of concepts (such
+as finalisers), but that seems like it does not solve the root problem.
 
-There are also additional problems, such as forcing inside of builtin
-implementations, which leads to a situation where we would like to
-suspend the builtin and return control flow to the VM until the value
-is forced.
+## New VM loop
 
-## Proposal
+In cl/8104, a unified new solution is implemented with which the VM is capable
+of evaluating everything without increasing the call stack size.
 
-We rewrite parts of the VM loop to elevate trampolining and
-potentially other modes of execution to a top-level concept of the VM.
+This is done by introducing a new frame stack in the VM, on which execution
+frames are enqueued that are either:
 
-We achieve this by replacing the current concept of call frames with a
-"VM context" (naming tbd), which can represent multiple different
-states of the VM:
+1. A bytecode frame, consisting of Tvix bytecode that evaluates compiled Nix
+   code.
+2. A generator frame, consisting of some VM logic implemented in pure Rust
+   code that can be *suspended* when it hits a point where the VM would
+   previously need to recurse.
 
-1. Tvix code execution, equivalent to what is currently a call frame,
-   executing bytecode until the instruction pointer reaches the end of
-   a chunk, then returning one level up.
+We do this by making use of the `async` *keyword* in Rust, but notably
+*without* introducing asynchronous I/O or concurrency in tvix-eval (the
+complexity of which is currently undesirable for us).
 
-2. Trampolining the forcing of a thunk, equivalent to the current
-   trampolining logic.
+Specifically, when writing a Rust function that uses the `async` keyword, such
+as:
 
-3. Waiting for the result of a trampoline, to ensure that in case of
-   nested thunks all representations are correctly transformed.
+```rust
+async fn some_builtin(input: Value) -> Result<Value, ErrorKind> {
+  let mut out = NixList::new();
 
-4. Trampolining the execution of a builtin. This is not in scope for
-   the initial implementation, but it should be conceptually possible.
+  for element in input.to_list()? {
+    let result = do_something_that_requires_the_vm(element).await;
+    out.push(result);
+  }
 
-It is *not* in scope for this proposal to enable parallel suspension
-of thunks. This is a separate topic which is discussed in [waiting for
-the store][wfs].
+  Ok(out)
+}
+```
+
+The compiler actually generates a state-machine under-the-hood which allows
+the execution of that function to be *suspended* whenever it hits an `await`.
+
+We use the [`genawaiter`][] crate that gives us a data structure and simple
+interface for getting instances of these state machines that can be stored in
+a struct (in our case, a *generator frame*).
+
+The execution of the VM then becomes the execution of an *outer loop*, which
+is responsible for selecting the next generator frame to execute, and two
+*inner loops*, which drive the execution of a bytecode frame or generator
+frame forward until it either yields a value or asks to be suspended in favour
+of another frame.
+
+All "communication" between frames happens solely through values left on the
+stack: Whenever a frame of either type runs to completion, it is expected to
+leave a *single* value on the stack. It follows that the whole VM, upon
+completion of the last (or initial, depending on your perspective) frame
+yields its result as the return value.
+
+The core of the VM restructuring is cl/8104, unfortunately one of the largest
+single commit changes we've had to make yet, as it touches pretty much all
+areas of tvix-eval. The introduction of the generators and the
+message/response system we built to request something from the VM, suspend a
+generator, and wait for the return is in cl/8148.
+
+The next sections describe in detail how the three different loops work.
+
+### Outer VM loop
+
+The outer VM loop is responsible for selecting the next frame to run, and
+dispatching it correctly to inner loops, as well as determining when to shut
+down the VM and return the final result.
+
+```
+                          ╭──────────────────╮
+                 ╭────────┤ match frame kind ├──────╮
+                 │        ╰──────────────────╯      │
+                 │                                  │
+    ┏━━━━━━━━━━━━┷━━━━━┓                ╭───────────┴───────────╮
+───►┃ frame_stack.pop()┃                ▼                       ▼
+    ┗━━━━━━━━━━━━━━━━━━┛       ┏━━━━━━━━━━━━━━━━┓      ┏━━━━━━━━━━━━━━━━━┓
+                 ▲             ┃ bytecode frame ┃      ┃ generator frame ┃
+                 │             ┗━━━━━━━━┯━━━━━━━┛      ┗━━━━━━━━┯━━━━━━━━┛
+                 │[yes, cont.]          │                       │
+                 │                      ▼                       ▼
+    ┏━━━━━━━━┓   │             ╔════════════════╗      ╔═════════════════╗
+◄───┨ return ┃   │             ║ inner bytecode ║      ║ inner generator ║
+    ┗━━━━━━━━┛   │             ║      loop      ║      ║      loop       ║
+        ▲        │             ╚════════╤═══════╝      ╚════════╤════════╝
+        │   ╭────┴─────╮                │                       │
+        │   │ has next │                ╰───────────┬───────────╯
+   [no] ╰───┤  frame?  │                            │
+            ╰────┬─────╯                            ▼
+                 │                         ┏━━━━━━━━━━━━━━━━━┓
+                 │                         ┃ frame completed ┃
+                 ╰─────────────────────────┨  or suspended   ┃
+                                           ┗━━━━━━━━━━━━━━━━━┛
+```
+
+Initially, the VM always pops a frame from the frame stack and then inspects
+the type of frame it found. As a consequence the next frame to execute is
+always the frame at the top of the stack, and setting up a VM initially for
+code execution is done by leaving a bytecode frame with the code to execute on
+the stack and passing control to the outer loop.
+
+Control is dispatched to either of the inner loops (depending on the type of
+frame) and the cycle continues once they return.
+
+When an inner loop returns, it has either finished its execution (and left its
+result value on the *value stack*), or its frame has requested to be
+suspended.
+
+Frames request suspension by re-enqueueing *themselves* through VM helper
+methods, and then leaving the frame they want to run *on top* of themselves in
+the frame stack before yielding control back to the outer loop.
+
+The inner control loops inform the outer loops about whether the frame has
+been *completed* or *suspended* by returning a boolean.
+
+### Inner bytecode loop
+
+The inner bytecode loop drives the execution of some Tvix bytecode by
+continously looking at the next instruction to execute, and dispatching to the
+instruction handler.
+
+```
+   ┏━━━━━━━━━━━━━┓
+◄──┨ return true ┃
+   ┗━━━━━━━━━━━━━┛
+          ▲
+     ╔════╧═════╗
+     ║ OpReturn ║
+     ╚══════════╝
+          ▲
+          ╰──┬────────────────────────────╮
+             │                            ▼
+             │                 ╔═════════════════════╗
+    ┏━━━━━━━━┷━━━━━┓           ║ execute instruction ║
+───►┃ inspect next ┃           ╚══════════╤══════════╝
+    ┃  instruction ┃                      │
+    ┗━━━━━━━━━━━━━━┛                      │
+             ▲                      ╭─────┴─────╮
+             ╰──────────────────────┤ suspends? │
+                       [no]         ╰─────┬─────╯
+                                          │
+                                          │
+   ┏━━━━━━━━━━━━━━┓                       │
+◄──┨ return false ┃───────────────────────╯
+   ┗━━━━━━━━━━━━━━┛              [yes]
+```
+
+With this refactoring, the compiler now emits a special `OpReturn` instruction
+at the end of bytecode chunks. This is a signal to the runtime that the chunk
+has completed and that its current value should be returned, without having to
+perform instruction pointer arithmetic.
+
+When `OpReturn` is encountered, the inner bytecode loop returns control to the
+outer loop and informs it (by returning `true`) that the bytecode frame has
+completed.
+
+Any other instruction may also request a suspension of the bytecode frame (for
+example, instructions that need to force a value). In this case the inner loop
+is responsible for setting up the frame stack correctly, and returning `false`
+to inform the outer loop of the suspension
+
+### Inner generator loop
+
+The inner generator loop is responsible for driving the execution of a
+generator frame by continously calling [`Gen::resume`][] until it requests a
+suspension (as a result of which control is returned to the outer loop), or
+until the generator is done and yields a value.
+
+```
+   ┏━━━━━━━━━━━━━┓       [Done]
+◄──┨ return true ┃ ◄───────────────────╮
+   ┗━━━━━━━━━━━━━┛                     │
+                                       │
+                                       │
+                    ╭──────────────────┴─────────╮
+            ╭───────┤ inspect generator response │
+            │       ╰──────────────────┬─────────╯
+   ┏━━━━━━━━┷━━━━━━━━┓                 │
+──►┃ gen.resume(msg) ┃                 │[Yielded]
+   ┗━━━━━━━━━━━━━━━━━┛                 │
+            ▲                   ╭──────┴─────╮
+            │          [yes]    │ same-frame │
+            ╰───────────────────┤  request?  │
+                                ╰──────┬─────╯
+   ┏━━━━━━━━━━━━━━┓                    │
+◄──┨ return false ┃ ◄──────────────────╯
+   ┗━━━━━━━━━━━━━━┛                [no]
+```
+
+On each execution of a generator frame, `resume_with` is called with a
+[`VMResponse`][] (i.e. a message *from* the VM *to* the generator). For a newly
+created generator, the initial message is just `Empty`.
+
+A generator may then respond by signaling that it has finished execution
+(`Done`), in which case the inner generator loop returns control to the outer
+loop and informs it that this generator is done (by returning `true`).
+
+A generator may also respond by signaling that it needs some data from the VM.
+This is implemented through a request-response pattern, in which the generator
+returns a `Yielded` message containing a [`VMRequest`][]. These requests can be
+very simple ("Tell me the current store path") or more complex ("Call this Nix
+function with these values").
+
+Requests are divided into two classes: Same-frame requests (requests that can be
+responded to *without* returning control to the outer loop, i.e. without
+executing a *different* frame), and multi-frame generator requests. Based on the
+type of request, the inner generator loop will either handle it right away and
+send the response in a new `resume_with` call, or return `false` to the outer
+generator loop after setting up the frame stack.
+
+Most of this logic is implemented in cl/8148.
+
+[`Gen::resume`]: https://docs.rs/genawaiter/0.99.1/genawaiter/rc/struct.Gen.html#method.resume_with
+[`VMRequest`]: https://cs.tvl.fyi/depot@2696839770c1ccb62929ff2575a633c07f5c9593/-/blob/tvix/eval/src/vm/generators.rs?L44
+[`VMResponse`]: https://cs.tvl.fyi/depot@2696839770c1ccb62929ff2575a633c07f5c9593/-/blob/tvix/eval/src/vm/generators.rs?L169
+
+## Advantages & Disadvantages of the approach
+
+This approach has several advantages:
+
+* The execution model is much simpler than before, making it fairly
+  straightforward to build up a mental model of what the VM does.
+
+* All "out of band requests" inside the VM are handled through the same
+  abstraction (generators).
+
+* Implementation is not difficult, albeit a little verbose in some cases (we
+  can argue about whether or not to introduce macros for simplifying it).
+
+* Several parts of the VM execution are now much easier to document,
+  potentially letting us onboard tvix-eval contributors faster.
+
+* The linear VM execution itself is much easier to trace now, with for example
+  the `RuntimeObserver` (and by extension `tvixbolt`) giving much clearer
+  output now.
+
+But it also comes with some disadvantages:
+
+* Even though we "only" use the `async` keyword without a full async-I/O
+  runtime, we still encounter many of the drawbacks of the fragmented Rust
+  async ecosystem.
+
+  The biggest issue with this is that parts of the standard library become
+  unavailable to us, for example the built-in `Vec::sort_by` can no longer be
+  used for sorting in Nix because our comparators themselves are `async`.
+
+  This led us to having to implement some logic on our own, as the design of
+  `async` in Rust even makes it difficult to provide usecase-generic
+  implementations of concepts like sorting.
+
+* We need to allocate quite a few new structures on the heap in order to drive
+  generators, as generators involve storing `Future` types (with unknown
+  sizes) inside of structs.
+
+  In initial testing this seems to make no significant difference in
+  performance (our performance in an actual nixpkgs-eval is still bottlenecked
+  by I/O concerns and reference scanning), but is something to keep in mind
+  later on when we start optimising more after the low-hanging fruits have
+  been reaped.
 
 ## Alternatives considered
 
@@ -70,7 +297,19 @@ the store][wfs].
    implementation to accomodate problems as they show up. This is not
    preferred as the code is already getting messy.
 
-2. ... ?
+2. Making tvix-eval a fully `async` project, pulling in something like Tokio
+   or `async-std` as a runtime. This is not preferred due to the massively
+   increased complexity of those solutions, and all the known issues of fully
+   buying in to the async ecosystem.
+
+   tvix-eval fundamentally should work for use-cases besides building Nix
+   packages (e.g. for `//tvix/serde`), and its profile should be as slim as
+   possible.
+
+3. Convincing the Rust developers that Rust needs a way to guarantee
+   constant-stack-depth tail calls through something like a `tailcall`
+   keyword.
 
+4. ... ?
 
-[wfs]: https://docs.google.com/document/d/1Zuw9UdMy95hcqsd-KudTw5yeeUkEXrqOi0rooGB7GWA/edit
+[`genawaiter`]: https://docs.rs/genawaiter/
-- 
cgit 1.4.1