tvix-eval VM loop
=================

This document describes the new tvix-eval VM execution loop implemented in the
chain focusing around cl/8104.

## Background

The VM loop implemented in Tvix prior to cl/8104 had several functions:

1. Advancing the instruction pointer for a chunk of Tvix bytecode and
   executing instructions in a loop until a result was yielded.

2. Tracking Nix call frames as functions/thunks were entered/exited.

3. Catching trampoline requests returned from instructions to force suspended
   thunks without increasing stack size *where possible*.

4. Handling trampolines through an inner trampoline loop, switching between a
   code execution mode and execution of subsequent trampolines.

This implementation of the trampoline logic was added on to the existing VM,
which previously always recursed for thunk forcing. There are some cases (for
example values that need to be forced *inside* of the execution of a builtin)
where trampolines could not previously be used, and the VM recursed anyways.

As a result of this trampoline logic being added "on top" of the existing VM
loop the code became quite difficult to understand. This led to several bugs,
for example: b/251, b/246, b/245, and b/238.

These bugs were tricky to deal with, as we had to try and make the VM do
things that are somewhat difficult to fit into its model. We could of course
keep extending the trampoline logic to accommodate all sorts of concepts (such
as finalisers), but that seems like it does not solve the root problem.

## New VM loop

In cl/8104, a unified new solution is implemented with which the VM is capable
of evaluating everything without increasing the call stack size.

This is done by introducing a new frame stack in the VM, on which execution
frames are enqueued that are either:

1. A bytecode frame, consisting of Tvix bytecode that evaluates compiled Nix
   code.
2. A generator frame, consisting of some VM logic implemented in pure Rust
   code that can be *suspended* when it hits a point where the VM would
   previously need to recurse.

We do this by making use of the `async` *keyword* in Rust, but notably
*without* introducing asynchronous I/O or concurrency in tvix-eval (the
complexity of which is currently undesirable for us).

Specifically, when writing a Rust function that uses the `async` keyword, such
as:

```rust
async fn some_builtin(input: Value) -> Result<Value, ErrorKind> {
  let mut out = NixList::new();

  for element in input.to_list()? {
    let result = do_something_that_requires_the_vm(element).await;
    out.push(result);
  }

  Ok(out)
}
```

The compiler actually generates a state-machine under-the-hood which allows
the execution of that function to be *suspended* whenever it hits an `await`.

We use the [`genawaiter`][] crate that gives us a data structure and simple
interface for getting instances of these state machines that can be stored in
a struct (in our case, a *generator frame*).

The execution of the VM then becomes the execution of an *outer loop*, which
is responsible for selecting the next generator frame to execute, and two
*inner loops*, which drive the execution of a bytecode frame or generator
frame forward until it either yields a value or asks to be suspended in favour
of another frame.

All "communication" between frames happens solely through values left on the
stack: Whenever a frame of either type runs to completion, it is expected to
leave a *single* value on the stack. It follows that the whole VM, upon
completion of the last (or initial, depending on your perspective) frame
yields its result as the return value.

The core of the VM restructuring is cl/8104, unfortunately one of the largest
single commit changes we've had to make yet, as it touches pretty much all
areas of tvix-eval. The introduction of the generators and the
message/response system we built to request something from the VM, suspend a
generator, and wait for the return is in cl/8148.

The next sections describe in detail how the three different loops work.

### Outer VM loop

The outer VM loop is responsible for selecting the next frame to run, and
dispatching it correctly to inner loops, as well as determining when to shut
down the VM and return the final result.

```
                          ╭──────────────────╮
                 ╭────────┤ match frame kind ├──────╮
                 │        ╰──────────────────╯      │
                 │                                  │
    ┏━━━━━━━━━━━━┷━━━━━┓                ╭───────────┴───────────╮
───►┃ frame_stack.pop()┃                ▼                       ▼
    ┗━━━━━━━━━━━━━━━━━━┛       ┏━━━━━━━━━━━━━━━━┓      ┏━━━━━━━━━━━━━━━━━┓
                 ▲             ┃ bytecode frame ┃      ┃ generator frame ┃
                 │             ┗━━━━━━━━┯━━━━━━━┛      ┗━━━━━━━━┯━━━━━━━━┛
                 │[yes, cont.]          │                       │
                 │                      ▼                       ▼
    ┏━━━━━━━━┓   │             ╔════════════════╗      ╔═════════════════╗
◄───┨ return ┃   │             ║ inner bytecode ║      ║ inner generator ║
    ┗━━━━━━━━┛   │             ║      loop      ║      ║      loop       ║
        ▲        │             ╚════════╤═══════╝      ╚════════╤════════╝
        │   ╭────┴─────╮                │                       │
        │   │ has next │                ╰───────────┬───────────╯
   [no] ╰───┤  frame?  │                            │
            ╰────┬─────╯                            ▼
                 │                         ┏━━━━━━━━━━━━━━━━━┓
                 │                         ┃ frame completed ┃
                 ╰─────────────────────────┨  or suspended   ┃
                                           ┗━━━━━━━━━━━━━━━━━┛
```

Initially, the VM always pops a frame from the frame stack and then inspects
the type of frame it found. As a consequence the next frame to execute is
always the frame at the top of the stack, and setting up a VM initially for
code execution is done by leaving a bytecode frame with the code to execute on
the stack and passing control to the outer loop.

Control is dispatched to either of the inner loops (depending on the type of
frame) and the cycle continues once they return.

When an inner loop returns, it has either finished its execution (and left its
result value on the *value stack*), or its frame has requested to be
suspended.

Frames request suspension by re-enqueueing *themselves* through VM helper
methods, and then leaving the frame they want to run *on top* of themselves in
the frame stack before yielding control back to the outer loop.

The inner control loops inform the outer loops about whether the frame has
been *completed* or *suspended* by returning a boolean.

### Inner bytecode loop

The inner bytecode loop drives the execution of some Tvix bytecode by
continously looking at the next instruction to execute, and dispatching to the
instruction handler.

```
   ┏━━━━━━━━━━━━━┓
◄──┨ return true ┃
   ┗━━━━━━━━━━━━━┛
          ▲
     ╔════╧═════╗
     ║ OpReturn ║
     ╚══════════╝
          ▲
          ╰──┬────────────────────────────╮
             │                            ▼
             │                 ╔═════════════════════╗
    ┏━━━━━━━━┷━━━━━┓           ║ execute instruction ║
───►┃ inspect next ┃           ╚══════════╤══════════╝
    ┃  instruction ┃                      │
    ┗━━━━━━━━━━━━━━┛                      │
             ▲                      ╭─────┴─────╮
             ╰──────────────────────┤ suspends? │
                       [no]         ╰─────┬─────╯
                                          │
                                          │
   ┏━━━━━━━━━━━━━━┓                       │
◄──┨ return false ┃───────────────────────╯
   ┗━━━━━━━━━━━━━━┛              [yes]
```

With this refactoring, the compiler now emits a special `OpReturn` instruction
at the end of bytecode chunks. This is a signal to the runtime that the chunk
has completed and that its current value should be returned, without having to
perform instruction pointer arithmetic.

When `OpReturn` is encountered, the inner bytecode loop returns control to the
outer loop and informs it (by returning `true`) that the bytecode frame has
completed.

Any other instruction may also request a suspension of the bytecode frame (for
example, instructions that need to force a value). In this case the inner loop
is responsible for setting up the frame stack correctly, and returning `false`
to inform the outer loop of the suspension

### Inner generator loop

The inner generator loop is responsible for driving the execution of a
generator frame by continously calling [`Gen::resume`][] until it requests a
suspension (as a result of which control is returned to the outer loop), or
until the generator is done and yields a value.

```
   ┏━━━━━━━━━━━━━┓
◄──┨ return true ┃ ◄───────────────────╮
   ┗━━━━━━━━━━━━━┛                     │
                                       │
                               [Done]  │
                    ╭──────────────────┴─────────╮
                    │ inspect generator response │◄────────────╮
                    ╰──────────────────┬─────────╯             │
                            [yielded]  │              ┏━━━━━━━━┷━━━━━━━━┓
                                       │              ┃ gen.resume(msg) ┃◄──
                                       ▼              ┗━━━━━━━━━━━━━━━━━┛
                                 ╭────────────╮                ▲
                                 │ same-frame │                │
                                 │  request?  ├────────────────╯
                                 ╰─────┬──────╯      [yes]
   ┏━━━━━━━━━━━━━━┓                    │
◄──┨ return false ┃ ◄──────────────────╯
   ┗━━━━━━━━━━━━━━┛                [no]
```

On each execution of a generator frame, `resume_with` is called with a
[`VMResponse`][] (i.e. a message *from* the VM *to* the generator). For a newly
created generator, the initial message is just `Empty`.

A generator may then respond by signaling that it has finished execution
(`Done`), in which case the inner generator loop returns control to the outer
loop and informs it that this generator is done (by returning `true`).

A generator may also respond by signaling that it needs some data from the VM.
This is implemented through a request-response pattern, in which the generator
returns a `Yielded` message containing a [`VMRequest`][]. These requests can be
very simple ("Tell me the current store path") or more complex ("Call this Nix
function with these values").

Requests are divided into two classes: Same-frame requests (requests that can be
responded to *without* returning control to the outer loop, i.e. without
executing a *different* frame), and multi-frame generator requests. Based on the
type of request, the inner generator loop will either handle it right away and
send the response in a new `resume_with` call, or return `false` to the outer
generator loop after setting up the frame stack.

Most of this logic is implemented in cl/8148.

[`Gen::resume`]: https://docs.rs/genawaiter/0.99.1/genawaiter/rc/struct.Gen.html#method.resume_with
[`VMRequest`]: https://cs.tvl.fyi/depot@2696839770c1ccb62929ff2575a633c07f5c9593/-/blob/tvix/eval/src/vm/generators.rs?L44
[`VMResponse`]: https://cs.tvl.fyi/depot@2696839770c1ccb62929ff2575a633c07f5c9593/-/blob/tvix/eval/src/vm/generators.rs?L169

## Advantages & Disadvantages of the approach

This approach has several advantages:

* The execution model is much simpler than before, making it fairly
  straightforward to build up a mental model of what the VM does.

* All "out of band requests" inside the VM are handled through the same
  abstraction (generators).

* Implementation is not difficult, albeit a little verbose in some cases (we
  can argue about whether or not to introduce macros for simplifying it).

* Several parts of the VM execution are now much easier to document,
  potentially letting us onboard tvix-eval contributors faster.

* The linear VM execution itself is much easier to trace now, with for example
  the `RuntimeObserver` (and by extension `tvixbolt`) giving much clearer
  output now.

But it also comes with some disadvantages:

* Even though we "only" use the `async` keyword without a full async-I/O
  runtime, we still encounter many of the drawbacks of the fragmented Rust
  async ecosystem.

  The biggest issue with this is that parts of the standard library become
  unavailable to us, for example the built-in `Vec::sort_by` can no longer be
  used for sorting in Nix because our comparators themselves are `async`.

  This led us to having to implement some logic on our own, as the design of
  `async` in Rust even makes it difficult to provide usecase-generic
  implementations of concepts like sorting.

* We need to allocate quite a few new structures on the heap in order to drive
  generators, as generators involve storing `Future` types (with unknown
  sizes) inside of structs.

  In initial testing this seems to make no significant difference in
  performance (our performance in an actual nixpkgs-eval is still bottlenecked
  by I/O concerns and reference scanning), but is something to keep in mind
  later on when we start optimising more after the low-hanging fruits have
  been reaped.

## Alternatives considered

1. Tacking on more functionality onto the existing VM loop
   implementation to accomodate problems as they show up. This is not
   preferred as the code is already getting messy.

2. Making tvix-eval a fully `async` project, pulling in something like Tokio
   or `async-std` as a runtime. This is not preferred due to the massively
   increased complexity of those solutions, and all the known issues of fully
   buying in to the async ecosystem.

   tvix-eval fundamentally should work for use-cases besides building Nix
   packages (e.g. for `//tvix/serde`), and its profile should be as slim as
   possible.

3. Convincing the Rust developers that Rust needs a way to guarantee
   constant-stack-depth tail calls through something like a `tailcall`
   keyword.

4. ... ?

[`genawaiter`]: https://docs.rs/genawaiter/