# tvix-eval VM loop
This document describes the new tvix-eval VM execution loop implemented in the chain of CLs centred around cl/8104.
## Background
The VM loop implemented in Tvix prior to cl/8104 had several functions:
- Advancing the instruction pointer for a chunk of Tvix bytecode and executing instructions in a loop until a result was yielded.
- Tracking Nix call frames as functions/thunks were entered/exited.
- Catching trampoline requests returned from instructions to force suspended thunks without increasing stack size where possible.
- Handling trampolines through an inner trampoline loop, switching between a code execution mode and execution of subsequent trampolines.
This implementation of the trampoline logic was added onto the existing VM, which previously always recursed for thunk forcing. There are some cases (for example, values that need to be forced inside the execution of a builtin) where trampolines could not be used, and the VM recursed anyway.
As a result of this trampoline logic being added "on top" of the existing VM loop, the code became quite difficult to understand. This led to several bugs, for example: b/251, b/246, b/245, and b/238.

These bugs were tricky to deal with, as we had to make the VM do things that are difficult to fit into its model. We could of course keep extending the trampoline logic to accommodate all sorts of concepts (such as finalisers), but that would not solve the root problem.
## New VM loop
In cl/8104, a new, unified solution is implemented which makes the VM capable of evaluating everything without increasing the call stack size.
This is done by introducing a new frame stack in the VM, on which execution frames are enqueued. Each frame is either:
- A bytecode frame, consisting of Tvix bytecode that evaluates compiled Nix code.
- A generator frame, consisting of some VM logic implemented in pure Rust code that can be suspended when it hits a point where the VM would previously need to recurse.
We do this by making use of the `async` keyword in Rust, but notably without introducing asynchronous I/O or concurrency in tvix-eval (the complexity of which is currently undesirable for us).

Specifically, when writing a Rust function that uses the `async` keyword, such as:
```rust
async fn some_builtin(input: Value) -> Result<Value, ErrorKind> {
    let mut out = NixList::new();

    for element in input.to_list()? {
        let result = do_something_that_requires_the_vm(element).await;
        out.push(result);
    }

    Ok(out)
}
```
The compiler actually generates a state machine under the hood which allows the execution of that function to be suspended whenever it hits an `await`.
We use the `genawaiter` crate, which gives us a data structure and a simple interface for getting instances of these state machines that can be stored in a struct (in our case, a generator frame).
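As a rough illustration (a toy example with made-up values, not tvix-eval code), driving such a `genawaiter` state machine looks like this; each `yield_` suspends the generator, and the value passed to the next `resume_with` call becomes the result of the corresponding `.await`:

```rust
use genawaiter::{rc::Gen, GeneratorState};

fn main() {
    // A generator that asks its driver to double numbers for it.
    // `co.yield_(n)` suspends the generator; whatever is passed to the
    // next `resume_with` call becomes the result of the `.await`.
    let mut gen = Gen::new(|co| async move {
        let mut sum = 0;
        for n in [1, 2, 3] {
            let doubled: i32 = co.yield_(n).await;
            sum += doubled;
        }
        sum
    });

    // Drive the generator to completion, doubling each yielded value.
    // The argument of the very first resume is not seen by the generator.
    let mut response = 0;
    loop {
        match gen.resume_with(response) {
            GeneratorState::Yielded(n) => response = n * 2,
            GeneratorState::Complete(sum) => {
                println!("sum of doubled values: {}", sum); // 2 + 4 + 6 = 12
                break;
            }
        }
    }
}
```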
The execution of the VM then becomes the execution of an outer loop, which is responsible for selecting the next frame to execute, and two inner loops, which drive the execution of a bytecode frame or generator frame forward until it either yields a value or asks to be suspended in favour of another frame.
All "communication" between frames happens solely through values left on the stack: Whenever a frame of either type runs to completion, it is expected to leave a single value on the stack. It follows that the whole VM, upon completion of the last (or initial, depending on your perspective) frame yields its result as the return value.
The core of the VM restructuring is in cl/8104, unfortunately one of the largest single-commit changes we have had to make yet, as it touches pretty much all areas of tvix-eval. The introduction of the generators, and of the message/response system we built to request something from the VM, suspend a generator, and wait for the return, is in cl/8148.
The next sections describe in detail how the three different loops work.
## Outer VM loop
The outer VM loop is responsible for selecting the next frame to run, and dispatching it correctly to inner loops, as well as determining when to shut down the VM and return the final result.
```
                          ╭──────────────────╮
                 ╭────────┤ match frame kind ├──────╮
                 │        ╰──────────────────╯      │
                 │                                  │
    ┏━━━━━━━━━━━━┷━━━━━┓                ╭───────────┴───────────╮
───►┃ frame_stack.pop()┃                ▼                       ▼
    ┗━━━━━━━━━━━━━━━━━━┛        ┏━━━━━━━━━━━━━━━━┓      ┏━━━━━━━━━━━━━━━━━┓
                 ▲              ┃ bytecode frame ┃      ┃ generator frame ┃
                 │              ┗━━━━━━━━┯━━━━━━━┛      ┗━━━━━━━━┯━━━━━━━━┛
                 │[yes, cont.]           │                       │
                 │                       ▼                       ▼
    ┏━━━━━━━━┓   │              ╔════════════════╗      ╔═════════════════╗
◄───┨ return ┃   │              ║ inner bytecode ║      ║ inner generator ║
    ┗━━━━━━━━┛   │              ║      loop      ║      ║      loop       ║
        ▲        │              ╚════════╤═══════╝      ╚════════╤════════╝
        │   ╭────┴─────╮                 │                       │
        │   │ has next │                 ╰───────────┬───────────╯
   [no] ╰───┤ frame?   │                             │
            ╰────┬─────╯                             ▼
                 │                         ┏━━━━━━━━━━━━━━━━━┓
                 │                         ┃ frame completed ┃
                 ╰─────────────────────────┨ or suspended    ┃
                                           ┗━━━━━━━━━━━━━━━━━┛
```
Initially, the VM always pops a frame from the frame stack and then inspects the type of frame it found. As a consequence, the next frame to execute is always the frame at the top of the stack, and setting up a VM for code execution is done by leaving a bytecode frame with the code to execute on the stack and passing control to the outer loop.
Control is dispatched to either of the inner loops (depending on the type of frame) and the cycle continues once they return.
When an inner loop returns, it has either finished its execution (and left its result value on the value stack), or its frame has requested to be suspended.
Frames request suspension by re-enqueueing themselves through VM helper methods, and then leaving the frame they want to run on top of themselves in the frame stack before yielding control back to the outer loop.
The inner control loops inform the outer loop about whether the frame was completed or suspended by returning a boolean.
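To make that shape concrete, here is a minimal, self-contained sketch of the outer loop. All names (`Frame`, `Vm`, the inner-loop stubs) are hypothetical stand-ins, not the actual tvix-eval types:

```rust
// Hypothetical stand-ins for the two frame kinds.
enum Frame {
    Bytecode(Vec<u8>), // a chunk of compiled bytecode
    Generator(String), // a suspended Rust state machine
}

struct Vm {
    frame_stack: Vec<Frame>,
    stack: Vec<i64>, // value stack; i64 stands in for tvix's Value
}

impl Vm {
    // Inner-loop stubs: they return true if the frame ran to completion
    // (leaving its result on the value stack), and false if the frame
    // suspended itself after re-enqueueing the frames to run next.
    fn run_bytecode(&mut self, _chunk: Vec<u8>) -> bool {
        self.stack.push(42); // pretend the chunk evaluated to 42
        true
    }

    fn run_generator(&mut self, _gen: String) -> bool {
        self.stack.push(42);
        true
    }

    fn run(&mut self) -> i64 {
        // Pop the next frame and dispatch on its kind until the frame
        // stack runs empty; suspended frames have already re-enqueued
        // themselves, so the loop simply picks them up again later.
        while let Some(frame) = self.frame_stack.pop() {
            let _completed = match frame {
                Frame::Bytecode(chunk) => self.run_bytecode(chunk),
                Frame::Generator(gen) => self.run_generator(gen),
            };
        }

        // The last frame to complete leaves the overall result behind.
        self.stack.pop().expect("VM must produce a result")
    }
}

fn main() {
    let mut vm = Vm {
        frame_stack: vec![
            Frame::Generator("suspended thunk".into()),
            Frame::Bytecode(vec![]),
        ],
        stack: vec![],
    };
    println!("result: {}", vm.run()); // result: 42
}
```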
## Inner bytecode loop
The inner bytecode loop drives the execution of some Tvix bytecode by continuously looking at the next instruction to execute, and dispatching to the instruction handler.
```
    ┏━━━━━━━━━━━━━┓
◄───┨ return true ┃
    ┗━━━━━━━━━━━━━┛
           ▲
      ╔════╧═════╗
      ║ OpReturn ║
      ╚══════════╝
           ▲
           │
    ┏━━━━━━┷━━━━━━━┓          ╔═════════════════════╗
───►┃ inspect next ┃─────────►║ execute instruction ║
    ┃ instruction  ┃          ╚══════════╤══════════╝
    ┗━━━━━━━━━━━━━━┛                     │
           ▲                             ▼
           │                       ╭───────────╮
           ╰───────────────────────┤ suspends? │
                 [no]              ╰─────┬─────╯
                                         │ [yes]
                                         ▼
                                  ┏━━━━━━━━━━━━━━┓
                              ◄───┨ return false ┃
                                  ┗━━━━━━━━━━━━━━┛
```
With this refactoring, the compiler now emits a special `OpReturn` instruction at the end of bytecode chunks. This is a signal to the runtime that the chunk has completed and that its current value should be returned, without having to perform instruction pointer arithmetic.
When `OpReturn` is encountered, the inner bytecode loop returns control to the outer loop and informs it (by returning `true`) that the bytecode frame has completed.
Any other instruction may also request a suspension of the bytecode frame (for example, instructions that need to force a value). In this case, the inner loop is responsible for setting up the frame stack correctly and returning `false` to inform the outer loop of the suspension.
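The following self-contained sketch shows the shape of this loop; the `OpCode` variants and the suspension mechanism are simplified stand-ins, not tvix-eval's real instruction set:

```rust
// Simplified stand-ins for Tvix bytecode; not the real instruction set.
#[derive(Clone, Copy)]
enum OpCode {
    OpConstant(i64), // push a constant onto the value stack
    OpAdd,           // add the top two stack values
    OpForce,         // an instruction that would suspend the frame
    OpReturn,        // the chunk is done; its value is on the stack
}

struct BytecodeFrame {
    code: Vec<OpCode>,
    ip: usize,
}

// Returns true if the frame completed, false if it suspended.
fn run_bytecode(frame: &mut BytecodeFrame, stack: &mut Vec<i64>) -> bool {
    loop {
        let op = frame.code[frame.ip];
        frame.ip += 1;

        match op {
            // OpReturn: report completion to the outer loop.
            OpCode::OpReturn => return true,

            OpCode::OpConstant(c) => stack.push(c),

            OpCode::OpAdd => {
                let (b, a) = (stack.pop().unwrap(), stack.pop().unwrap());
                stack.push(a + b);
            }

            // In the real VM, an instruction that needs to force a value
            // re-enqueues the current frame plus the frame to run first,
            // then reports the suspension by returning false.
            OpCode::OpForce => return false,
        }
    }
}

fn main() {
    let mut frame = BytecodeFrame {
        code: vec![
            OpCode::OpConstant(2),
            OpCode::OpConstant(3),
            OpCode::OpAdd,
            OpCode::OpReturn,
        ],
        ip: 0,
    };

    let mut stack = Vec::new();
    assert!(run_bytecode(&mut frame, &mut stack)); // the frame completed
    println!("result: {:?}", stack.pop()); // Some(5)
}
```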
## Inner generator loop
The inner generator loop is responsible for driving the execution of a generator frame by continuously calling `Gen::resume` until it requests a suspension (as a result of which control is returned to the outer loop), or until the generator is done and yields a value.
```
     ┏━━━━━━━━━━━━━━━━━┓
 ───►┃ gen.resume(msg) ┃◄─────────────────────╮
     ┗━━━━━━━━┯━━━━━━━━┛                      │
              │                               │
              ▼                               │
  ╭───────────┴────────────────╮              │
  │ inspect generator response │              │
  ╰───────────┬────────────────╯              │
              │                               │
           ╭──┴───────────╮                   │
           │ [Done]       │ [yielded]         │
           ▼              ▼                   │
    ┏━━━━━━━━━━━━━┓ ╭────────────╮     [yes]  │
◄───┨ return true ┃ │ same-frame │            │
    ┗━━━━━━━━━━━━━┛ │ request?   ├────────────╯
                    ╰─────┬──────╯
                          │ [no]
                          ▼
                   ┏━━━━━━━━━━━━━━┓
               ◄───┨ return false ┃
                   ┗━━━━━━━━━━━━━━┛
```
On each execution of a generator frame, `resume_with` is called with a `VMResponse` (i.e. a message from the VM to the generator). For a newly created generator, the initial message is just `Empty`.
A generator may then respond by signaling that it has finished execution (`Done`), in which case the inner generator loop returns control to the outer loop and informs it that this generator is done (by returning `true`).
A generator may also respond by signaling that it needs some data from the VM. This is implemented through a request-response pattern, in which the generator returns a `Yielded` message containing a `VMRequest`. These requests can be very simple ("Tell me the current store path") or more complex ("Call this Nix function with these values").
Requests are divided into two classes: same-frame requests (requests that can be responded to without returning control to the outer loop, i.e. without executing a different frame), and multi-frame generator requests. Based on the type of request, the inner generator loop will either handle it right away and send the response in a new `resume_with` call, or return `false` to the outer loop after setting up the frame stack.
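A sketch of this request/response protocol, with toy request types standing in for the real `VMRequest`/`VMResponse` enums, might look like this; the same-frame request is answered without ever leaving the inner loop:

```rust
use genawaiter::{rc::Gen, GeneratorState};

// Toy request/response types; the real VMRequest/VMResponse enums in
// tvix-eval carry many more variants.
enum VMRequest {
    UpperCase(String), // a "same-frame" request the loop answers itself
}

enum VMResponse {
    Empty,
    Value(String),
}

fn main() {
    // A generator that needs the VM's help to upper-case a string.
    let mut gen = Gen::new(|co| async move {
        match co.yield_(VMRequest::UpperCase("hello".into())).await {
            VMResponse::Value(s) => s,
            VMResponse::Empty => String::new(),
        }
    });

    // The inner generator loop: resume with a message, inspect the
    // generator's response, and answer same-frame requests right away.
    // Multi-frame requests would set up the frame stack and return false.
    let mut message = VMResponse::Empty; // the initial message is Empty
    loop {
        match gen.resume_with(message) {
            GeneratorState::Yielded(VMRequest::UpperCase(s)) => {
                message = VMResponse::Value(s.to_uppercase());
            }
            GeneratorState::Complete(value) => {
                println!("generator finished with: {}", value); // HELLO
                break; // corresponds to returning true to the outer loop
            }
        }
    }
}
```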
Most of this logic is implemented in cl/8148.
## Advantages & Disadvantages of the approach
This approach has several advantages:
- The execution model is much simpler than before, making it fairly straightforward to build up a mental model of what the VM does.
- All "out of band requests" inside the VM are handled through the same abstraction (generators).
- The implementation is not difficult, albeit a little verbose in some cases (we can argue about whether or not to introduce macros for simplifying it).
- Several parts of the VM execution are now much easier to document, potentially letting us onboard tvix-eval contributors faster.
- The linear VM execution itself is much easier to trace now, with for example the `RuntimeObserver` (and by extension `tvixbolt`) giving much clearer output.
But it also comes with some disadvantages:
- Even though we "only" use the `async` keyword without a full async-I/O runtime, we still encounter many of the drawbacks of the fragmented Rust async ecosystem. The biggest issue is that parts of the standard library become unavailable to us: for example, the built-in `Vec::sort_by` can no longer be used for sorting in Nix, because our comparators themselves are `async`. This led to us having to implement some logic on our own, as the design of `async` in Rust even makes it difficult to provide use-case-generic implementations of concepts like sorting (see the sketch after this list).

- We need to allocate quite a few new structures on the heap in order to drive generators, as generators involve storing `Future` types (with unknown sizes) inside of structs. In initial testing this seems to make no significant difference in performance (an actual nixpkgs-eval is still bottlenecked by I/O concerns and reference scanning), but it is something to keep in mind later on, when we start optimising more once the low-hanging fruit has been reaped.
## Alternatives considered
- Tacking more functionality onto the existing VM loop implementation to accommodate problems as they show up. This is not preferred, as the code is already getting messy.

- Making tvix-eval a fully `async` project, pulling in something like Tokio or `async-std` as a runtime. This is not preferred due to the massively increased complexity of those solutions, and all the known issues of fully buying in to the async ecosystem. tvix-eval should fundamentally work for use-cases besides building Nix packages (e.g. for `//tvix/serde`), and its profile should be as slim as possible.

- Convincing the Rust developers that Rust needs a way to guarantee constant-stack-depth tail calls through something like a `tailcall` keyword.

- ... ?