author Ryan Lahfa <tvl@lahfa.xyz> 2024-01-05T23·41+0100
committer clbot <clbot@tvl.fyi> 2024-01-20T17·16+0000
commit 1f1a42b4da34bb2ad0cd75d6c822e2d24a19c0a2
tree d635b4b831c6ea80770540ad74eb52840ec3843d
parent e98ea31bbd17e71566d56810fa9c1d08960461ad
feat(tvix/castore): ingestion does DFS and inverts it (r/7430)

To make use of the filtering feature, we need to revert the internal
walker to a real DFS.

We will therefore just invert the whole tree by storing all of its
contents in a level-keyed vector.

This is horribly expensive in memory. It is a compromise between CPU
and memory, and here is the fundamental reason why:

When you encounter a directory, it is either a leaf or not, i.e. it
either contains subdirectories or it does not.

To find out which, you can either:

- wait until you notice subdirectories under it, i.e. you need to store
  any intermediate nodes you see in the meantime -> memory penalty.
- call getdents or readdir on it to determine its subdirectories *now*
  -> CPU penalty and I/O penalty.

This change implements the first option: we pay in memory.

In practice, we pay O(number of nodes) in memory.
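
The level-keyed vector can be pictured with a minimal sketch. This is
not the code from this change: `Entry`, `entry` and `invert_by_level`
are hypothetical names, and the walker is assumed to yield entries in
DFS pre-order together with their depth. Every entry is buffered, which
is where the O(number of nodes) memory cost comes from.

```rust
/// Hypothetical stand-in for an entry yielded by the directory walker.
#[derive(Debug, Clone, PartialEq)]
struct Entry {
    path: String,
    depth: usize,
}

/// Hypothetical convenience constructor used in the usage note.
fn entry(path: &str, depth: usize) -> Entry {
    Entry { path: path.to_string(), depth }
}

/// Buffer every entry in a vector keyed by depth, then emit the levels
/// deepest first, so that children always come before their parents.
fn invert_by_level(walk: &[Entry]) -> Vec<Entry> {
    let mut levels: Vec<Vec<Entry>> = Vec::new();
    for e in walk {
        // Grow the outer vector until it has a bucket for this depth.
        if levels.len() <= e.depth {
            levels.resize_with(e.depth + 1, Vec::new);
        }
        levels[e.depth].push(e.clone());
    }
    // Deepest level first; order within a level follows the walk order.
    levels.into_iter().rev().flatten().collect()
}
```

Feeding it a pre-order walk of a small tree (A, A/B, A/B/D, A/C, A/C/E,
A/C/F) yields the depth-2 entries first, then the depth-1 entries, and
the root last.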

There's a smarter, albeit much more complicated, algorithm that stores
only O(\sum_i #siblings(p_i)) nodes, where (p_1, ..., p_n) is the path
to a leaf.

For the following tree:

             A
            / \
           B   C
          /   / \
         D   E   F

We would never store D, E and F all at once, but only E and F at a given
time. We would still store B and C no matter what.
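
To make the two bounds concrete for the tree above, here is a small
sketch that only computes the counts; it does not implement the smarter
algorithm. The `Tree` type and all function names are hypothetical, and
it assumes that #siblings(p_i) counts p_i itself and that the root
counts as 1.

```rust
use std::collections::HashMap;

/// Hypothetical toy tree: node name -> names of its children.
type Tree = HashMap<&'static str, Vec<&'static str>>;

/// The example tree from the commit message.
fn example_tree() -> Tree {
    HashMap::from([
        ("A", vec!["B", "C"]),
        ("B", vec!["D"]),
        ("C", vec!["E", "F"]),
    ])
}

/// Children of `node`, or an empty slice for a leaf.
fn children<'a>(tree: &'a Tree, node: &str) -> &'a [&'static str] {
    match tree.get(node) {
        Some(cs) => cs.as_slice(),
        None => &[],
    }
}

/// Number of nodes in the subtree rooted at `node`: what the
/// level-keyed vector ends up holding.
fn node_count(tree: &Tree, node: &str) -> usize {
    1 + children(tree, node)
        .iter()
        .map(|c| node_count(tree, c))
        .sum::<usize>()
}

/// Largest sum of sibling-group sizes along any root-to-leaf path.
/// `group_size` is the size of the sibling group `node` belongs to.
fn peak_sibling_sum(tree: &Tree, node: &str, group_size: usize) -> usize {
    let cs = children(tree, node);
    let below = cs
        .iter()
        .map(|c| peak_sibling_sum(tree, c, cs.len()))
        .max()
        .unwrap_or(0);
    group_size + below
}
```

Under these assumptions, `node_count(&example_tree(), "A")` is 6, while
`peak_sibling_sum(&example_tree(), "A", 1)` is 5, reached on the path
A, C, E (1 + 2 + 2). The gap is small here and grows for trees with
many leaves spread over many directories.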

Change-Id: I456ed1c3f0db493e018ba1182665d84bebe29c11
Reviewed-on: https://cl.tvl.fyi/c/depot/+/10567
Tested-by: BuildkiteCI
Autosubmit: raitobezarius <tvl@lahfa.xyz>
Reviewed-by: flokli <flokli@flokli.de>