about summary refs log tree commit diff
path: root/tvix/cli/src/refscan.rs (follow)
AgeCommit message (Collapse)AuthorFilesLines
2023-07-31 r/6447 docs(tvix/cli/refscan): fix commentFlorian Klink1-2/+2
The comment still mentions Aho-Corasick, which isn't correct. Change-Id: I846a2ca9ea7075c2456ca6ef04a132ff6161227a Reviewed-on: https://cl.tvl.fyi/c/depot/+/8991 Autosubmit: flokli <flokli@flokli.de> Tested-by: BuildkiteCI Reviewed-by: tazjin <tazjin@tvl.su>
2023-06-02 r/6227 fix(tvix/cli): fix refscan when no paths are referencedLinus Heckemann1-4/+21
Before, the construction of a TwoByteWM would panic when no patterns were provided, as in `tvix --expr 'builtins.toFile "snens" "soos"'`. Change-Id: I25ed498c475523aec5baf8683b23059fadabb21c Reviewed-on: https://cl.tvl.fyi/c/depot/+/8697 Reviewed-by: tazjin <tazjin@tvl.su> Autosubmit: tazjin <tazjin@tvl.su> Tested-by: BuildkiteCI
2023-02-02 r/5827 fix(tvix/cli): keep tracking full paths in known_pathsVincent Ambo1-10/+10
We need to distinguish explicitly between the paths used for the scanner, and the paths that populate the derivation inputs. The full paths must be accessible from the result of the refscanner to populate drv fields correctly. This was previously hidden by debug changes that masked actual IO operations with no-ops. Change-Id: I037af6e6bbe2b573034d695f8779bee1b56bc125 Reviewed-on: https://cl.tvl.fyi/c/depot/+/8022 Reviewed-by: flokli <flokli@flokli.de> Tested-by: BuildkiteCI
2023-02-02 r/5825 refactor(tvix/cli): use Wu-Manber string scanning for drv referencesVincent Ambo1-51/+10
Switch out the string-scanning algorithm used in the reference scanner. The construction of aho-corasick automata made up the vast majority of runtime when evaluating nixpkgs previously. While the actual scanning with a constructed automaton is relatively fast, we almost never scan for the same set of strings twice and the cost is not worth it. An algorithm that better matches our needs is the Wu-Manber multiple string match algorithm, which works efficiently on *long* and *random* strings of the *same length*, which describes store paths (up to their hash component). This switches the refscanner crate to a Rust implementation[0][1] of this algorithm. This has several implications: 1. This crate does not provide a way to scan streams. I'm not sure if this is an inherent problem with the algorithm (probably not, but it would need buffering). Either way, related functions and tests (which were actually unused) have been removed. 2. All strings need to be of the same length. For this reason, we truncate the known paths after their hash part (they are still unique, of course). 3. Passing an empty set of matches, or a match that is shorter than the length of a store path, causes the crate to panic. We safeguard against this by completely skipping the refscanning if there are no known paths (i.e. when evaluating the first derivation of an eval), and by bailing out of scanning a string that is shorter than a store path. On the upside, this reduces overall runtime to less 1/5 of what it was before when evaluating `pkgs.stdenv.drvPath`. [0]: Frankly, it's a random, research-grade MIT-licensed crate that I found on Github: https://github.com/jneem/wu-manber [1]: We probably want to rewrite or at least fork the above crate, and add things like a three-byte wide scanner. Evaluating large portions of nixpkgs can easily lead to more than 65k derivations being scanned for. Change-Id: I08926778e1e5d5a87fc9ac26e0437aed8bbd9eb0 Reviewed-on: https://cl.tvl.fyi/c/depot/+/8017 Tested-by: BuildkiteCI Reviewed-by: flokli <flokli@flokli.de>
2023-01-17 r/5671 refactor(tvix/cli): reference scanner owns all the stringsVincent Ambo1-48/+27
This gets very complex very quickly otherwise, as all the construction paths for a reference scanner and all the access patterns for the KnownPaths structure are not yet fully understood. Change-Id: Ibadf1f18b476695f3c286fc6896ae557760edf63 Reviewed-on: https://cl.tvl.fyi/c/depot/+/7827 Reviewed-by: flokli <flokli@flokli.de> Tested-by: BuildkiteCI
2023-01-17 r/5669 feat(tvix/cli): add known_paths moduleVincent Ambo1-5/+28
This module implements types used to track the set of known paths in the context of an evaluation. These are used to determine the build references of a derivation. Change-Id: I81e15ae33632784e699128916485751613b231a3 Reviewed-on: https://cl.tvl.fyi/c/depot/+/7816 Tested-by: BuildkiteCI Reviewed-by: flokli <flokli@flokli.de>
2023-01-11 r/5644 feat(tvix/refscan): implement reference scanning over data streamsVincent Ambo1-0/+40
Using yet more machinery from the pretty comprehensive aho_corasick crate, this makes it possible to pass anything implementing `io::Read` to the `ReferenceScanner` to accumulate matches. Change-Id: I5b0e28eb44ea4df24010f40831e29f2cbb8c1f80 Reviewed-on: https://cl.tvl.fyi/c/depot/+/7810 Autosubmit: tazjin <tazjin@tvl.su> Reviewed-by: flokli <flokli@flokli.de> Tested-by: BuildkiteCI
2023-01-11 r/5643 feat(tvix/cli): implement initial refscan moduleVincent Ambo1-0/+97
This module implements a ReferenceScanner struct which uses the aho_corasick crate to scan string inputs for known, non-overlapping candidates (store paths, in our case). I experimented with several different APIs, and landed on this version with an initial accumulator in the scanner. The scanner is instantiated from the candidates and "fed" all the strings, then consumed by the caller to retrieve the result. Right now only things that look vaguely like bytestrings can be fed to the scanner, there is no streaming support in the API yet. Change-Id: I7782f0f0df5fc64bccd813aa14712f5525b0168c Reviewed-on: https://cl.tvl.fyi/c/depot/+/7808 Autosubmit: tazjin <tazjin@tvl.su> Reviewed-by: flokli <flokli@flokli.de> Tested-by: BuildkiteCI