* Enforce the U+0000 to U+10FFFF range in `count` and throw an error if
the given codepoint falls outside it (encoding U+0000 still won't work,
of course, but this is Nix's fault: its strings can't contain NUL bytes).
* Check whether the produced bytes are well-formed and output an error if
not. This indicates that the codepoint can't be encoded as UTF-8, like
U+D800, which is a surrogate codepoint reserved for UTF-16.
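For illustration only, here is a minimal Python sketch of the two checks
described above; it is not the Nix code from this change. The name
`encode` and the error messages are assumptions, and the well-formedness
check is delegated to Python's strict UTF-8 decoder instead of a
hand-written byte predicate.

```python
def encode(codepoint: int) -> bytes:
    """Encode one codepoint as UTF-8, rejecting anything not encodable."""
    # Check 1: enforce the U+0000 to U+10FFFF range up front.
    if not 0 <= codepoint <= 0x10FFFF:
        raise ValueError(f"codepoint {codepoint:#x} is outside U+0000..U+10FFFF")

    # Build the bytes purely from the bit pattern, without validating them.
    if codepoint < 0x80:
        raw = bytes([codepoint])
    elif codepoint < 0x800:
        raw = bytes([0xC0 | (codepoint >> 6), 0x80 | (codepoint & 0x3F)])
    elif codepoint < 0x10000:
        raw = bytes([
            0xE0 | (codepoint >> 12),
            0x80 | ((codepoint >> 6) & 0x3F),
            0x80 | (codepoint & 0x3F),
        ])
    else:
        raw = bytes([
            0xF0 | (codepoint >> 18),
            0x80 | ((codepoint >> 12) & 0x3F),
            0x80 | ((codepoint >> 6) & 0x3F),
            0x80 | (codepoint & 0x3F),
        ])

    # Check 2: the produced bytes must be well-formed UTF-8. Surrogates
    # such as U+D800 pass the range check but fail here.
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as err:
        raise ValueError(f"U+{codepoint:04X} can't be encoded as UTF-8") from err
    return raw
```

With this sketch, `encode(0x20AC)` yields the three bytes `E2 82 AC`,
while `encode(0xD800)` and `encode(0x110000)` both raise.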
Change-Id: I18336e527484580f28cbfe784d51718ee15c5477
|
|
Previously we would check the first byte only when trying to figure out
the predicate for the second byte. If the first byte was invalid, we'd
then throw with a helpful error message. However, this made
wellFormedByte a very weird function, since it would either return a
boolean or throw.
At the expense of doing the same check twice, we now check the first
byte when it is first passed and always return a boolean.
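A rough Python sketch of a predicate with this shape, assuming a
hypothetical signature `well_formed_byte(first, pos, byte)`; the actual
arguments of wellFormedByte may differ. The byte ranges follow the
Unicode Standard's table of well-formed UTF-8 byte sequences. Note that
an invalid first byte is rejected at position 0 and again when choosing
the range for position 1, and that neither case throws.

```python
def well_formed_byte(first: int, pos: int, byte: int) -> bool:
    """Return whether `byte` may appear at index `pos` of a sequence
    whose first byte is `first`. Always returns a bool, never raises.
    Sequence length is not checked here."""
    if pos == 0:
        # Valid lead bytes: ASCII, or 0xC2..0xF4 for multi-byte sequences.
        return byte <= 0x7F or 0xC2 <= byte <= 0xF4
    if pos == 1:
        # The allowed range for the second byte depends on the first.
        if first == 0xE0:
            low, high = 0xA0, 0xBF  # excludes overlong 3-byte forms
        elif first == 0xED:
            low, high = 0x80, 0x9F  # excludes surrogates U+D800..U+DFFF
        elif first == 0xF0:
            low, high = 0x90, 0xBF  # excludes overlong 4-byte forms
        elif first == 0xF4:
            low, high = 0x80, 0x8F  # excludes codepoints above U+10FFFF
        elif 0xC2 <= first <= 0xF4:
            low, high = 0x80, 0xBF
        else:
            # Same check as for pos == 0, done a second time.
            return False
        return low <= byte <= high
    # Third and fourth bytes are plain continuation bytes.
    return 0x80 <= byte <= 0xBF
```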
Change-Id: I32ab6051c844711849e5b4a115e2511b53682baa
|
|
This implementation is still a bit rough as it doesn't check whether the
produced string is valid UTF-8. It may not be if an invalid Unicode
codepoint is passed.
Change-Id: Ibaa91dafa8937142ef704a175efe967b62e3ee7b
|
|
This is not really used anywhere and kind of useless. A better
decodeSafe would never return null and instead make use of replacement
characters to represent invalid bytes in the input.
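As a point of comparison, this is the behaviour Python's built-in
decoder offers with `errors="replace"`. The wrapper name `decode_safe`
is hypothetical and only illustrates the idea; nothing like it exists
in the Nix library.

```python
def decode_safe(data: bytes) -> str:
    """Decode UTF-8, substituting U+FFFD for every invalid byte
    instead of signalling failure with a null-like value."""
    return data.decode("utf-8", errors="replace")
```

For example, `decode_safe(b"a\xffb")` returns `"a\ufffdb"`: the stray
`0xFF` byte becomes a replacement character and the rest of the input
is preserved.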
Change-Id: Ib4111529bf0e472dbfa720a5d0b939c2d2511de5
|
|
Change-Id: I8de9cd28c822ac5befbcd16e118440cd13cd86e9
|
|
builtins.genericClosure is quite a powerful (and undocumented) Nix
primop: it repeatedly applies a function to the values it produces and
collects them into a list. Additionally, individual results can be
identified via a key attribute.
Since genericClosure only ever creates a single list value internally,
we can eliminate a huge performance bottleneck when building a list in a
recursive algorithm: list concatenation. Because Nix needs to copy the
entire chunk of memory used internally to represent the list, building
big lists one element at a time grinds Nix to a halt.
After rewriting decode using genericClosure, decoding the LaTeX source
of my 20-page term paper now takes 2s instead of 14min.
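To make the behaviour described above concrete, here is a small Python
model of genericClosure as this message describes it. It is an
approximation for illustration, not Nix's implementation: the
`start_set` and `operator` parameters mirror the primop's `startSet`
and `operator` attributes, and processing order is assumed to be
first-in, first-out.

```python
from collections import deque


def generic_closure(start_set, operator):
    """Apply `operator` to every item it produces, collecting each item
    once (identified by its "key") into a single result list."""
    seen = set()
    result = []  # the only list that is ever built
    queue = deque(start_set)
    while queue:
        item = queue.popleft()
        if item["key"] in seen:
            continue  # items with an already known key are dropped
        seen.add(item["key"])
        result.append(item)
        queue.extend(operator(item))
    return result


# Walk a byte string one index at a time. Each item only names its
# successor, so no intermediate list is ever concatenated.
data = b"abc"
items = generic_closure(
    [{"key": 0}],
    lambda item: [{"key": item["key"] + 1}] if item["key"] + 1 < len(data) else [],
)
byte_values = [data[item["key"]] for item in items]
```

Here `byte_values` ends up as `[97, 98, 99]`. A decoder written this
way would carry its state in each item next to the key.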
Change-Id: I33847e4e7dd95d7f4d78ac83eb0d74a9867bfe80
|
|
users.sterni.nix.utf8 implements UTF-8 decoding in pure Nix. We
implement the decoding as a simple state machine which is fed one byte
at a time. Decoding whole strings is possible by repeatedly calling
step. This is done in decode, which uses builtins.foldl' to get around
recursion restrictions and a neat trick using builtins.deepSeq, which
puck showed me, to limit the size of the thunks in a foldl' (which can
also cause a stack overflow).
This makes decoding arbitrarily large UTF-8 files into codepoints using
Nix theoretically possible, but it is not really practical: decoding a
36KB LaTeX file I had lying around takes ~160s on my laptop.
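The shape of such a byte-at-a-time state machine can be sketched in
Python as follows. This is an illustration under assumptions, not a
port of the Nix code: the state layout is invented, `functools.reduce`
stands in for builtins.foldl', and only lead and continuation bit
patterns are checked, so overlong forms and surrogates are not
rejected here.

```python
from functools import reduce


def step(state, byte):
    """Feed one byte into the decoder and return the next state.
    State is (partial codepoint, continuation bytes still expected,
    list of finished codepoints)."""
    codepoint, remaining, done = state
    if remaining == 0:
        if byte < 0x80:  # single-byte (ASCII) sequence
            done.append(byte)
            return (0, 0, done)
        if byte >> 5 == 0b110:  # lead byte of a 2-byte sequence
            return (byte & 0x1F, 1, done)
        if byte >> 4 == 0b1110:  # lead byte of a 3-byte sequence
            return (byte & 0x0F, 2, done)
        if byte >> 3 == 0b11110:  # lead byte of a 4-byte sequence
            return (byte & 0x07, 3, done)
        raise ValueError(f"unexpected byte {byte:#04x} at start of sequence")
    if byte >> 6 != 0b10:
        raise ValueError(f"expected continuation byte, got {byte:#04x}")
    codepoint = (codepoint << 6) | (byte & 0x3F)
    if remaining == 1:  # sequence complete
        done.append(codepoint)
        return (0, 0, done)
    return (codepoint, remaining - 1, done)


def decode(data: bytes):
    """Decode a whole byte string by folding `step` over it."""
    _, remaining, done = reduce(step, data, (0, 0, []))
    if remaining != 0:
        raise ValueError("input ends in the middle of a sequence")
    return done
```

For instance, `decode("a€".encode("utf-8"))` gives `[0x61, 0x20AC]`.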
Change-Id: Iab8c973dac89074ec280b4880a7408e0b3d19bc7
Reviewed-on: https://cl.tvl.fyi/c/depot/+/2590
Tested-by: BuildkiteCI
Reviewed-by: sterni <sternenseemann@systemli.org>
|