path: root/users/sterni/nix/utf8/default.nix
Age | Commit message | Author | Files | Lines
2022-01-31 | r/3723 style: format entire depot with nixpkgs-fmt | Vincent Ambo | 1 | -94/+106
This CL can be used to compare the style of nixpkgs-fmt against other formatters (nixpkgs, alejandra).

Change-Id: I87c6abff6bcb546b02ead15ad0405f81e01b6d9e
Reviewed-on: https://cl.tvl.fyi/c/depot/+/4397
Tested-by: BuildkiteCI
Reviewed-by: sterni <sternenseemann@systemli.org>
Reviewed-by: lukegb <lukegb@tvl.fyi>
Reviewed-by: wpcarro <wpcarro@gmail.com>
Reviewed-by: Profpatsch <mail@profpatsch.de>
Reviewed-by: kanepyork <rikingcoding@gmail.com>
Reviewed-by: tazjin <tazjin@tvl.su>
Reviewed-by: cynthia <cynthia@tvl.fyi>
Reviewed-by: edef <edef@edef.eu>
Reviewed-by: eta <tvl@eta.st>
Reviewed-by: grfn <grfn@gws.fyi>
2021-11-25 | r/3094 feat(sterni/nix/utf8): check if codepoint valid/encodeable | sterni | 1 | -2/+30
* Enforce the U+0000 to U+10FFFF range in `count` and throw an error if the given codepoint exceeds the range (encoding U+0000 won't work of course, but this is Nix's fault…).
* Check if the produced bytes are well formed and output an error if not. This indicates that the codepoint can't be encoded as UTF-8, like U+D800 which is reserved for UTF-16.

Change-Id: I18336e527484580f28cbfe784d51718ee15c5477
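The two checks can be sketched roughly as follows. This is an illustration only, not the code from this change: `byteCount`, `low6` and `encodeBytes` are hypothetical names, the result is a list of byte values rather than a Nix string (which sidesteps the U+0000 limitation), and surrogates are rejected with an explicit range test instead of validating the produced bytes. Nix has no hexadecimal literals, so all bounds are written in decimal.

```nix
let
  # Number of bytes needed for a codepoint; throws outside U+0000..U+10FFFF.
  byteCount = cp:
    if cp < 0 || cp > 1114111 # 0x10FFFF
    then throw "codepoint outside of U+0000..U+10FFFF"
    else if cp < 128 then 1 # up to U+007F
    else if cp < 2048 then 2 # up to U+07FF
    else if cp < 65536 then 3 # up to U+FFFF
    else 4;

  # Continuation byte 10xxxxxx carrying the lowest six bits of x.
  low6 = x: 128 + builtins.bitAnd x 63;

  # Encode one codepoint as a list of byte values (assumed representation).
  encodeBytes = cp:
    let n = byteCount cp; in
    if cp >= 55296 && cp <= 57343 # U+D800..U+DFFF, reserved for UTF-16
    then throw "surrogate codepoints can't be encoded as UTF-8"
    else if n == 1 then [ cp ]
    else if n == 2 then [ (192 + cp / 64) (low6 cp) ]
    else if n == 3 then [ (224 + cp / 4096) (low6 (cp / 64)) (low6 cp) ]
    else [ (240 + cp / 262144) (low6 (cp / 4096)) (low6 (cp / 64)) (low6 cp) ];
in
{
  euro = encodeBytes 8364; # U+20AC
  surrogateFails = !(builtins.tryEval (encodeBytes 55296)).success;
}
```

Evaluated with `nix-instantiate --eval --strict`, `euro` should come out as `[ 226 130 172 ]` (the bytes E2 82 AC) and `surrogateFails` as `true`.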
2021-11-25 | r/3092 refactor(sterni/nix/utf8): let wellFormedByte check first byte | sterni | 1 | -17/+14
Previously we would check the first byte only when trying to figure out the predicate for the second byte. If the first byte was invalid, we'd then throw with a helpful error message. However this made wellFormedByte a very weird function.

At the expense of doing the same check twice, we now check the first byte when it is first passed, and always return a boolean.

Change-Id: I32ab6051c844711849e5b4a115e2511b53682baa
2021-11-25 | r/3091 feat(sterni/nix/utf8): implement UTF-8 encoding | sterni | 1 | -2/+73
This implementation is still a bit rough as it doesn't check if the produced string is valid UTF-8, which may happen if an invalid Unicode codepoint is passed.

Change-Id: Ibaa91dafa8937142ef704a175efe967b62e3ee7b
2021-11-25 | r/3090 chore(sterni/nix/utf8): remove decodeSafe | sterni | 1 | -14/+0
This is not really used anywhere and kind of useless. A better decodeSafe would never return null and instead make use of replacement characters to represent invalid bytes in the input.

Change-Id: Ib4111529bf0e472dbfa720a5d0b939c2d2511de5
2021-11-23 | r/3086 feat(sterni/nix/utf8): allow decoding the empty string | sterni | 1 | -2/+2
Change-Id: I8de9cd28c822ac5befbcd16e118440cd13cd86e9
2021-11-23 | r/3085 refactor(sterni/nix/utf8): use genericClosure for decoding iteration | sterni | 1 | -23/+46
builtins.genericClosure is quite a powerful (and undocumented) Nix primop: It repeatedly applies a function to values it produces and collects them into a list. Additionally, individual results can be identified via a key attribute.

Since genericClosure only ever creates a single list value internally, we can eliminate a huge performance bottleneck when building a list in a recursive algorithm: list concatenation. Because Nix needs to copy the entire chunk of memory used internally to represent the list, building big lists one element at a time grinds Nix to a halt.

After rewriting decode using genericClosure, decoding the LaTeX source of my 20 page term paper now takes 2s instead of 14min.

Change-Id: I33847e4e7dd95d7f4d78ac83eb0d74a9867bfe80
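As a rough illustration of the difference (not the decoder itself), the sketch below builds the same list twice: once with builtins.genericClosure and once by appending one element at a time. `range` and `naive` are hypothetical names chosen for this example.

```nix
let
  # Collect the integers 0 .. n-1 via genericClosure.
  range = n:
    map (item: item.key) (builtins.genericClosure {
      # Items to start from; every item needs a `key` attribute,
      # which genericClosure uses to recognise items it has already seen.
      startSet = if n > 0 then [{ key = 0; }] else [ ];
      # Called once per newly discovered item; returns the follow-up items.
      operator = item:
        if item.key + 1 < n
        then [{ key = item.key + 1; }]
        else [ ];
    });

  # For contrast: recursion that copies the accumulated list on every step.
  naive = n:
    let go = i: acc: if i >= n then acc else go (i + 1) (acc ++ [ i ]);
    in go 0 [ ];
in
{
  closure = range 5;
  concat = naive 5;
}
```

Both attributes should evaluate to `[ 0 1 2 3 4 ]`; the difference only shows up in run time once the list gets large.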
2021-03-05 | r/2270 feat(users/sterni/nix/utf8): pure nix utf-8 decoder | sterni | 1 | -0/+208
users.sterni.nix.utf8 implements UTF-8 decoding in pure nix. We implement the decoding as a simple state machine which is fed one byte at a time. Decoding whole strings is possible by calling step repeatedly. This is done in decode, which uses builtins.foldl' to get around recursion restrictions and a neat trick using builtins.deepSeq, which puck showed me, that limits the size of the thunks in a foldl' (which can also cause a stack overflow).

This makes decoding arbitrarily large UTF-8 files into codepoints using nix theoretically possible, but it is not really practical: Decoding a 36KB LaTeX file I had lying around takes ~160s on my laptop.

Change-Id: Iab8c973dac89074ec280b4880a7408e0b3d19bc7
Reviewed-on: https://cl.tvl.fyi/c/depot/+/2590
Tested-by: BuildkiteCI
Reviewed-by: sterni <sternenseemann@systemli.org>
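The foldl' plus deepSeq pattern might look roughly like this. It is a minimal sketch under assumptions, not the decoder's actual state machine: `step` here is a stand-in that merely counts and sums its inputs, and `strictStep` is a hypothetical name.

```nix
let
  # Stand-in step function: the state is an attribute set whose fields
  # would otherwise remain unevaluated thunks referring to older states.
  step = state: x: {
    count = state.count + 1;
    sum = state.sum + x;
  };

  # foldl' only evaluates the accumulator to weak head normal form, so
  # the fields are forced explicitly after every step with deepSeq.
  strictStep = state: x:
    let next = step state x;
    in builtins.deepSeq next next;

  numbers = builtins.genList (i: i + 1) 100000; # 1 .. 100000
in
builtins.foldl' strictStep { count = 0; sum = 0; } numbers
```

This should evaluate to `{ count = 100000; sum = 5000050000; }`. With plain `step` in place of `strictStep`, each field would grow into a chain of nested thunks that is only evaluated at the very end, which may overflow the stack for long inputs.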