Diffstat (limited to 'users/Profpatsch/blog/notes')
-rw-r--r-- | users/Profpatsch/blog/notes/an-idealized-conflang.md | 298
-rw-r--r-- | users/Profpatsch/blog/notes/preventing-oom.md | 33
-rw-r--r-- | users/Profpatsch/blog/notes/rust-string-conversions.md | 53
3 files changed, 384 insertions, 0 deletions
diff --git a/users/Profpatsch/blog/notes/an-idealized-conflang.md b/users/Profpatsch/blog/notes/an-idealized-conflang.md
new file mode 100644
index 000000000000..5c6b39f6e81b
--- /dev/null
+++ b/users/Profpatsch/blog/notes/an-idealized-conflang.md
@@ -0,0 +1,298 @@
+tags: netencode, json
+date: 2022-03-31
+certainty: likely
+status: initial
+title: An idealized Configuration Language
+
+# An Idealized Configuration Language
+
+JSON brought us one step closer to what an idealized configuration language is,
+which I define as “data, stripped of all externalities of the system it is working in”.
+
+Specifically, JSON is very close to what I consider the minimal properties needed to represent structured data.
+
+## A short history, according to me
+
+In the beginning, Lisp defined s-expressions as a stand-in for an actual syntax.
+Then, people figured out that it’s also a way to represent structured data.
+It has scalars, which can be nested into lists, recursively.
+
+```
+(this is (a (list) (of lists)))
+```
+
+This provides the first three rules of our idealized language:
+
+1. A **scalar** is a primitive value that is domain-specific.
+   We can assume a bunch of bytes here, or a text or an integer.
+
+2. A **list** gives an ordering to `0..n` (or `1..n`) values
+
+3. Both a scalar and a list are the *same kind* of “thing” (from here on called **value**);
+   lists can be created from arbitrary values *recursively*
+   (for example scalars, or lists of scalars and other lists)
+
+
+Later, ASN.1 came and had the important insight that the same idealized data structure
+can be represented in different fashions,
+for example as a binary-efficient version and a human-readable format.
+
+Then, XML “graced” the world for a decade or two, and the main lesson from it was
+that you don’t want to mix markup languages and configuration languages,
+and that you don’t want a committee to design these things.
+
+---
+
+In the meantime, Brendan Eich designed JavaScript.
+Its prototype-based object system
+arguably stripped down the rituals of existing OO-systems.
+Douglas Crockford later extracted the object format (minus functions) into a syntax, and we got JSON.
+
+```
+{
+  "foo": [
+    { "nested": "attrs" },
+    "some text"
+  ],
+  "bar": 42
+}
+```
+
+JSON adds another fundamental idea into the mix:
+
+4. **Records** are unordered collections of `name`/`value` pairs.
+   A `name` is defined to be a Unicode string, that is, a semantic descriptor of the nested `value`.
+
+Unfortunately, the JSON syntax does not actually specify any semantics of records (`objects` in JSON lingo);
+in particular, it does not mention what the meaning is if a `name` appears twice in one record.
+
+If records can have multiple entries with the same `name`, suddenly ordering becomes important!
+But wait, remember that earlier we defined *lists* to impose ordering on values.
+So in order to rectify that problem, we say that
+
+5. A `name` can only appear in a record *once*; names must be unique.
+
+This is the current state of the programming community at large,
+where most “modern” configuration languages basically use a version of the JSON model
+as their underlying data structure. (However, not all of them use the same version.)
+
+## Improving JSON’s data model
+
+We are not yet at the final “idealized” configuration language, though.
+
+Modern languages like Standard ML define their data types as a mixture of
+
+* *records* (“structs” in the C lingo)
+* and *sums* (which you can think about as enums that can hold more `value`s inside them)
+
+This allows us to express the common pattern where some fields in a record are only meaningful
+if another field, the so-called `tag` field, is set to a specific value.
+
+An easy example: a request can fail with an error message or succeed with a result.
+
+You could model that as
+
+```
+{
+  "was_error": true,
+  "error_message": "there was an error"
+}
+```
+
+or
+
+```
+{
+  "was_error": false,
+  "result": 42
+}
+```
+
+in your JSON representation.
+
+But in an ML-like language (like, for example, Rust), you would instead model it as
+
+```
+type RequestResult
+  = Error { error_message: String }
+  | Success { result: i64 }
+```
+
+where the distinction between `Error` and `Success` makes it clear that `error_message` and `result`
+only exist in one of these cases, not the other.
+
+We *can* encode exactly that idea into JSON in multiple ways, but there is no “blessed” way.
+
+For example, another way to encode the above would be
+
+```
+{
+  "Error": {
+    "error_message": "there was an error"
+  }
+}
+```
+
+and
+
+```
+{
+  "Success": {
+    "result": 42
+  }
+}
+```
+
+Particularly notice the difference between the language representation, where the type is “closed” (only `Success` or `Error` can happen),
+and the data representation, where the type is “open” (more cases could potentially exist).
+
+This is an important differentiation from a type system:
+Our idealized configuration language just gives more structure to a bag of data;
+it does not restrict which value can be where.
+Think of a value in a unityped language, like Python.
+
+
+So far we have the notion of
+
+1. a scalar (a primitive)
+2. a list (ordering on values)
+3. a record (unordered collection of named values)
+
+and in order to get the “open” `tag`ged enumeration values, we introduce
+
+4. a `tag`, which gives a name to a value
+
+We can then redefine `record` to mean “an unordered collection of `tag`ged values”,
+which further reduces the number of concepts needed.
+
+And that’s it, this is the full idealized configuration language.
+
+
+## Some examples of data modelling with tags
+
+This is all well and good, but what does it look like in practice?
+
+For these examples I will be using JSON with a new `< "tag": value >` syntax
+to represent `tag`s.
+
+From a compatibility standpoint, `tag`s (or sum types) have dual properties to record types.
+
+With a record, when you have a producer that *adds* a field to it, the consumer will still be able to handle the record (provided the semantics of the existing fields are not changed by the new field).
+
+With a tag, *removing* a tag from the producer means that the consumer will still be able to handle whatever the producer sends. It might do one “dead” check on the removed `tag`, but can still handle the remaining ones just fine.
+
+<!-- TODO: some illustration here -->
+
+An example of how that is applied in practice is that in `protobuf3`, fields of a record are *always* optional fields.
+
+We can model optional fields by wrapping them in `< "Some": value >` or `< "None": {} >` (where the actual value of the `None` is ignored or always an empty record).
+
+So a protobuf with the fields `foo: int` and `bar: string` has to be parsed by the receiver as containing *four* possibilities:
+
+|№|foo|bar|
+|--:|---|---|
+|1|`<"None":{}>`|`<"None":{}>`|
+|2|`<"Some":42>`|`<"None":{}>`|
+|3|`<"None":{}>`|`<"Some":"x">`|
+|4|`<"Some":42>`|`<"Some":"x">`|
+
+Now, iff the receiver actually handles all four possibilities
+(and doesn’t just crash if a field is not set, as customary in billion-dollar-mistake languages),
+it’s easy to see how removing a field from the producer is semantically equal to always setting it to `<"None":{}>`.
+Since all receivers should be ready to receive `None` for every field, this provides a simple forward-compatibility scheme.
+
+We can abstract this to any kind of tag value:
+If you start with “more” tags, you give yourself space to remove them later without breaking compatibility, typically called “forward compatibility”.
+
+
+## To empty list/record or not to
+
+Something to think about is whether records and lists should be defined
+to always contain at least one element.
+
+As it stands, JSON has multiple ways of expressing the “empty value”:
+
+* `null`
+* `[]`
+* `{}`
+* `""`
+* *leave out the field*
+
+and two of those come from the possibility of having empty structured values.
+
+## Representations of this language
+
+This line of thought originally fell out of me designing [`netencode`](https://code.tvl.fyi/tree/users/Profpatsch/netencode/README.md)
+as a small human-debuggable format for pipeline serialization.
+
+In addition to the concepts mentioned here (especially tags),
+it provides a better set of scalars than JSON (specifically arbitrary bytestrings),
+but it cannot practically be written or modified by hand,
+which might be a good thing depending on how you look at it.
+
+---
+
+The way that is compatible with the rest of the ecosystem is probably to use a subset of JSON
+to represent our idealized language.
+
+There are multiple ways of encoding tags in JSON, each of which has its pros and cons.
+
+The most common is probably the “tag field” variant, where the tag is pulled into the nested record:
+
+```
+{
+  "_tag": "Success",
+  "result": 42
+}
+```
+
+This has the advantage that people know how to deal with it and that it’s easy to “just add another field”,
+plus it is backward-compatible when you had a record in the first place.
+
+It has multiple disadvantages, however:
+
+* If your value wasn’t a record (e.g. an int) before, you have to put it in a record and assign an arbitrary name to its field
+* People are not forced to “unwrap” the tag first, so they are going to forget to check it
+* The magic “_tag” name cannot be used by any of the record’s fields
+
+
+An in-between version of this with fewer downsides is to always push a JSON record onto the stack:
+
+```
+{
+  "tag": "Success",
+  "value": {
+    "result": 42
+  }
+}
+```
+
+This makes it harder for people to miss checking the `tag`, but still possible of course.
+It also makes it easily possible to inspect the contents of `value` without knowing the
+exhaustive list of `tag`s, which can be useful in practice (though often not sound!).
+It also gets rid of the “_tag” field name clash problem.
+
+Disadvantages:
+
+* Breaks the backwards-compatibility with an existing record-based approach if you want to introduce `tag`s
+* Verbosity of representation
+* Hard to distinguish a record with the `tag` and `value` fields from a `tag`ged value (though you know the type layout of your data on a higher level, don’t you? ;) )
+
+
+The final, “most pure” representation is the one I gave in the original introduction:
+
+```
+{
+  "Success": {
+    "result": 42
+  }
+}
+```
+
+Now you *have* to match on the `tag` name first, before you can actually access your data,
+and it’s less verbose than the above representation.
+
+Disadvantages:
+
+* You also have to *know* what `tag`s to expect; it’s harder to query because you need to extract the keys and values from the dict and then take the first one.
+* Doing a “tag backwards compat” check is harder,
+  because you can’t just check whether `_tag` or `tag`/`value` are the keys in the dict.
diff --git a/users/Profpatsch/blog/notes/preventing-oom.md b/users/Profpatsch/blog/notes/preventing-oom.md
new file mode 100644
index 000000000000..59ea4f747700
--- /dev/null
+++ b/users/Profpatsch/blog/notes/preventing-oom.md
@@ -0,0 +1,33 @@
+tags: linux
+date: 2020-01-25
+certainty: likely
+status: initial
+title: Preventing out-of-memory (OOM) errors on Linux
+
+# Preventing out-of-memory (OOM) errors on Linux
+
+I’ve been running out of memory more and more often lately. I don’t use any swap space because I am of the opinion that 16GB of memory should be sufficient for most daily and professional tasks. That is generally true; however, sometimes I have a runaway process filling my memory. Emacs is very good at this, for example: it is prone to filling your RAM when you open JSON files with very long lines.
+
+In theory, the kernel OOM killer should come in and save the day, but the Linux OOM killer is notorious for being extremely … conservative. It will try to free every internal structure it can before even thinking about touching any userspace processes. By that point, the desktop has usually been unresponsive for minutes.
+
+Luckily the kernel provides memory statistics for the whole system, as well as for single processes, and the [`earlyoom`](https://github.com/rfjakob/earlyoom) tool uses those to keep memory usage under a certain limit. It will start killing processes, “heaviest” first, until the given upper memory limit is satisfied again.
+
+On NixOS, I set:
+
+```nix
+{
+  services.earlyoom = {
+    enable = true;
+    freeMemThreshold = 5; # <5% free
+  };
+}
+```
+
+and after activation, this simple test shows whether the daemon is working:
+
+```shell
+$ tail /dev/zero
+fish: “tail /dev/zero” terminated by signal SIGTERM (Polite quit request)
+```
+
+`tail /dev/zero` searches for the last line of the file `/dev/zero`, and since it cannot know that there is no next line and no end to the stream of `\0` this file produces, it will fill the RAM as quickly as physically possible. Before it can fill it completely, `earlyoom` recognizes that the limit was breached, singles out the `tail` command as the process using the most memory, and sends it a `SIGTERM`.
diff --git a/users/Profpatsch/blog/notes/rust-string-conversions.md b/users/Profpatsch/blog/notes/rust-string-conversions.md
new file mode 100644
index 000000000000..99071ef9d370
--- /dev/null
+++ b/users/Profpatsch/blog/notes/rust-string-conversions.md
@@ -0,0 +1,53 @@
+# Converting between different String types in Rust
+
+```
+let s: String = ...
+let st: &str = ...
+let u: &[u8] = ...
+let b: [u8; 3] = *b"foo"
+let v: Vec<u8> = ...
+let os: OsString = ...
+let ost: &OsStr = ...
+
+From        To       Use                                     Comment
+----        --       ---                                     -------
+&str     -> String   String::from(st)
+&str     -> &[u8]    st.as_bytes()
+&str     -> Vec<u8>  st.as_bytes().to_owned()                via &[u8]
+&str     -> &OsStr   OsStr::new(st)
+
+String   -> &str     &s                                      alt. s.as_str()
+String   -> &[u8]    s.as_bytes()
+String   -> Vec<u8>  s.into_bytes()
+String   -> OsString OsString::from(s)
+
+&[u8]    -> &str     str::from_utf8(u).unwrap()
+&[u8]    -> String   String::from_utf8(u.to_vec()).unwrap()  from_utf8 needs an owned Vec<u8>
+&[u8]    -> Vec<u8>  u.to_owned()
+&[u8]    -> &OsStr   OsStr::from_bytes(u)                    use std::os::unix::ffi::OsStrExt;
+
+[u8; 3]  -> &[u8]    &b[..]                                  byte literal
+[u8; 3]  -> &[u8]    "foo".as_bytes()                        alternative via utf8 literal
+
+Vec<u8>  -> &str     str::from_utf8(&v).unwrap()             via &[u8]
+Vec<u8>  -> String   String::from_utf8(v).unwrap()
+Vec<u8>  -> &[u8]    &v
+Vec<u8>  -> OsString OsString::from_vec(v)                   use std::os::unix::ffi::OsStringExt;
+
+&OsStr   -> &str     ost.to_str().unwrap()
+&OsStr   -> String   ost.to_os_string().into_string()        via OsString
+                       .unwrap()
+&OsStr   -> Cow<str> ost.to_string_lossy()                   Unicode replacement characters
+&OsStr   -> OsString ost.to_os_string()
+&OsStr   -> &[u8]    ost.as_bytes()                          use std::os::unix::ffi::OsStrExt;
+
+OsString -> String   os.into_string().unwrap()               returns original OsString on failure
+OsString -> &str     os.to_str().unwrap()
+OsString -> &OsStr   os.as_os_str()
+OsString -> Vec<u8>  os.into_vec()                           use std::os::unix::ffi::OsStringExt;
+```
+
+
+## Source
+
+Original source is [this document on Pastebin](https://web.archive.org/web/20190710121935/https://pastebin.com/Mhfc6b9i)
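
A few rows of the conversion table in `rust-string-conversions.md` can be exercised in a short program. This is a minimal sketch and not part of the commit: the helper names (`bytes_to_string`, `os_to_string_lossy`, `string_to_os_bytes`) are made up for illustration, and because it relies on the `std::os::unix` extension traits it is assumed to be built on a Unix-like target. It returns `Option` instead of calling `.unwrap()`, so invalid UTF-8 shows up as `None` rather than as a panic.

```rust
use std::ffi::{OsStr, OsString};
use std::os::unix::ffi::{OsStrExt, OsStringExt};

/// &[u8] -> String. `String::from_utf8` takes an owned Vec<u8>,
/// so the borrowed slice is copied first. (Hypothetical helper.)
fn bytes_to_string(u: &[u8]) -> Option<String> {
    String::from_utf8(u.to_vec()).ok()
}

/// &OsStr -> String, substituting U+FFFD for anything that is not valid UTF-8.
/// (Hypothetical helper.)
fn os_to_string_lossy(ost: &OsStr) -> String {
    ost.to_string_lossy().into_owned()
}

/// String -> OsString -> Vec<u8>, using the Unix-only `OsStringExt::into_vec`.
/// (Hypothetical helper.)
fn string_to_os_bytes(s: String) -> Vec<u8> {
    OsString::from(s).into_vec()
}

fn main() {
    let valid: &[u8] = b"foo";
    // "fo" followed by 0xff, a byte that never occurs in valid UTF-8.
    let invalid: [u8; 3] = [0x66, 0x6f, 0xff];

    // Valid UTF-8 converts; invalid UTF-8 is reported instead of panicking.
    assert_eq!(bytes_to_string(valid), Some(String::from("foo")));
    assert_eq!(bytes_to_string(&invalid), None);

    // On Unix an OsStr may hold arbitrary bytes; the lossy conversion
    // replaces the invalid byte with the Unicode replacement character.
    let ost: &OsStr = OsStr::from_bytes(&invalid);
    assert_eq!(os_to_string_lossy(ost), "fo\u{FFFD}");

    // A String round-trips through OsString to the same bytes.
    assert_eq!(string_to_os_bytes(String::from("foo")), b"foo".to_vec());

    println!("all conversions behaved as expected");
}
```

Whether to return `Option`, return `Result`, or keep the `.unwrap()` calls from the table depends on whether the input is trusted; the table's `.unwrap()` form is the shortest way to show which conversions can fail at all.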