feat(users/Profpatsch/blog): Idealized Conflang blogpost r/3981

Change-Id: Iec01bc2a1a5ed5d21627ba1f9554ec4b3a8fc4ad
Reviewed-on: https://cl.tvl.fyi/c/depot/+/5480
Tested-by: BuildkiteCI
Reviewed-by: Profpatsch <mail@profpatsch.de>
author: Profpatsch <mail@profpatsch.de> 2022-04-19T13·07+0200
committer: Profpatsch <mail@profpatsch.de> 2022-04-20T15·12+0000
commit: 535ad8732a688bf6ff4c48283d89dae00262a719 (patch)
tree: 78cc020e0ac1cf2e1d7a124375645ec73fd68060 /users/Profpatsch/blog
parent: ea871fdcd754828c6d47dfd199a0d9b702d33f19 (diff)
2 files changed, 311 insertions, 4 deletions
diff --git a/users/Profpatsch/blog/default.nix b/users/Profpatsch/blog/default.nix
index 44a7a21aa3..9848d83c56 100644
--- a/users/Profpatsch/blog/default.nix
+++ b/users/Profpatsch/blog/default.nix
@@ -26,11 +26,11 @@ let
   # /notes/*
   notes = [
     {
-      route = [ "notes" "preventing-oom" ];
-      name = "Preventing out-of-memory (OOM) errors on Linux";
+      route = [ "notes" "an-idealized-conflang" ];
+      name = "An Idealized Configuration Language";
       page = { cssFile }: markdownToHtml {
-        name = "preventing-oom";
-        markdown = ./notes/preventing-oom.md;
+        name = "an-idealized-conflang";
+        markdown = ./notes/an-idealized-conflang.md;
         inherit cssFile;
       };
     }
@@ -43,6 +43,15 @@ let
         inherit cssFile;
       };
     }
+    {
+      route = [ "notes" "preventing-oom" ];
+      name = "Preventing out-of-memory (OOM) errors on Linux";
+      page = { cssFile }: markdownToHtml {
+        name = "preventing-oom";
+        markdown = ./notes/preventing-oom.md;
+        inherit cssFile;
+      };
+    }
   ];
 
   projects = [
diff --git a/users/Profpatsch/blog/notes/an-idealized-conflang.md b/users/Profpatsch/blog/notes/an-idealized-conflang.md
new file mode 100644
index 0000000000..5c6b39f6e8
--- /dev/null
+++ b/users/Profpatsch/blog/notes/an-idealized-conflang.md
@@ -0,0 +1,298 @@
+tags: netencode, json
+date: 2022-03-31
+certainty: likely
+status: initial
+title: An idealized Configuration Language
+
+# An Idealized Configuration Language
+
+JSON brought us one step closer to what an idealized configuration language is,
+which I define as “data, stripped of all externalities of the system it is working in”.
+
+Specifically, JSON is very close to what I consider the minimal properties to represent structured data.
+
+## A short history, according to me
+
+In the beginning, Lisp defined s-expressions as a stand-in for an actual syntax.
+Then, people figured out that it’s also a way to represent structured data.
+It has scalars, which can be nested into lists, recursively.
+
+```
+(this is (a (list) (of lists)))
+```
+
+This provides the first three rules of our idealized language:
+
+1. A **scalar** is a primitive value that is domain-specific.
+   We can assume a bunch of bytes here, or a text or an integer.
+   
+2. A **list** gives an ordering to `0..n` (or `1..n`) values
+   
+3. Both a scalar and a list are the *same kind* of “thing” (from here on called **value**),
+   lists can be created from arbitrary values *recursively*
+   (for example scalars, or lists of scalars and other lists)
+
+
+Later, ASN.1 came and had the important insight that the same idealized data structure
+can be represented in different fashions,
+for example as a binary-efficient version and a human-readable format.
+
+Then, XML “graced” the world for a decade or two, and the main lesson from it was
+that you don’t want to mix markup languages and configuration languages,
+and that you don’t want a committee to design these things.
+
+---
+
+In the meantime, Brendan Eich designed JavaScript. Its prototype-based object system
+arguably stripped down the rituals of existing OO-systems.
+Douglas Crockford later extracted the object format (minus functions) into a syntax, and we got JSON.
+
+```
+{
+  "foo": [
+    { "nested": "attrs" },
+    "some text"
+  ],
+  "bar": 42
+}
+```
+
+JSON adds another fundamental idea into the mix:
+
+4. **Records** are unordered collections of `name`/`value` pairs.
+   A `name` is defined to be a unicode string, so a semantic descriptor of the nested `value`.
+
+Unfortunately, the JSON syntax does not actually specify any semantics of records (`objects` in JSON lingo);
+in particular, it does not say what it means if a `name` appears twice in one record.
+
+If records can have multiple entries with the same `name`, suddenly ordering becomes important!
+But wait, remember that earlier we defined *lists* as the thing that imposes ordering on values.
+So in order to rectify that problem, we say that
+
+5. A `name` can only appear in a record *once*, names must be unique.
+
+This is the current state of the programming community at large,
+where most “modern” configuration languages basically use a version of the JSON model
+as their underlying data structure. (However not all of them use the same version.)
+
+## Improving JSON’s data model
+
+We are not yet at the final “idealized” configuration language, though.
+
+Modern languages like Standard ML define their data types as a mixture of 
+
+* *records* (“structs” in the C lingo)
+* and *sums* (which you can think about as enums that can hold more `value`s inside them)
+
+This allows expressing the common pattern where some fields in a record are only meaningful
+if another field—the so-called `tag`-field—is set to a specific value.
+
+An easy example: a request can either fail with an error message or succeed with a result.
+
+You could model that as 
+
+```
+{
+  "was_error": true,
+  "error_message": "there was an error"
+}
+```
+
+or
+
+```
+{
+  "was_error": false,
+  "result": 42
+}
+```
+
+in your JSON representation.
+
+But in an ML-like language (like, for example, Rust), you would instead model it as
+
+```
+type RequestResult 
+  = Error { error_message: String }
+  | Success { result: i64 }
+```
+
+where the distinction between `Error` and `Success` makes it clear that `error_message` and `result`
+only exist in one of these cases, not the other.
+
+We *can* encode exactly that idea into JSON in multiple ways, but there is no “blessed” way.
+
+For example, another way to encode the above would be
+
+```
+{ 
+  "Error": { 
+    "error_message": "there was an error"
+  }
+}
+```
+
+and
+
+```
+{ 
+  "Success": { 
+    "result": 42
+  }
+}
+```
+
+Particularly notice the difference between the language representation, where the type is “closed” (only `Success` or `Error` can happen),
+and the data representation where the type is “open”, more cases could potentially exist.
+
+This is an important differentiation from a type system:
+Our idealized configuration language just gives more structure to a bag of data,
+it does not restrict which value can be where.
+Think of a value in an unityped language, like Python.
+
+
+So far we have the notion of 
+
+1. a scalar (a primitive)
+2. a list (ordering on values)
+3. a record (unordered collection of named values)
+
+and in order to get the “open” `tag`ged enumeration values, we introduce
+
+4. a `tag`, which gives a name to a value
+
+We can then redefine `record` to mean “an unordered collection of `tag`ged values”,
+which further reduces the amount of concepts needed.
+
+And that’s it, this is the full idealized configuration language.
+
+
+## Some examples of data modelling with tags
+
+This is all well and good, but what does it look like in practice?
+
+For these examples I will be using JSON with a new `< "tag": value >` syntax
+to represent `tag`s.
+
+From a compatibility standpoint, `tag`s (or sum types) have dual properties to record types.
+
+With a record, when you have a producer that *adds* a field to it, the consumer will still be able to handle the record (provided the semantics of the existing fields is not changed by the new field).
+
+With a tag, *removing* a tag from the producer will mean that the consumer will still be able to handle the tag. It might do one “dead” check on the removed `tag`, but can still handle the remaining ones just fine.
+
+<!-- TODO: some illustration here -->
+    
+An example of how that is applied in practice is that in `protobuf3`, fields of a record are *always* optional fields.
+
+We can model optional fields by wrapping them in `< "Some": value >` or `< "None": {} >` (where the actual value of the `None` is ignored or always an empty record).
+
+So a protobuf with the fields `foo: int` and `bar: string` has to be parsed by the receiver as containing *four* possibilities:
+
+|№|foo|bar|
+|--:|---|---|
+|1|`<"None":{}>`|`<"None":{}>`|
+|2|`<"Some":42>`|`<"None":{}>`|
+|3|`<"None":{}>`|`<"Some":"x">`|
+|4|`<"Some":42>`|`<"Some":"x">`|
+
+Now, iff the receiver actually handles all four possibilities
+(and doesn’t just crash if a field is not set, as customary in million-dollar-mistake languages),
+it’s easy to see how removing a field from the producer is semantically equal to always setting it to `<"None":{}>`.
+Since all receivers should be ready to receive `None` for every field, this provides a simple forward-compatibility scheme.
+
+We can abstract this to any kind of tag value:
+If you start with “more” tags, you give yourself space to remove them later without breaking compatibility, typically called “forward compatibility”.
+
+
+## To empty list/record or not to
+
+Something to think about is whether records and lists should be defined
+to always contain at least one element.
+
+As it stands, JSON has multiple ways of expressing the “empty value”:
+
+* `null`
+* `[]`
+* `{}`
+* `""`
+* *leave out the field*
+
+and two of those come from the possibility of having empty structured values.
+
+## Representations of this language
+
+This line of thought originally fell out of me designing [`netencode`](https://code.tvl.fyi/tree/users/Profpatsch/netencode/README.md)
+as a small human-debuggable format for pipeline serialization.
+
+In addition to the concepts mentioned here (especially tags),
+it provides a better set of scalars than JSON (specifically arbitrary bytestrings),
+but it cannot practically be written or modified by hand,
+which might be a good thing depending on how you look at it.
+
+---
+
+The way that is compatible with the rest of the ecosystem is probably to use a subset of JSON
+to represent our idealized language.
+
+There are multiple ways of encoding tags in JSON, each of which has its pros and cons.
+
+The most common is probably the “tag field” variant, where the tag is pulled into the nested record:
+
+```
+{
+  "_tag": "Success",
+  "result": 42
+}
+```
+
+This has the advantage that people know how to deal with it and that it’s easy to “just add another field”,
+plus it is backward-compatible when you had a record in the first place.
+
+It has multiple disadvantages however:
+
+* If your value wasn’t a record (e.g. an int) before, you have to put it in a record and assign an arbitrary name to its field
+* People are not forced to “unwrap” the tag first, so they are going to forget to check it
+* The magic “_tag” name cannot be used by any of the record’s fields
+
+
+An in-between version of this with fewer downsides is to always push a JSON record onto the stack:
+
+```
+{
+  "tag": "Success",
+  "value": {
+    "result": 42
+  }
+}
+```
+
+This makes it harder for people to miss checking the `tag`, but still possible of course.
+It also makes it easily possible to inspect the contents of `value` without knowing the
+exhaustive list of `tag`s, which can be useful in practice (though often not sound!).
+It also gets rid of the “_tag” field name clash problem.
+
+Disadvantages:
+
+* Breaks the backwards-compatibility with an existing record-based approach if you want to introduce `tag`s
+* Verbosity of representation
+* Hard to distinguish a record with the `tag` and `value` fields from a `tag`ged value (though you know the type layout of your data on a higher level, don’t you? ;) )
+
+
+The final, “most pure” representation is the one I gave in the original introduction:
+
+```
+{
+  "Success": {
+    "result": 42
+  }
+}
+```
+
+Now you *have* to match on the `tag` name first, before you can actually access your data,
+and it’s less verbose than the above representation.
+
+Disadvantages:
+
+* You also have to *know* what `tag`s to expect; it’s harder to query because you need to extract the keys and values from the dict and then take the first one.
+* Doing a “tag backwards compat” check is harder,
+  because you can’t just check whether `_tag` or `tag`/`value` are the keys in the dict.