From d100c1f49f93cff9b0643c842cd2ea5e878b1809 Mon Sep 17 00:00:00 2001
From: William Carroll
Date: Mon, 16 May 2022 12:05:19 -0700
Subject: feat(wpcarro/ava): Support earlyoom

Strange start to my Monday: I spent ~2h debugging my hanging NixOS machine.
I'm not sure I made any changes to my configuration to trigger this, and I was
finding it hard to reproduce:
- graphical X sessions hung (once when opening Chrome)
- TTYs hung (during `nix-build` and `rebuild-system`)

Per kn's recommendation: whenever a system is hanging, see if it's reachable
over the network (e.g. SSH). Since I didn't have my laptop, I downloaded Termius
on my iPhone and used it to mosh into ava, which is a surprisingly nice UX.

I suspect my machine (with only 8GB of RAM) was OOMing, but I'm not certain.
Thanks to grfn I installed `earlyoom`. For more commentary, check out
Profpatsch's blog post about this: https://profpatsch.de/notes/preventing-oom

What went well:
- Thankfully I installed a Matrix client on my iPhone last week, which allowed
  me to troubleshoot with the #tvl folks

Action items:
- I'd like some instrumentation like Prometheus, Loki (`journald`, `dmesg`), so
  that I can accumulate troubleshooting information that isn't destroyed when I
  reboot my machine (which I did half a dozen times today).
- Consider adding `git` metadata to `system.nixos.label` to get more useful
  information in a GRUB/EFI context.

More unknowns:
- Why can't I switch back to EFI (from GRUB) for my bootloader?

Change-Id: Ie2a5a15f5c0ead346d50e331fa2937f8f3453960
Reviewed-on: https://cl.tvl.fyi/c/depot/+/5625
Tested-by: BuildkiteCI
Reviewed-by: wpcarro
Autosubmit: wpcarro
---
 users/wpcarro/nixos/ava/default.nix | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/users/wpcarro/nixos/ava/default.nix b/users/wpcarro/nixos/ava/default.nix
index 267b46fdf5..36529c3550 100644
--- a/users/wpcarro/nixos/ava/default.nix
+++ b/users/wpcarro/nixos/ava/default.nix
@@ -42,6 +42,10 @@ in
   };
 
   services = wpcarro.common.services // {
+    # Check the amount of available memory and free swap a few times per second
+    # and kill the largest process if both are below 10%.
+    earlyoom.enable = true;
+
     tailscale.enable = true;
     openssh.enable = true;
 
-- 
cgit 1.4.1
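
The comment in the hunk describes earlyoom's 10% thresholds. As a rough sketch only (not part of this change), they could be spelled out explicitly through the `freeMemThreshold` and `freeSwapThreshold` options of the NixOS `services.earlyoom` module. The values below are an assumption: they simply restate the 10% figure from the comment, so behavior should match `earlyoom.enable = true` on its own.

```nix
{ ... }:

{
  services.earlyoom = {
    enable = true;

    # Assumed values, restating the 10% figure from the comment in the diff.
    # earlyoom acts only when BOTH available memory and free swap fall below
    # their thresholds (percentages).
    freeMemThreshold = 10;
    freeSwapThreshold = 10;
  };
}
```

On a machine with 8GB of RAM, raising `freeMemThreshold` would make earlyoom step in sooner, at the cost of killing processes that might otherwise have survived.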