From 0ea55c767a5a7dc095d8218a6771190ad754aa9e Mon Sep 17 00:00:00 2001
From: Florian Klink
Date: Wed, 5 Jun 2024 10:37:35 +0200
Subject: docs(tvix/docs/TODO): extend O11Y section

Expand on tvix-tracing crate strategy, add some more context regarding
OTLP and span propagation.

Change-Id: Ice55c116c20aaf60531100465192ce11969551ac
Reviewed-on: https://cl.tvl.fyi/c/depot/+/11750
Autosubmit: flokli
Tested-by: BuildkiteCI
Reviewed-by: Simon Hauser
Reviewed-by: flokli
---
 tvix/docs/src/TODO.md | 39 +++++++++++++++++++++++++++++++++------
 1 file changed, 33 insertions(+), 6 deletions(-)

diff --git a/tvix/docs/src/TODO.md b/tvix/docs/src/TODO.md
index 0fb7d70d40ce..0f1bcee27bc2 100644
--- a/tvix/docs/src/TODO.md
+++ b/tvix/docs/src/TODO.md
@@ -140,9 +140,36 @@ logs etc, but this is something requiring a lot of designing.
  - Some work ongoing on the worker operation parsing (griff, picnoir)
 
 ### O11Y
- - gRPC trace propagation (cl/10532)
- - `tracing-tracy` (cl/10952)
- - `[tracing-]indicatif` for progress/log reporting (floklis stash)
- - unification into `tvix-tracing` crate, currently a lot of boilerplate
-   in `tvix-store` CLI entrypoint, and half of the boilerplate copied over to
-   `tvix-cli`.
+ - `[tracing-]indicatif` for progress/log reporting (cl/11747)
+ - Currently there's a lot of boilerplate in the `tvix-store` CLI entrypoint,
+   and half of the boilerplate copied over to `tvix-cli`.
+   Setup of the tracing things should be unified into the `tvix-tracing` crate,
+   maybe including some of the CLI parameters (@simon).
+   Or maybe drop `--log-level` entirely, and only use `RUST_LOG` env
+   exclusively? `debug`,`trace` level across all crates is a bit useless, and
+   `RUST_LOG` can be much more granular…
+ - The OTLP stack is quite spammy if there's no OTLP collector running on
+   localhost.
+   https://opentelemetry.io/docs/specs/otel/configuration/sdk-environment-variables/
+   mentions a `OTEL_SDK_DISABLED` env var, but it defaults to false, so they
+   suggest enabling OTLP by default.
+   We currently have a `--otlp` cmdline arg which explicitly needs to be set to
+   false to stop it, in line with that "enabled by default" philosophy
+   Do some research if we can be less spammy. While OTLP support is
+   feature-flagged, it should not get in the way too much, so we can actually
+   have it compiled in most of the time.
+ - gRPC trace propagation (cl/10532 + @simon)
+   We need to wire trace propagation into our gRPC clients, so if we collect
+   traces both for the client and server they will be connected.
+ - Fix OTLP sending batches on shutdown.
+   It seems for short-lived CLI invocations we don't end up receiving all spans.
+   Ensure we flush these on ctrl-c, and regular process termination.
+   See https://github.com/open-telemetry/opentelemetry-rust/issues/1395#issuecomment-2045567608
+   for some context.
+
+Later:
+ - Trace propagation for HTTP clients too, using
+   https://www.w3.org/TR/trace-context/ or https://www.w3.org/TR/baggage/,
+   whichever makes more sense.
+   Candidates: nix+http(s) protocol, object_store crates.
+ - (`tracing-tracy` (cl/10952))
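
The trace propagation items in the TODO come down to carrying a trace id and a parent span id across a process boundary, so that client and server spans end up in the same trace. As a rough illustration of what travels on the wire, here is a minimal std-only sketch of the W3C `traceparent` header format (version `00`) referenced by the trace-context link. The `TraceParent` type and its methods are hypothetical names made up for this note and are not part of tvix or of any OpenTelemetry crate; an actual implementation would most likely reuse an existing propagator rather than hand-roll this, and the sketch ignores `tracestate` and every flag bit other than "sampled".

```rust
/// Fields of a W3C `traceparent` header, version 00.
/// Hypothetical illustration type, not an API from tvix or OpenTelemetry.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TraceParent {
    /// 16-byte trace id, shared by every span in the trace.
    pub trace_id: u128,
    /// 8-byte id of the calling span, which becomes the parent on the server.
    pub parent_id: u64,
    /// Lowest bit of the trace-flags field.
    pub sampled: bool,
}

impl TraceParent {
    /// Render as `00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>`.
    /// Only the "sampled" flag bit is preserved in this sketch.
    pub fn to_header(&self) -> String {
        format!(
            "00-{:032x}-{:016x}-{:02x}",
            self.trace_id, self.parent_id, self.sampled as u8
        )
    }

    /// Parse a version-00 header value; returns `None` for anything malformed.
    pub fn parse(header: &str) -> Option<Self> {
        let mut parts = header.trim().split('-');
        let version = parts.next()?;
        let trace_id = parts.next()?;
        let parent_id = parts.next()?;
        let flags = parts.next()?;

        // Version 00 has exactly four dash-separated fields.
        if version != "00" || parts.next().is_some() {
            return None;
        }
        // Field widths are fixed.
        if trace_id.len() != 32 || parent_id.len() != 16 || flags.len() != 2 {
            return None;
        }
        // The format only allows lowercase hex digits.
        let is_lower_hex =
            |s: &str| s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'));
        if ![trace_id, parent_id, flags].iter().all(|s| is_lower_hex(*s)) {
            return None;
        }

        let trace_id = u128::from_str_radix(trace_id, 16).ok()?;
        let parent_id = u64::from_str_radix(parent_id, 16).ok()?;
        let flags = u8::from_str_radix(flags, 16).ok()?;

        // All-zero trace or parent ids are invalid.
        if trace_id == 0 || parent_id == 0 {
            return None;
        }

        Some(Self {
            trace_id,
            parent_id,
            sampled: flags & 0x01 == 0x01,
        })
    }
}
```

A client would attach `to_header()` as the `traceparent` value on an outgoing request (HTTP header or gRPC metadata), and the server would call `parse` on the incoming value and use the result as the parent context for its own spans. Rejecting malformed input with `None` lets the server fall back to starting a fresh trace instead of failing the request.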