author | Florian Klink <flokli@flokli.de> | 2024-06-05T08·37+0200 |
---|---|---|
committer | clbot <clbot@tvl.fyi> | 2024-06-05T08·51+0000 |
commit | 0ea55c767a5a7dc095d8218a6771190ad754aa9e (patch) | |
tree | c20e1c5ce546db0ad8e82c9dd234cd44943d92ff /tvix | |
parent | 41e2fd7fa5b14c94b8385b2ff53ad958b0fd0b55 (diff) |
docs(tvix/docs/TODO): extend O11Y section r/8214
Expand on tvix-tracing crate strategy, add some more context regarding OTLP and span propagation.

Change-Id: Ice55c116c20aaf60531100465192ce11969551ac
Reviewed-on: https://cl.tvl.fyi/c/depot/+/11750
Autosubmit: flokli <flokli@flokli.de>
Tested-by: BuildkiteCI
Reviewed-by: Simon Hauser <simon.hauser@helsinki-systems.de>
Reviewed-by: flokli <flokli@flokli.de>
Diffstat (limited to 'tvix')
-rw-r--r-- | tvix/docs/src/TODO.md | 39 |
1 files changed, 33 insertions, 6 deletions
diff --git a/tvix/docs/src/TODO.md b/tvix/docs/src/TODO.md
index 0fb7d70d40ce..0f1bcee27bc2 100644
--- a/tvix/docs/src/TODO.md
+++ b/tvix/docs/src/TODO.md
@@ -140,9 +140,36 @@ logs etc, but this is something requiring a lot of designing.
 - Some work ongoing on the worker operation parsing (griff, picnoir)

 ### O11Y
- - gRPC trace propagation (cl/10532)
- - `tracing-tracy` (cl/10952)
- - `[tracing-]indicatif` for progress/log reporting (floklis stash)
- - unification into `tvix-tracing` crate, currently a lot of boilerplate
- in `tvix-store` CLI entrypoint, and half of the boilerplate copied over to
- `tvix-cli`.
+ - `[tracing-]indicatif` for progress/log reporting (cl/11747)
+ - Currently there's a lot of boilerplate in the `tvix-store` CLI entrypoint,
+ and half of the boilerplate copied over to `tvix-cli`.
+ Setup of the tracing things should be unified into the `tvix-tracing` crate,
+ maybe including some of the CLI parameters (@simon).
+ Or maybe drop `--log-level` entirely, and only use `RUST_LOG` env
+ exclusively? `debug`,`trace` level across all crates is a bit useless, and
+ `RUST_LOG` can be much more granular…
+ - The OTLP stack is quite spammy if there's no OTLP collector running on
+ localhost.
+ https://opentelemetry.io/docs/specs/otel/configuration/sdk-environment-variables/
+ mentions a `OTEL_SDK_DISABLED` env var, but it defaults to false, so they
+ suggest enabling OTLP by default.
+ We currently have a `--otlp` cmdline arg which explicitly needs to be set to
+ false to stop it, in line with that "enabled by default" philosophy
+ Do some research if we can be less spammy. While OTLP support is
+ feature-flagged, it should not get in the way too much, so we can actually
+ have it compiled in most of the time.
+ - gRPC trace propagation (cl/10532 + @simon)
+ We need to wire trace propagation into our gRPC clients, so if we collect
+ traces both for the client and server they will be connected.
+ - Fix OTLP sending batches on shutdown.
+ It seems for short-lived CLI invocations we don't end up receiving all spans.
+ Ensure we flush these on ctrl-c, and regular process termination.
+ See https://github.com/open-telemetry/opentelemetry-rust/issues/1395#issuecomment-2045567608
+ for some context.
+
+Later:
+ - Trace propagation for HTTP clients too, using
+ https://www.w3.org/TR/trace-context/ or https://www.w3.org/TR/baggage/,
+ whichever makes more sense.
+ Candidates: nix+http(s) protocol, object_store crates.
+ - (`tracing-tracy` (cl/10952))
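
For readers unfamiliar with what "trace propagation" means on the wire: the W3C Trace Context spec linked in the patch defines a `traceparent` header that carries the trace id and the caller's span id, so client and server spans end up in the same trace. Below is a minimal, std-only Rust sketch of that header format (version `00` only). The `TraceParent` type and its methods are hypothetical names made up for illustration; this is not tvix code, and a real integration would more likely rely on the propagators shipped with the OpenTelemetry crates than on hand-rolled parsing.

```rust
/// Hypothetical illustration type; not part of tvix or any OpenTelemetry crate.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TraceParent {
    pub trace_id: u128, // 16 bytes, shared by every span in the trace
    pub parent_id: u64, // 8 bytes, the span id of the caller
    pub sampled: bool,  // lowest bit of the trace-flags byte
}

impl TraceParent {
    /// Render as a version-00 header value:
    /// `00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>`.
    /// Only the "sampled" flag is preserved in this sketch.
    pub fn to_header(&self) -> String {
        format!(
            "00-{:032x}-{:016x}-{:02x}",
            self.trace_id,
            self.parent_id,
            if self.sampled { 1u8 } else { 0u8 }
        )
    }

    /// Parse a version-00 header value. Returns `None` for anything malformed.
    pub fn parse(value: &str) -> Option<Self> {
        let parts: Vec<&str> = value.trim().split('-').collect();
        if parts.len() != 4 || parts[0] != "00" {
            return None;
        }
        let (trace, parent, flags) = (parts[1], parts[2], parts[3]);
        if trace.len() != 32 || parent.len() != 16 || flags.len() != 2 {
            return None;
        }
        // The spec requires lowercase hex digits in every field.
        let is_lower_hex =
            |s: &str| s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'));
        if !(is_lower_hex(trace) && is_lower_hex(parent) && is_lower_hex(flags)) {
            return None;
        }
        let trace_id = u128::from_str_radix(trace, 16).ok()?;
        let parent_id = u64::from_str_radix(parent, 16).ok()?;
        let flags = u8::from_str_radix(flags, 16).ok()?;
        // All-zero ids are explicitly invalid.
        if trace_id == 0 || parent_id == 0 {
            return None;
        }
        Some(TraceParent {
            trace_id,
            parent_id,
            sampled: flags & 0x01 == 0x01,
        })
    }

    /// Context to send on an outgoing request: same trace, but the parent is
    /// now the current span (`span_id` must be non-zero to be valid).
    pub fn child(&self, span_id: u64) -> Self {
        TraceParent {
            trace_id: self.trace_id,
            parent_id: span_id,
            sampled: self.sampled,
        }
    }
}

fn main() {
    // Example value taken from the W3C Trace Context specification.
    let incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";
    let ctx = TraceParent::parse(incoming).expect("valid traceparent");
    // A client would attach this value to its outgoing request metadata.
    println!("traceparent: {}", ctx.child(0x1234).to_header());
}
```

The point of the sketch is only that the trace id stays constant across hops while the parent id changes per span; that invariant is what lets a collector stitch client and server spans together.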