author Florian Klink <flokli@flokli.de> 2024-06-05T08·37+0200
committer clbot <clbot@tvl.fyi> 2024-06-05T08·51+0000
commit 0ea55c767a5a7dc095d8218a6771190ad754aa9e
tree c20e1c5ce546db0ad8e82c9dd234cd44943d92ff
parent 41e2fd7fa5b14c94b8385b2ff53ad958b0fd0b55
docs(tvix/docs/TODO): extend O11Y section r/8214
Expand on tvix-tracing crate strategy, add some more context regarding
OTLP and span propagation.

Change-Id: Ice55c116c20aaf60531100465192ce11969551ac
Reviewed-on: https://cl.tvl.fyi/c/depot/+/11750
Autosubmit: flokli <flokli@flokli.de>
Tested-by: BuildkiteCI
Reviewed-by: Simon Hauser <simon.hauser@helsinki-systems.de>
Reviewed-by: flokli <flokli@flokli.de>
-rw-r--r-- tvix/docs/src/TODO.md 39
1 file changed, 33 insertions, 6 deletions
diff --git a/tvix/docs/src/TODO.md b/tvix/docs/src/TODO.md
index 0fb7d70d40..0f1bcee27b 100644
--- a/tvix/docs/src/TODO.md
+++ b/tvix/docs/src/TODO.md
@@ -140,9 +140,36 @@ logs etc, but this is something requiring a lot of designing.
 - Some work ongoing on the worker operation parsing (griff, picnoir)
 
 ### O11Y
- - gRPC trace propagation (cl/10532)
- - `tracing-tracy` (cl/10952)
- - `[tracing-]indicatif` for progress/log reporting (floklis stash)
- - unification into `tvix-tracing` crate, currently a lot of boilerplate
-   in `tvix-store` CLI entrypoint, and half of the boilerplate copied over to
-   `tvix-cli`.
+ - `[tracing-]indicatif` for progress/log reporting (cl/11747)
+ - Currently there's a lot of boilerplate in the `tvix-store` CLI entrypoint,
+   and half of the boilerplate copied over to `tvix-cli`.
+   Setup of the tracing things should be unified into the `tvix-tracing` crate,
+   maybe including some of the CLI parameters (@simon).
+   Or maybe drop `--log-level` entirely and rely on the `RUST_LOG` env var
+   exclusively? A global `debug` or `trace` level across all crates is a bit
+   useless, and `RUST_LOG` can be much more granular…
+ - The OTLP stack is quite spammy if there's no OTLP collector running on
+   localhost.
+   https://opentelemetry.io/docs/specs/otel/configuration/sdk-environment-variables/
+   mentions an `OTEL_SDK_DISABLED` env var, but it defaults to false, so they
+   suggest enabling OTLP by default.
+   We currently have a `--otlp` cmdline arg which explicitly needs to be set to
+   false to stop it, in line with that "enabled by default" philosophy.
+   Do some research if we can be less spammy. While OTLP support is
+   feature-flagged, it should not get in the way too much, so we can actually
+   have it compiled in most of the time.
+ - gRPC trace propagation (cl/10532 + @simon)
+   We need to wire trace propagation into our gRPC clients, so if we collect
+   traces both for the client and server they will be connected.
+ - Fix OTLP sending batches on shutdown.
+   It seems for short-lived CLI invocations we don't end up receiving all spans.
+   Ensure we flush these on ctrl-c, and regular process termination.
+   See https://github.com/open-telemetry/opentelemetry-rust/issues/1395#issuecomment-2045567608
+   for some context.
+
+Later:
+ - Trace propagation for HTTP clients too, using
+   https://www.w3.org/TR/trace-context/ or https://www.w3.org/TR/baggage/,
+   whichever makes more sense.
+   Candidates: nix+http(s) protocol, object_store crates.
+ - (`tracing-tracy` (cl/10952))
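
For the "Later" item on HTTP trace propagation, the following is a minimal, std-only sketch of what carrying a `traceparent` header (https://www.w3.org/TR/trace-context/) between client and server could involve. It is not taken from the Tvix codebase: the `TraceParent` type and its methods are hypothetical names, and the sketch deliberately accepts only version `00` of the header, whereas a real integration would more likely go through the OpenTelemetry propagator machinery.

```rust
/// Hypothetical in-memory form of a version-00 `traceparent` header:
/// `00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TraceParent {
    pub trace_id: [u8; 16],
    pub parent_id: [u8; 8],
    pub flags: u8,
}

/// Decode exactly `N` bytes from lowercase hex; uppercase is rejected,
/// since the header fields are defined as lowercase hex.
fn decode_hex<const N: usize>(s: &str) -> Option<[u8; N]> {
    let is_lower_hex = |b: u8| matches!(b, b'0'..=b'9' | b'a'..=b'f');
    if s.len() != N * 2 || !s.bytes().all(is_lower_hex) {
        return None;
    }
    let mut out = [0u8; N];
    for (i, byte) in out.iter_mut().enumerate() {
        *byte = u8::from_str_radix(&s[i * 2..i * 2 + 2], 16).ok()?;
    }
    Some(out)
}

impl TraceParent {
    /// Parse an incoming header value. Returns `None` for anything that is
    /// not a well-formed version-00 header (assumption: other versions are
    /// simply ignored in this sketch).
    pub fn parse(header: &str) -> Option<Self> {
        let mut parts = header.trim().split('-');
        let version = parts.next()?;
        let trace_id = decode_hex::<16>(parts.next()?)?;
        let parent_id = decode_hex::<8>(parts.next()?)?;
        let [flags] = decode_hex::<1>(parts.next()?)?;
        if version != "00" || parts.next().is_some() {
            return None;
        }
        // All-zero trace and parent IDs are invalid.
        if trace_id == [0u8; 16] || parent_id == [0u8; 8] {
            return None;
        }
        Some(Self { trace_id, parent_id, flags })
    }

    /// Render the value to attach to an outgoing request.
    pub fn to_header(&self) -> String {
        let hex = |bytes: &[u8]| -> String {
            bytes.iter().map(|b| format!("{:02x}", b)).collect()
        };
        format!(
            "00-{}-{}-{:02x}",
            hex(&self.trace_id),
            hex(&self.parent_id),
            self.flags
        )
    }

    /// Context to send downstream: same trace, with the caller's own span
    /// ID as the new parent.
    pub fn child(&self, span_id: [u8; 8]) -> Self {
        Self { parent_id: span_id, ..self.clone() }
    }

    /// Whether the `sampled` flag (lowest bit) is set.
    pub fn sampled(&self) -> bool {
        self.flags & 0x01 != 0
    }
}
```

A client would call `child(...)` with its current span ID and send `to_header()` along with the request; the server would `parse(...)` the incoming value and attach its spans to the same trace ID, which is what connects the client and server traces.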