about summary refs log tree commit diff
diff options
context:
space:
mode:
authorSimon Hauser <simon.hauser@helsinki-systems.de>2024-06-11T10·30+0200
committerclbot <clbot@tvl.fyi>2024-06-14T10·20+0000
commit118e69d81da73a4d55c6ab3ba7e1cc924bd3f506 (patch)
tree5508ae33d75f63b7af6315e07dbe046475734552
parenta857a2b978379ff0df66319a1eb6e1c1933cf11e (diff)
fix(tvix/tracing): reduce the error logs of otlp if collector is offline r/8273
The problem is that opentelemetry_otlp tonic batch exporter tries to
exports if either the `scheduled_delay` or if the
`max_export_batch_size` is reached. Per default the
`max_export_batch_size` is set to 512 spans, which means that we try to
export these spans once that counter is reached. Each export will then
try to connect to the exporter (if that not already happening) and will
result in a `tcp connect error`.
Increasing the max_export_batch_size to 4096 will then ensure that the
export only happens if the `scheduled_delay` is met after the 10
seconds.
`max_queue_size` is also increased, because `max_export_batch_size`
should not be greater than `max_queue_size`, so similar to the default
config its set to `max_export_batch_size * 4`.
This will reduce the amount of tries to otlp if the collector is not
available and otlp enabled.

Change-Id: Ic3430006e8a104fa3b34d274678cae55b3620ce9
Reviewed-on: https://cl.tvl.fyi/c/depot/+/11791
Tested-by: BuildkiteCI
Reviewed-by: flokli <flokli@flokli.de>
Autosubmit: Simon Hauser <simon.hauser@helsinki-systems.de>
-rw-r--r--tvix/docs/src/TODO.md10
-rw-r--r--tvix/tracing/src/lib.rs17
2 files changed, 15 insertions, 12 deletions
diff --git a/tvix/docs/src/TODO.md b/tvix/docs/src/TODO.md
index f07bfa122a6d..195f2c6945c8 100644
--- a/tvix/docs/src/TODO.md
+++ b/tvix/docs/src/TODO.md
@@ -216,16 +216,6 @@ logs etc, but this is something requiring a lot of designing.
  - Maybe drop `--log-level` entirely, and only use `RUST_LOG` env exclusively?
    `debug`,`trace` level across all crates is a bit useless, and `RUST_LOG` can
    be much more granular…
- - The OTLP stack is quite spammy if there's no OTLP collector running on
-   localhost.
-   https://opentelemetry.io/docs/specs/otel/configuration/sdk-environment-variables/
-   mentions a `OTEL_SDK_DISABLED` env var, but it defaults to false, so they
-   suggest enabling OTLP by default.
-   We currently have a `--otlp` cmdline arg which explicitly needs to be set to
-   false to stop it, in line with that "enabled by default" philosophy
-   Do some research if we can be less spammy. While OTLP support is
-   feature-flagged, it should not get in the way too much, so we can actually
-   have it compiled in most of the time.
  - gRPC trace propagation (cl/10532 + @simon)
    We need to wire trace propagation into our gRPC clients, so if we collect
    traces both for the client and server they will be connected.
diff --git a/tvix/tracing/src/lib.rs b/tvix/tracing/src/lib.rs
index 9cfe5afa52d2..c8c1385c3fba 100644
--- a/tvix/tracing/src/lib.rs
+++ b/tvix/tracing/src/lib.rs
@@ -10,7 +10,7 @@ use opentelemetry::KeyValue;
 #[cfg(feature = "otlp")]
 use opentelemetry_sdk::{
     resource::{ResourceDetector, SdkProvidedResourceDetector},
-    trace::BatchConfig,
+    trace::BatchConfigBuilder,
     Resource,
 };
 
@@ -173,7 +173,20 @@ impl TracingBuilder {
                 let tracer = opentelemetry_otlp::new_pipeline()
                     .tracing()
                     .with_exporter(opentelemetry_otlp::new_exporter().tonic())
-                    .with_batch_config(BatchConfig::default())
+                    .with_batch_config(
+                        BatchConfigBuilder::default()
+                            // the default values for `max_export_batch_size` is set to 512, which we will fill
+                            // pretty quickly, which will then result in an export. We want to make sure that
+                            // the export is only done once the schedule is met and not as soon as 512 spans
+                            // are collected.
+                            .with_max_export_batch_size(4096)
+                            // analog to default config `max_export_batch_size * 4`
+                            .with_max_queue_size(4096 * 4)
+                            // only force an export to the otlp collector every 10 seconds to reduce the amount
+                            // of error messages if an otlp collector is not available
+                            .with_scheduled_delay(std::time::Duration::from_secs(10))
+                            .build(),
+                    )
                     .with_trace_config(opentelemetry_sdk::trace::config().with_resource({
                         // use SdkProvidedResourceDetector.detect to detect resources,
                         // but replace the default service name with our default.