author: Florian Klink <flokli@flokli.de> 2023-11-14T17·36+0200
committer: flokli <flokli@flokli.de> 2023-11-14T21·46+0000
commit: abe099b6ba053beaad2c1cb6fc01179578651920
tree: 454552824c9ac5dd9a5a543b658d3beb0ed22bd4 /users/flokli/archeology
parent: 1091b1e6230711722193f92adb6d8ebcf7396f1a
fix(users/flokli/archeology/parse_bucket_logs): fix regex and skip r/7013
It seems the regex is not perfect; it choked on a single log line:

```
Nov 13 03:10:19 archeology-ec2 59nkrwmih3ywaxrgxqj79pn395fs6m17-parse-bucket-logs-continuously[11105]: Code: 117. DB::Exception: Line "d57bd890fbd1ae16625bdb8168064125e013198099b7e1b3c24878a4d03c3ab8 nix-cache [12/Nov/2023:09:13:02 +0000] xxx.xx.xxx.xxx - VB7SJVZ108DSSN67 REST.POST.OBJECT index.html "POST /index.html HTTP/1.1" 405 MethodNotAllowed 348 - 4 - "-" "Mozilla/5.0 (Macintosh;                 Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML,                 like Gecko) Chrome/39.0.2171.95 Safari/537.36" - 0bFdGKbi0n9JHXU1a2hijcJwmYdc6lG2xgbdozc3wS6mlUkBE7ssrQCHIDdOLebo78o2cGbhivY= - ECDHE-RSA-AES128-GCM-SHA256 - nix-cache.s3.amazonaws.com TLSv1.2 - -" doesn't match the regexp.: (in file/uri log/2023-11-12-10-19-50-80805A702ECF65EB): (at row 5)
```

This was due to the user-agent field. The regex is now fixed.
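
The quoted user-agent in the line above contains runs of consecutive
spaces, which a pattern of `\S+` groups joined by single spaces cannot
step across. As an illustration only (this is not part of the change,
and not how ClickHouse parses its input), here is a minimal Rust sketch
that splits a log line into fields while keeping bracketed and
double-quoted spans together. The name `split_fields` is hypothetical,
and the sketch assumes quotes are never escaped inside a field.

```rust
/// Split one access log line into fields.
/// Spans wrapped in `[...]` or `"..."` are kept as a single field, so
/// whitespace inside a quoted user-agent does not shift later columns.
/// Assumption: a quote or bracket only opens a span at the start of a field,
/// and quotes are never escaped inside a span.
fn split_fields(line: &str) -> Vec<String> {
    let mut fields = Vec::new();
    let mut cur = String::new();
    // Closing delimiter of the span we are currently inside, if any.
    let mut closer: Option<char> = None;

    for c in line.chars() {
        match closer {
            // Inside a span: copy everything, including spaces, until it closes.
            Some(end) => {
                cur.push(c);
                if c == end {
                    closer = None;
                }
            }
            None => match c {
                // Outside a span: a space ends the current field (if any).
                ' ' => {
                    if !cur.is_empty() {
                        fields.push(std::mem::take(&mut cur));
                    }
                }
                '"' if cur.is_empty() => {
                    cur.push(c);
                    closer = Some('"');
                }
                '[' if cur.is_empty() => {
                    cur.push(c);
                    closer = Some(']');
                }
                _ => cur.push(c),
            },
        }
    }
    if !cur.is_empty() {
        fields.push(cur);
    }
    fields
}
```

Called on a line shaped like the one above, this should return the
timestamp, the request line and the user-agent as one field each,
regardless of how many spaces the user-agent contains.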

The request itself is fun (someone trying to POST an index.html to the
bucket), and we should probably filter this on the Fastly side already,
not via IAM.

In any case, there's no point in failing the whole parse when a single
line doesn't match the regex; we can just skip such lines.
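
The skip-instead-of-fail idea can be sketched in plain Rust as follows.
This is only an analogy for what `format_regexp_skip_unmatched = 1`
asks ClickHouse to do, not its implementation. `parse_lenient`,
`Parsed` and the toy `parse_status_bytes` parser are hypothetical names
made up for the example.

```rust
/// Result of parsing a batch of lines leniently.
struct Parsed<T> {
    rows: Vec<T>,
    /// Number of lines the parser rejected.
    skipped: usize,
}

/// Apply `parse` to every line, keeping the successes and counting the
/// failures instead of aborting on the first line that does not parse.
fn parse_lenient<T>(input: &str, parse: impl Fn(&str) -> Option<T>) -> Parsed<T> {
    let mut out = Parsed { rows: Vec::new(), skipped: 0 };
    for line in input.lines() {
        match parse(line) {
            Some(row) => out.rows.push(row),
            None => out.skipped += 1,
        }
    }
    out
}

/// Toy stand-in for a real log parser: accepts exactly "<status> <bytes>",
/// both numeric, and rejects everything else.
fn parse_status_bytes(line: &str) -> Option<(u16, u64)> {
    let mut it = line.split(' ');
    let status: u16 = it.next()?.parse().ok()?;
    let bytes: u64 = it.next()?.parse().ok()?;
    if it.next().is_some() {
        return None; // trailing fields: treat as unmatched
    }
    Some((status, bytes))
}
```

Keeping a count of skipped lines, rather than dropping them silently,
makes it possible to notice when a pattern starts rejecting more input
than expected.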

For the sake of completeness, logs for that day have been reprocessed
and reuploaded.

Change-Id: Id98a7167a381cda06d150ad5118ee9e70ead277e
Reviewed-on: https://cl.tvl.fyi/c/depot/+/10034
Tested-by: BuildkiteCI
Reviewed-by: flokli <flokli@flokli.de>
Diffstat (limited to 'users/flokli/archeology')
-rw-r--r-- users/flokli/archeology/parse_bucket_logs.rs | 3
1 files changed, 2 insertions, 1 deletions
diff --git a/users/flokli/archeology/parse_bucket_logs.rs b/users/flokli/archeology/parse_bucket_logs.rs
index c794222f5b7d..3ab2e133b34c 100644
--- a/users/flokli/archeology/parse_bucket_logs.rs
+++ b/users/flokli/archeology/parse_bucket_logs.rs
@@ -31,7 +31,8 @@ fn main() -> ExitCode {
     )
     ORDER BY timestamp ASC
     SETTINGS
-        format_regexp = '(\\S+) (\\S+) \\[(.*)\\] (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) ((?:\\S+ \\S+ \\S+)|\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+).*',
+        format_regexp_skip_unmatched = 1,
+        format_regexp = '(\\S+) (\\S+) \\[(.*)\\] (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) ((?:\\S+ \\S+ \\S+)|\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) ("\\S+") (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+).*',
         output_format_parquet_compression_method = 'zstd'
     INTO OUTFILE '{}' FORMAT Parquet"#, input_files, output_file));