author Florian Klink <flokli@flokli.de> 2023-11-11T10·59+0200
committer flokli <flokli@flokli.de> 2023-11-11T12·24+0000
commit 281cb93ba808b73d4ea4ce86f762bbcb504a09da
tree 4f9b438c38784df6263520d784dcf3dd95fbb6fe
parent aaf53614b35aeec2cf707bdc63457ff0dac42b84
feat(users/flokli/nixos/archeology-ec2): add parse-bucket-logs r/6993
This adds an `archeology-parse-bucket-logs` CLI tool to `$PATH`.

It can be invoked like this:

```
archeology-parse-bucket-logs http://nix-cache-log.s3.amazonaws.com/log/2023-11-10-00-* bucket_logs_2023-11-10-00.pq.zstd
```

… and will produce a zstd-compressed Parquet file for (roughly) that
time range.
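
The two arguments in the example share one hour stamp, so they can be derived from a timestamp. A minimal sketch, assuming the URL prefix and the output naming simply follow the example above (`hourly_args` is a hypothetical helper, not part of the tool):

```python
from datetime import datetime

# Assumption: prefix and file naming are copied from the example invocation.
LOG_PREFIX = "http://nix-cache-log.s3.amazonaws.com/log/"


def hourly_args(hour: datetime) -> tuple[str, str]:
    """Return (log URL glob, output file name) for the hour containing `hour`."""
    stamp = hour.strftime("%Y-%m-%d-%H")
    return (f"{LOG_PREFIX}{stamp}-*", f"bucket_logs_{stamp}.pq.zstd")
```

Calling `hourly_args(datetime(2023, 11, 10, 0))` reproduces the two arguments shown in the example.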

As the EC2 instance credentials don't give access to the logs bucket
(yet), other AWS credentials need to be provided.

This can be accomplished by using "AWS_ACCESS_KEY_ID",
"AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN" from
"Option 2: Manually add a profile to your AWS credentials file (Short-
term credentials)" in AWS IAM Identity Center.

Processing logs for a one-hour range takes a minute or two; the
resulting zstd-compressed Parquet file is around 40-80M in size.

Processing logs for a whole day takes some 25 minutes, due to the sheer
amount of data (12 GB of raw log data, distributed among 450k individual
files, 20 million log lines), and at least ClickHouse isn't able to
parse the resulting Parquet file back in:

> Code: 36. DB::Exception: IOError: Couldn't deserialize thrift: MaxMessageSize reached
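
For a sense of scale, the whole-day figures above work out as follows. This is plain arithmetic on the quoted numbers, assuming "12 GB" means 12 * 10**9 bytes:

```python
# Figures quoted in the message; decimal gigabytes are an assumption.
RAW_BYTES = 12 * 10**9
FILES = 450_000
LINES = 20_000_000
SECONDS = 25 * 60

bytes_per_file = RAW_BYTES / FILES      # average log file size
lines_per_file = LINES / FILES          # average lines per log file
bytes_per_line = RAW_BYTES / LINES      # average log line length
files_per_second = FILES / SECONDS      # throughput over the 25-minute run
lines_per_second = LINES / SECONDS
```

That is roughly 27 kB and 44 lines per file, about 600 bytes per line, and around 300 files per second.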

For future automation tasks, it's probably better to run this once an
hour and join the data later on.
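
An hourly schedule for one day could be enumerated like this. A sketch only, assuming the same URL prefix and output naming as the example invocation (`commands_for_day` is a hypothetical helper; joining the hourly files is not covered here):

```python
from datetime import date, datetime, timedelta

# Assumption: prefix and file naming are copied from the example invocation.
LOG_PREFIX = "http://nix-cache-log.s3.amazonaws.com/log/"
TOOL = "archeology-parse-bucket-logs"


def commands_for_day(day: date) -> list[list[str]]:
    """Build one argument list per hour of `day`, 24 in total."""
    start = datetime(day.year, day.month, day.day)
    commands = []
    for offset in range(24):
        stamp = (start + timedelta(hours=offset)).strftime("%Y-%m-%d-%H")
        commands.append(
            [TOOL, f"{LOG_PREFIX}{stamp}-*", f"bucket_logs_{stamp}.pq.zstd"]
        )
    return commands
```

Each entry can be handed to `subprocess.run(cmd, check=True)`; passing an argument list rather than a shell string keeps the `*` away from shell expansion.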

Change-Id: I6c8108c0ec17dc8d4e2dbe923175553325210a5c
Reviewed-on: https://cl.tvl.fyi/c/depot/+/10007
Tested-by: BuildkiteCI
Reviewed-by: raitobezarius <tvl@lahfa.xyz>