author: Florian Klink <flokli@flokli.de> 2024-08-18T16·17+0300
committer: flokli <flokli@flokli.de> 2024-08-19T10·07+0000
commit: bb5d7c96783656a5c1be4bb93914032f40c37ab4
tree: b52139c28ae7accd31eff1c2006d0a50bf0f0d7c
parent: 98863e731221e3cd70f88b0f1e6646d958676e6c
feat(ops/pipelines): support buildkite retries r/8517
cl/12228 enabled automatic retries for some flaky tests, which
generally works, as can be seen in
https://buildkite.com/tvl/depot/builds/35893

However, ":duck:" still reports the build as failing, because it fails
whenever the number of failed steps is nonzero, and a job that failed
before being retried still counts as failed.

We cannot check for the overall status of the build, as it's still
"RUNNING", but instead of counting all failed steps so far, we can query
all failed jobs and then filter out the ones that were already retried.

Change-Id: Ib9d27587c8a8ba7970850812c4302fecdc4482e7
Reviewed-on: https://cl.tvl.fyi/c/depot/+/12233
Tested-by: BuildkiteCI
Reviewed-by: tazjin <tazjin@tvl.su>
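
The filtering described in the commit message can be sketched outside the pipeline as follows. This is a minimal Python analogue of the shell and jq logic in the diff, not the code the pipeline runs. The helper names `build_query` and `count_unretried_failures` and the sample response are hypothetical; the response shape is assumed from the jq path `.data.build.jobs.edges` used in the change.

```python
import json


def build_query(build_id: str) -> str:
    """Return a JSON request body asking for failed jobs and their retried flag.

    Hypothetical helper: it mirrors the query string in the diff, but lets
    json.dumps handle the quoting instead of hand-written shell escapes.
    """
    query = (
        'query BuildStatusQuery { build(uuid: "%s") { '
        "jobs(passed: false, first: 500) { edges { node { "
        "... on JobTypeCommand { retried } } } } } }" % build_id
    )
    return json.dumps({"query": query})


def count_unretried_failures(response: dict) -> int:
    """Count failed jobs whose retried flag is exactly false.

    Like the jq filter `select(.node.retried == false)`, nodes without a
    `retried` field (job types other than JobTypeCommand) are not counted.
    """
    edges = response["data"]["build"]["jobs"]["edges"]
    return sum(
        1 for edge in edges if (edge.get("node") or {}).get("retried") is False
    )


# Made-up sample response (assumption, not real API output): one failed job
# that was retried, one failed job that was not, and one non-command job.
sample = {
    "data": {
        "build": {
            "jobs": {
                "edges": [
                    {"node": {"retried": True}},
                    {"node": {"retried": False}},
                    {"node": {}},
                ]
            }
        }
    }
}

failed = count_unretried_failures(sample)
```

With this sample, a plain count of failed jobs would be 3, while the filtered count is 1, since only the job that failed without being retried remains.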
-rw-r--r-- ops/pipelines/static-pipeline.yaml 10
1 file changed, 6 insertions, 4 deletions
diff --git a/ops/pipelines/static-pipeline.yaml b/ops/pipelines/static-pipeline.yaml
index af4f9d784e60..a7eea2eb97d8 100644
--- a/ops/pipelines/static-pipeline.yaml
+++ b/ops/pipelines/static-pipeline.yaml
@@ -88,10 +88,12 @@ steps:
     continue_on_failure: true
 
   # Exit with success or failure depending on whether any other steps
-  # failed.
+  # failed (but not retried).
   #
   # This information is checked by querying the Buildkite GraphQL API
-  # and fetching the count of failed steps.
+  # and fetching all failed steps, then filtering out the ones that were
+  # retried (retried jobs create new jobs, which would also show up in the
+  # query).
   #
   # This step must be :duck: (yes, really!) because the post-command
   # hook will inspect this name.
@@ -109,8 +111,8 @@ steps:
       readonly FAILED_JOBS=$(curl 'https://graphql.buildkite.com/v1' \
         --silent \
         -H "Authorization: Bearer $(cat ${BUILDKITE_TOKEN_PATH})" \
-        -d "{\"query\": \"query BuildStatusQuery { build(uuid: \\\"$BUILDKITE_BUILD_ID\\\") { jobs(passed: false) { count } } }\"}" | \
-        jq -r '.data.build.jobs.count')
+        -d "{\"query\": \"query BuildStatusQuery { build(uuid: \\\"$BUILDKITE_BUILD_ID\\\") { jobs(passed: false, first: 500 ) { edges { node { ... on JobTypeCommand { retried } } } } } }\"}" | \
+        jq -r '.data.build.jobs.edges | map(select(.node.retried == false)) | length')
 
       echo "$$FAILED_JOBS build jobs failed."