# January 23rd, 2019 Prow Outage

Created By: Benjamin Elder <bentheelder@google.com>
Last Modified: 2019-01-23

## Summary

The main Kubernetes production Prow deployment, [prow.k8s.io], stopped responding
to GitHub events because the webhook handler ("[hook]") was down.

This resulted in an outage of unknown length, bounded at a maximum of 16 hours.
Based on user reports it appears to have _actually_ lasted around 4-5 hours,
but the start time is not fully known.

## Impact

Prow did not respond to any GitHub events, including our various slash commands,
test triggering on PR changes, etc.

Already-scheduled and time-based (periodic) test runs continued,
but used the configuration from before the non-validating config merged.

## Root Cause

When the job config was [automatically updated] from a [preset refactor PR],
components failed to load the new configuration due to errors like:

> {"component":"hook","error":"duplicated preset label : preset-aws-credential","level":"fatal","msg":"Error starting config agent.","time":"2019-01-23T16:08:30Z"}

This is because the PR refactored some job configuration presets to use
a common label key with different label values. The config "agent" validation
rejected this format, but for some reason this was not caught in presubmit.
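For illustration, a minimal sketch of the shape of preset config that triggers
this error. The label key is taken from the log line above; the label values
and env contents are hypothetical:

```yaml
presets:
# Two presets sharing one label key, distinguished only by label value.
# The deployed hook rejected any repeated preset label key, regardless of
# value, failing with: "duplicated preset label : preset-aws-credential"
- labels:
    preset-aws-credential: "default"    # hypothetical value
  env:
  - name: AWS_SHARED_CREDENTIALS_FILE   # illustrative env var
    value: /etc/aws-cred/credentials
- labels:
    preset-aws-credential: "scale"      # hypothetical second value
  env:
  - name: AWS_SHARED_CREDENTIALS_FILE
    value: /etc/aws-cred-scale/credentials
```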
A [previous PR][preset config behavior] enabled this format but had not
been deployed to prow.k8s.io yet.

At some time after this config update, the [hook] pods went into
CrashLoopBackOff with this error, causing our webhook endpoint
(a GKE load balancer in front of the deployment) to serve 502s.


## Detection

This was "detected" by project members noticing various failures of Prow to
respond to events. After reports in #sig-testing and #testing-ops,
@stevekuznetsov pinged @BenTheElder to check on this, as our oncall was
not online yet.

## Resolution

Initially a revert of the related configuration change was merged manually,
as the CI could not respond to GitHub. This was insufficient, and a follow-up
PR refactoring the configuration to use distinct preset label keys was made.

Both of these changes were merged manually due to the inability to trigger the
presubmits.

The configuration was then manually re-synced to the cluster using a break-glass
[config repair script].


## Lessons Learned

### What went well

Most Prow components continued to function correctly using the previously
valid configuration they had already loaded. Unfortunately, for some reason
hook did not, possibly because it restarted and was no longer able to load
any configuration.

### What went wrong

PR review failure + config versioning:

A PR merged that changed [preset config behavior] along with validation behavior,
and took advantage of this new behavior. Because we do not deploy a new Prow
instance on every merge, this validation no longer matched the deployed
Prow version.

Currently the only safe way to do this would be:

- Loosen the config loading behavior in a PR, ensuring that tight validation
  matching the current Prow deployment remains in effect. This can be tricky
  when changing that behavior.
- Deploy an updated Prow
- Loosen the config validation behavior
- Begin leveraging the new behavior

This should have been caught in review.

### Where we got lucky

We already had tooling to resolve this from the previous outage
(the [config repair script]).

## Action Items

- We should consider using a versioned API for Prow config, and banning
  potentially incompatible changes from merging without moving to a new API version

- We should have alerting on config load failures. We should be able to configure
  Stackdriver logging to send an alert when these occur (a hedged sketch follows
  the appendix below).

## Timeline

All times in PST

2019-01-22

- 2:54 PM: [config behavior change PR][preset config behavior] merged

- 3:28 PM: bad config [uploaded][automatically updated]

2019-01-23

- 4:49 AM: first [recorded instance] of a user seeing Prow being unresponsive to events

- 7:10 AM: first [#testing-ops recorded instance], a channel some have notifications enabled for

- 7:18 AM: [oncall first pinged]

- 8:02 AM: @stevekuznetsov reports that GitHub webhooks appear to be working fine on their end;
  first report that 503 responses are being served in response to our webhooks

- 8:04 AM: @stevekuznetsov pings @BenTheElder in #testing-ops to take a look

- 8:07 AM: @BenTheElder begins looking, identifies that [hook] is in CrashLoopBackOff

- 8:12 AM: @BenTheElder: root cause identified, revert / config fix begins,
  configmap is rewritten with the repair script

- 8:27 AM: @BenTheElder announces resolution is confirmed complete

- 8:36 AM: kubernetes-dev mailing list is notified of the outage, fixed status,
  and potential impact, along with links to track details.


## Appendix

[Discussion in #testing-ops]

TODO: k-dev email


[prow.k8s.io]: https://prow.k8s.io
[hook]: https://github.com/kubernetes/test-infra/tree/master/prow/cmd/hook
[preset refactor PR]: https://github.com/kubernetes/test-infra/pull/10886
[preset config behavior]: https://github.com/kubernetes/test-infra/pull/10868/files
[automatically updated]: https://github.com/kubernetes/test-infra/pull/10886#issuecomment-456605785
[config repair script]: https://github.com/kubernetes/test-infra/blob/1cdb83860cd2e96a4da45bcf88c543530c84ffb1/experiment/maintenance/recreate_prow_configmaps.py
[Discussion in #testing-ops]: https://kubernetes.slack.com/archives/C7J9RP96G/p1548256258161200
[recorded instance]: https://kubernetes.slack.com/archives/C09QZ4DQB/p1548247766835300
[#testing-ops recorded instance]: https://kubernetes.slack.com/archives/C7J9RP96G/p1548256258161200
[oncall first pinged]: https://kubernetes.slack.com/archives/C7J9RP96G/p1548256737161700
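As a starting point for the alerting action item above, a minimal sketch of a
Stackdriver (Cloud Monitoring) alert policy. It assumes a user-defined log-based
metric named `prow_config_load_failures` counting hook's fatal
"Error starting config agent." log entries; the metric name, filter, and
threshold are hypothetical, not a tested policy:

```yaml
# Hypothetical policy file, e.g. for:
#   gcloud alpha monitoring policies create --policy-from-file=policy.yaml
# Assumes the log-based metric "prow_config_load_failures" already exists.
displayName: Prow config load failure
combiner: OR
conditions:
- displayName: hook failed to load config
  conditionThreshold:
    # Fire whenever the failure count is nonzero in a 5 minute window.
    filter: >-
      metric.type="logging.googleapis.com/user/prow_config_load_failures"
      resource.type="k8s_container"
    comparison: COMPARISON_GT
    thresholdValue: 0
    duration: 0s
    aggregations:
    - alignmentPeriod: 300s
      perSeriesAligner: ALIGN_SUM
```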