# January 23rd, 2019 Prow Outage

Created By: Benjamin Elder <bentheelder@google.com>
Last Modified: 2019-01-23

## Summary

The main Kubernetes production Prow deployment, [prow.k8s.io], stopped responding
to GitHub events because the webhook handler ("[hook]") was down.

This resulted in an outage of unknown length, bounded at a maximum of 16 hours.
Based on user reports it appears to have _actually_ lasted around 4-5 hours,
but the start time is not fully known.

## Impact

Prow did not respond to any GitHub events, including our various slash commands,
test triggering on PR changes, etc.

Already-scheduled and time-based (periodic) test runs continued,
but used the configuration from before the non-validating config merged.

## Root Cause

When the job config was [automatically updated] from a [preset refactor PR],
components failed to load the new configuration due to errors like:

> {"component":"hook","error":"duplicated preset label : preset-aws-credential","level":"fatal","msg":"Error starting config agent.","time":"2019-01-23T16:08:30Z"}

This is because the PR refactored some job configuration presets to use
a common label key with different label values. The config "agent" validation
rejected this format, but for some reason this was not caught in presubmit.
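For illustration, a minimal sketch of the shape of preset config that triggers
this error. The label key is taken from the log line above; the label values
and env contents are hypothetical:

```yaml
presets:
# Two presets sharing one label key, distinguished only by label value.
# The deployed hook rejected any repeated preset label key, regardless of
# value, failing with: "duplicated preset label : preset-aws-credential"
- labels:
    preset-aws-credential: "default"    # hypothetical value
  env:
  - name: AWS_SHARED_CREDENTIALS_FILE   # illustrative env var
    value: /etc/aws-cred/credentials
- labels:
    preset-aws-credential: "scale"      # hypothetical second value
  env:
  - name: AWS_SHARED_CREDENTIALS_FILE
    value: /etc/aws-cred-scale/credentials
```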
A [previous PR][preset config behavior] enabled this format but had not
been deployed to prow.k8s.io yet.

At some time after this config update, the [hook] pods went into
CrashLoopBackOff with this error, causing our webhook endpoint
(a GKE load balancer in front of the deployment) to serve 502s.


## Detection

This was "detected" by project members noticing various failures of Prow to
respond to events. After reports in #sig-testing and #testing-ops,
@stevekuznetsov pinged @BenTheElder to check on this, as our oncall was
not online yet.

## Resolution

Initially a revert of the related configuration change was merged manually,
as the CI could not respond to GitHub. This was insufficient, and a follow-up
PR refactoring the configuration to use distinct preset label keys was made.

Both of these changes were merged manually due to the inability to trigger the
presubmits.

The configuration was then manually re-synced to the cluster using a break-glass
[config repair script].


## Lessons Learned

### What went well

Most Prow components continued to function correctly using the previously
valid configuration they had already loaded. Unfortunately, for some reason
hook did not, possibly because it restarted and was no longer able to load
any configuration.

### What went wrong

PR review failure + config versioning:

A PR merged that changed [preset config behavior] along with validation behavior,
and took advantage of this new behavior. Because we do not deploy a new Prow
instance on every merge, this validation no longer matched the deployed
Prow version.

Currently the only safe way to do this would be:

- Loosen the config loading behavior in a PR, ensuring that tight validation
  matching the current Prow deployment remains in effect. This can be tricky
  when changing that behavior.
- Deploy an updated Prow
- Loosen the config validation behavior
- Begin leveraging the new behavior

This should have been caught in review.

### Where we got lucky

We already had tooling to resolve this from the previous outage
(the [config repair script]).

## Action Items

- We should consider using a versioned API for Prow config, and banning
  potentially incompatible changes from merging without moving to a new API version

- We should have alerting on config load failures. We should be able to configure
  Stackdriver logging to send an alert when these occur (a hedged sketch follows
  the appendix below).

## Timeline

All times in PST

2019-01-22

- 2:54 PM: [config behavior change PR][preset config behavior] merged

- 3:28 PM: bad config [uploaded][automatically updated]

2019-01-23

- 4:49 AM: first [recorded instance] of a user seeing Prow being unresponsive to events

- 7:10 AM: first [#testing-ops recorded instance], a channel some have notifications enabled for

- 7:18 AM: [oncall first pinged]

- 8:02 AM: @stevekuznetsov reports that GitHub webhooks appear to be working fine on their end;
  first report that 503 responses are being served in response to our webhooks

- 8:04 AM: @stevekuznetsov pings @BenTheElder in #testing-ops to take a look

- 8:07 AM: @BenTheElder begins looking, identifies that [hook] is in CrashLoopBackOff

- 8:12 AM: @BenTheElder: root cause identified, revert / config fix begins,
  configmap is rewritten with the repair script

- 8:27 AM: @BenTheElder announces resolution is confirmed complete

- 8:36 AM: kubernetes-dev mailing list is notified of the outage, fixed status,
  and potential impact, along with links to track details.


## Appendix

[Discussion in #testing-ops]

TODO: k-dev email


[prow.k8s.io]: https://prow.k8s.io
[hook]: https://github.com/kubernetes/test-infra/tree/master/prow/cmd/hook
[preset refactor PR]: https://github.com/kubernetes/test-infra/pull/10886
[preset config behavior]: https://github.com/kubernetes/test-infra/pull/10868/files
[automatically updated]: https://github.com/kubernetes/test-infra/pull/10886#issuecomment-456605785
[config repair script]: https://github.com/kubernetes/test-infra/blob/1cdb83860cd2e96a4da45bcf88c543530c84ffb1/experiment/maintenance/recreate_prow_configmaps.py
[Discussion in #testing-ops]: https://kubernetes.slack.com/archives/C7J9RP96G/p1548256258161200
[recorded instance]: https://kubernetes.slack.com/archives/C09QZ4DQB/p1548247766835300
[#testing-ops recorded instance]: https://kubernetes.slack.com/archives/C7J9RP96G/p1548256258161200
[oncall first pinged]: https://kubernetes.slack.com/archives/C7J9RP96G/p1548256737161700
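As a starting point for the alerting action item above, a minimal sketch of a
Stackdriver (Cloud Monitoring) alert policy. It assumes a user-defined log-based
metric named `prow_config_load_failures` counting hook's fatal
"Error starting config agent." log entries; the metric name, filter, and
threshold are hypothetical, not a tested policy:

```yaml
# Hypothetical policy file, e.g. for:
#   gcloud alpha monitoring policies create --policy-from-file=policy.yaml
# Assumes the log-based metric "prow_config_load_failures" already exists.
displayName: Prow config load failure
combiner: OR
conditions:
- displayName: hook failed to load config
  conditionThreshold:
    # Fire whenever the failure count is nonzero in a 5 minute window.
    filter: >-
      metric.type="logging.googleapis.com/user/prow_config_load_failures"
      resource.type="k8s_container"
    comparison: COMPARISON_GT
    thresholdValue: 0
    duration: 0s
    aggregations:
    - alignmentPeriod: 300s
      perSeriesAligner: ALIGN_SUM
```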