# February 5th, 2019 Partial Prow Outage

Created By: [Katharine Berry]
Last Modified: 2019-02-05

## Summary

The main Kubernetes production Prow cluster, [prow.k8s.io], became unable to correctly
execute jobs due to a change that resulted in the pod namespace being ignored. Additionally,
even had jobs been running, many PRs would have been unable to merge due to outdated GitHub
branch protection settings. This resulted in a 47-minute period during which nothing could merge.

## Impact

Most jobs started by prow failed to complete initialisation, so the jobs appeared to be hung.
Additionally, prow did not report statuses for optional jobs, but GitHub's branch protection
continued to expect them. The outage lasted for 47 minutes, during which time 415 jobs were
unable to start.

Both issues were resolved after 47 minutes, and all stuck jobs were restarted correctly. Jobs
with missing statuses did not have the statuses applied, and so will be unable to resolve until the
PRs are corrected on a case-by-case basis.

## Root Cause

After Prow was updated, plank began creating pods in the `default` namespace of the cluster,
rather than the expected `test-pods` namespace. This caused almost all pods to get stuck trying
to start as they attempted to access secrets that were not present in the `default` namespace.
This resulted in all nontrivial jobs failing.

The failure to create pods in the correct namespace was caused by [a typo] introduced during a
recent refactor of the kubernetes clients, causing plank to read the `ProwJobNamespace` instead of
the `PodNamespace`. Prior deployments of the same change set these to the same value, so the error
went unnoticed.

Separately, another change caused skipped jobs to no longer have a status reported indicating that
they had been skipped.
A change to `branchprotector` was made to update the GitHub branch
protection configuration to match. Although an updated `branchprotector` was deployed and run in
advance, the deployment automation that runs on every commit reverted that update, and an older
version later ran and undid the branch protection changes. This left PRs updated during the outage
unable to merge even after the outage ended.

## Detection

The `branchprotector` issue was detected within ten minutes of the initial rollout due to a second
PR intended to complete the rollout getting wedged. The pod namespace issue was detected while
attempting to roll back, because the tests on the rollback PR were apparently stuck, and the deploy
job itself was wedged.

## Resolution

Resolution was initially attempted by creating a revert PR for the original deployment PR. This PR
could not be automatically merged due to incorrect branch protection settings, so it was merged
manually. Prow was then unable to deploy itself, because the deployment job was in the
wrong namespace and so stuck trying to mount missing secrets. Prow was manually deployed
using `prow/deploy.sh`, and service was resumed shortly thereafter.

The 415 spurious pods created during the outage were later manually cleaned up using
`kubectl delete pods -l created-by-prow=true`.

## Lessons learned

### What went well

- Most of prow continued to function while job execution was unavailable.
- Stuck jobs were automatically rerun after the revert successfully deployed.

### What went poorly

- Breaking something that approves and then deploys itself makes rolling back trickier.
- The strong ordering requirement (branchprotector must be deployed at least four hours before
  anything else) added complexity to the rollout, which failed.
- PRs that are missing "optional" statuses did not automatically recover after the outage and must
  be manually kicked to recover them.
- Merging multiple major changes simultaneously is less than ideal.

### Where we got lucky

- Apparently nobody was watching their PRs very closely around 9am PST, so nobody noticed.
- It appears that very few PRs have missing statuses.

## Action items

- Before retrying, ensure the updated branchprotector run is not reverted: [#11134]

## Timeline

All times in PST, on February 5th, 2019.

* [08:47] Prow's auto-generated bump PR is merged, triggering a deploy. **Outage begins.**
* [08:57] @Katharine notices that some GitHub statuses that should've been removed earlier are still
  present, thereby blocking any PR that omits a conditionally triggered status.
* [09:02] @Katharine creates a PR to roll back the change pending investigation.
* [09:12] @Katharine manually merges the rollback PR, which is itself blocked on missing status
  contexts.
* [09:19] @Katharine begins suspecting that prow is unable to deploy itself for some reason.
* [09:27] @Katharine determines that the deploy job was created with the wrong namespace.
* [09:29] @Katharine manually rolls back prow.
* [09:34] Prow is confirmed to be functional. **Outage ends.**
* [10:01] @Katharine verifies that all pods created during the outage are in the wrong namespace.
* [10:33] @stevekuznetsov finds and fixes the typo causing prow to use the wrong namespace.
* [10:45] @Katharine deletes all the stray pods that are still trying to start.
* [14:31] @Katharine determines why an older branchprotector reverted the required branch protection
  changes.

## Appendix

### #testing-ops chat logs

Edited for clarity to remove discussion of an unrelated openshift prow outage.

---

>Katharine [8:27 AM]
It's prow bump time!
For real, probably! Unlike the last several times.
>
>stevekuznetsov [8:33 AM]
dope
let's do it
>
>Katharine [8:57 AM]
hmmmmmm
not sure the branchprotector run actually had the desired effect.
ugh, this is not right
rolling back
>
>bentheelder [9:03 AM]
shipit: :ship_it_parrot:
rollback: :rollback_parrot: (edited)
>
>stevekuznetsov [9:04 AM]
lol
what happened?
>
>Katharine [9:05 AM]
Unclear
>
>stevekuznetsov [9:05 AM]
did another BP run happen between your manual one and right now?
what caused you to panic?
tide logs?
>
>Katharine [9:05 AM]
I updated the cronjob too so it shouldn't have mattered
¯\\\_(ツ)\_/¯
>
>stevekuznetsov [9:05 AM]
ok, true
ok
rope me in as much as you want on the debugging
happy to jump into a hangouts session
just lmk
>
>Katharine [9:06 AM]
It's possible I'm just confused by what the expected statuses are, but I'm guessing that we shouldn't be seeing "required" statuses that are never reported.
(and which are not always run)
>
>stevekuznetsov [9:07 AM]
YUes
>
>bentheelder [9:07 AM]
actually you can have required for `run_if_changed` (?)
>
>stevekuznetsov [9:07 AM]
If the job posts a status conditionally we should not see a required status for it
No, not anymore
`run_if_changed` will be posted _if_ it runs
>
>Katharine [9:07 AM]
That is what made me panic
>
>stevekuznetsov [9:07 AM]
You should not see required statuses in branch protection for jobs that are `run_if_changed` (or not `always_run` in general)
>
>bentheelder [9:07 AM]
wait, so how do you require a job that is not always_run?
previously you could require a job to be (run && passed) || skipped (edited)
>
>stevekuznetsov [9:07 AM]
Right, so there are no longer any skipped statuses
>
>prow APP [9:12 AM]
*Warning:* Katharine (@Katharine) manually merged 2 commit(s) into master: https://github.com/kubernetes/test-infra/compare/8c1cd8046656...4956da7c3699
>
>bentheelder [9:14 AM]
(prow was stuck, rollback of the bump)
>
>Katharine [9:19 AM]
I'm not convinced that prow is actually in a sufficiently healthy state to roll itself back.
>
>bentheelder [9:19 AM]
we can manually run the yamls
>
>Katharine [9:19 AM]
Suspect we will have to
Trying to verify first
>
>Katharine [9:27 AM]
further update: it looks like pods are landing in the wrong namespace
which is why things are extremely broken now
>
>stevekuznetsov [9:27 AM]
... why?
>
>Katharine [9:28 AM]
Unclear
but nothing can mount anything and is consequently getting stuck (edited)
because they're in the wrong namespaces
>
>bentheelder [9:28 AM]
I'd guess the client code change didn't go smooth for our instance.
>
>stevekuznetsov [9:29 AM]
do you set PodNamespace or ProwJobNamespace in your config?
are you seeing pods in the build cluster in the wrong namespace
or prow infra service pods?
>
>Katharine [9:30 AM]
prowjob pods
>
>stevekuznetsov [9:31 AM]
pods created by `plank` to target the build cluster
and fulfill `prowjobs`?
>
>Katharine [9:31 AM]
At least the trusted ones, which was my immediate concern because it is why it can't roll itself back
>
>stevekuznetsov [9:31 AM]
those are not on the build cluster
but on the local cluster?
>
>Katharine [9:31 AM]
Yes
In their own namespace
>
>stevekuznetsov [9:31 AM]
can you paste plank logs from your deploymetn
at the start?
>
>Katharine [9:32 AM]
but they are instead ending up in the same one as the service pods
>
>stevekuznetsov [9:32 AM]
hrm
>
>Katharine [9:32 AM]
gimme a bit
trying to get it back up first
>
>stevekuznetsov [9:32 AM]
ok
i will look at the code
>
>Katharine [9:34 AM]
okay, manual rollback worked
>
>stevekuznetsov [9:35 AM]
how were you specifying which namespace would go with the trusted namespace?
it's not clear to me how you schedule pods explicitly to something other than `config.PodNamespace`
>
>Katharine [9:37 AM]
I think we're going to have to do something to recover all the jobs that got wedged. Or just tell people to retest I guess.
>
>Katharine [9:39 AM]
the stuck jobs are recovering on their own, so that's good.
>
>stevekuznetsov [9:41 AM]
yeah as long as the ProwJob record exists
`plank` will recover it
>
>Katharine [9:41 AM]
I thought so, but they were being slow and I was mildly concerned
>
>stevekuznetsov [9:41 AM]
ack
>
>Katharine [10:01 AM]
verified: for both the trusted cluster and the build cluster, all the plank-created pods got dropped in the `default` namespace instead of `test-pods`
(they end up in the correct cluster though)
>
>stevekuznetsov [10:31 AM]
what is your PodNamespace in the config?
>
>Katharine [10:32 AM]
`test-pods`: https://github.com/kubernetes/test-infra/blob/master/prow/config.yaml#L41
>
>stevekuznetsov [10:32 AM]
very interesting
oh
lmao
:facepalm:
https://github.com/kubernetes/test-infra/pull/11130
now
that will schedule to `test-pods`
but
did you have any clusters with more than one NS?
or is it always `test-pods` across all clusters
>
>Katharine [10:35 AM]
I'm unsure. To my knowledge it's just `test-pods`, but I may be wrong.
>
>stevekuznetsov [10:36 AM]
then that PR should fix you
>
>Katharine [11:09 AM]
I am still attempting to figure out why the branchprotector failed to drop required statuses
>
>Katharine [11:17 AM]
upon investigation, it looks like an old branchprotector ended up running again, though I'm not clear how or why.
>
>Katharine [11:27 AM]
I've now re-updated the cronjob, suspended it, manually invoked branchprotector, and verified the correct version is running. Assuming it does something reasonable, we can try this exciting adventure again at 3pm.
(which is roughly when it will finish)


[Katharine Berry]: https://github.com/Katharine
[a typo]: https://github.com/stevekuznetsov/test-infra/commit/d6078ec1b6ac6b79abaf7b1491a386ffb6cc93c2#diff-c3c219515b589f861f234b9c4074b769R122
[prow.k8s.io]: https://prow.k8s.io
[#11134]: https://github.com/kubernetes/test-infra/pull/11134