# February 5th, 2019 Partial Prow Outage

Created By: [Katharine Berry]  
Last Modified: 2019-02-05

## Summary

The main Kubernetes production Prow cluster, [prow.k8s.io], became unable to correctly
execute jobs due to a change that resulted in the pod namespace being ignored. Additionally,
even if jobs had been running, many PRs would have been unable to merge due to outdated
GitHub branch protection settings. This resulted in a 47-minute period during which nothing
could merge.

## Impact

Most jobs started by prow failed to complete initialisation, so the jobs appeared to be hung.
Additionally, prow did not report statuses for optional jobs, but GitHub branch protection
continued to expect them. The outage lasted for 47 minutes, during which time 415 jobs were
unable to start.

Both issues were resolved after 47 minutes, and all stuck jobs were restarted correctly. Jobs
with missing statuses did not have the statuses applied, and so the affected PRs will be unable
to merge until they are corrected on a case-by-case basis.

## Root Cause

After Prow was updated, plank began creating pods in the `default` namespace of the cluster,
rather than the expected `test-pods` namespace. This caused almost all pods to get stuck trying
to start as they attempted to access secrets that were not present in the `default` namespace.
This resulted in all nontrivial jobs failing.

The failure to create pods in the correct namespace was caused by [a typo] introduced during a
recent refactor of the kubernetes clients, which caused plank to read the `ProwJobNamespace`
instead of the `PodNamespace`. Prior deployments of the same change set these to the same value,
so the error wasn't noticed.
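
The following is a minimal sketch of the class of mistake (not the actual diff; the real fix is in
the linked commit), using a simplified stand-in for Prow's configuration: the refactored client used
the prowjob namespace where the pod namespace was intended, and on prow.k8s.io those values differ.

```go
package main

import "fmt"

// config is a simplified stand-in for the two Prow settings involved,
// not Prow's real configuration type.
type config struct {
	ProwJobNamespace string // where ProwJob objects live ("default" on prow.k8s.io)
	PodNamespace     string // where test pods must run ("test-pods" on prow.k8s.io)
}

func main() {
	cfg := config{ProwJobNamespace: "default", PodNamespace: "test-pods"}

	// The typo: the pod client was handed the prowjob namespace.
	fmt.Println("buggy: pods created in", cfg.ProwJobNamespace) // "default" -> secrets missing, pods hang
	// Intended behaviour: pods belong in the pod namespace.
	fmt.Println("fixed: pods created in", cfg.PodNamespace) // "test-pods"
}
```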

Separately, another change caused skipped jobs to no longer have a status reported indicating that
they had been skipped. A change to `branchprotector` was made to update the GitHub branch
protection configuration to match. Although an updated `branchprotector` was deployed and run in
advance, deployment automation that runs on all commits rolled the deployment back, and the older
version later ran and reverted the branch protection changes. This left PRs updated during the
outage unable to merge even after the outage ended.
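
The mechanism can be illustrated with a small check (a sketch using hypothetical job names, not
`branchprotector`'s actual logic): any context that branch protection still requires, but that a
skipped job no longer reports, blocks the PR indefinitely.

```go
package main

import "fmt"

// blockedContexts returns the required branch-protection contexts that can never be
// satisfied because the corresponding job was skipped and no longer reports any status.
// Illustrative only; the job names used below are hypothetical.
func blockedContexts(required []string, reported map[string]bool) []string {
	var blocked []string
	for _, ctx := range required {
		if !reported[ctx] {
			blocked = append(blocked, ctx)
		}
	}
	return blocked
}

func main() {
	required := []string{"pull-unit-test", "pull-integration-test"}
	// The integration job is run_if_changed; it was skipped for this PR and posts nothing.
	reported := map[string]bool{"pull-unit-test": true}
	fmt.Println(blockedContexts(required, reported)) // [pull-integration-test] -> the PR cannot merge
}
```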

## Detection

The `branchprotector` issue was detected within ten minutes of initial rollout due to a second PR to
complete the rollout getting wedged. The pod namespace issue was detected while attempting to roll
back, because the tests on the rollback PR were apparently stuck, and the deploy job itself was
wedged.

## Resolution

Resolution was initially attempted by creating a revert PR for the original deployment PR. This PR
could not be automatically merged due to incorrect branch protection settings, so it was merged
manually. Prow was then unable to deploy itself, because the deployment job's pod was created in
the wrong namespace and so got stuck trying to mount missing secrets. Prow was manually deployed
using `prow/deploy.sh`, and service resumed shortly thereafter.

The 415 spurious pods created during the outage were later manually cleaned up using
`kubectl delete pods -l created-by-prow=true`.

## Lessons learned

### What went well

- Most of prow continued to function while job execution was unavailable.
- Stuck jobs were automatically rerun after the revert successfully deployed.

### What went poorly

- Breaking something that approves and then deploys itself makes rolling back trickier.
- The strong ordering requirement (branchprotector must be deployed at least four hours before anything else)
  added complexity to the rollout, which failed.
- PRs that are missing "optional" statuses did not automatically recover after the outage and must
  be manually kicked to recover.
- Merging multiple major changes simultaneously is less than ideal.

### Where we got lucky

- Apparently nobody was watching their PRs very closely around 9am PST, so nobody noticed.
- It appears that very few PRs have missing statuses.

## Action items

- Before retrying, ensure the updated branchprotector run is not reverted: [#11134]

## Timeline

All times are in PST, on February 5th, 2019.

* [08:47] Prow's auto-generated bump PR is merged, triggering a deploy. **Outage begins.**
* [08:57] @Katharine notices that some GitHub statuses that should've been removed earlier are still
  present, thereby blocking any PR that omits a conditionally triggered status.
* [09:02] @Katharine creates a PR to roll back the change pending investigation.
* [09:12] @Katharine manually merges the rollback PR, which is itself blocked on missing status
  contexts.
* [09:19] @Katharine begins suspecting that prow is unable to deploy itself for some reason.
* [09:27] @Katharine determines that the deploy job was created with the wrong namespace.
* [09:29] @Katharine manually rolls back prow.
* [09:34] Prow is confirmed to be functional. **Outage ends.**
* [10:01] @Katharine verifies that all pods created during the outage are in the wrong namespace.
* [10:33] @stevekuznetsov finds and fixes the typo causing prow to use the wrong namespace.
* [10:45] @Katharine deletes all the stray pods that are still trying to start.
* [14:31] @Katharine determines why an older branchprotector reverted the required branch protection
  changes.

## Appendix

### #testing-ops chat logs

Edited for clarity to remove discussion of an unrelated openshift prow outage.

---
>Katharine [8:27 AM]  
It's prow bump time!  
For real, probably! Unlike the last several times.  
>
>stevekuznetsov [8:33 AM]  
dope  
let's do it  
>
>Katharine [8:57 AM]  
hmmmmmm  
not sure the branchprotector run actually had the desired effect.  
ugh, this is not right  
rolling back
>
>bentheelder [9:03 AM]  
shipit: :ship_it_parrot:  
rollback: :rollback_parrot: (edited)
>
>stevekuznetsov [9:04 AM]  
lol  
what happened?
>
>Katharine [9:05 AM]  
Unclear  
>
>stevekuznetsov [9:05 AM]  
did another BP run happen between your manual one and right now?  
what caused you to panic?  
tide logs?
>
>Katharine [9:05 AM]  
I updated the cronjob too so it shouldn't have mattered  
¯\\\_(ツ)\_/¯
>
>stevekuznetsov [9:05 AM]  
ok, true  
ok  
rope me in as much as you want on the debugging  
happy to jump into a hangouts session  
just lmk
>
>Katharine [9:06 AM]  
It's possible I'm just confused by what the expected statuses are, but I'm guessing that we shouldn't be seeing "required" statuses that are never reported.  
(and which are not always run)
>
>stevekuznetsov [9:07 AM]  
YUes
>
>bentheelder [9:07 AM]  
actually you can have required for `run_if_changed` (?)
>
>stevekuznetsov [9:07 AM]  
If the job posts a status conditionally we should not see a required status for it  
No, not anymore  
`run_if_changed` will be posted _if_ it runs
>
>Katharine [9:07 AM]  
That is what made me panic
>
>stevekuznetsov [9:07 AM]  
You should not see required statuses in branch protection for jobs that are `run_if_changed` (or not `always_run` in general)
>
>bentheelder [9:07 AM]  
wait, so how do you require a job that is not always_run? previously you could require a job to be (run && passed) || skipped (edited)
>
>stevekuznetsov [9:07 AM]  
Right, so there are no longer any skipped statuses  
>
>prow APP [9:12 AM]  
*Warning:* Katharine (@Katharine) manually merged 2 commit(s) into master: https://github.com/kubernetes/test-infra/compare/8c1cd8046656...4956da7c3699
>
>bentheelder [9:14 AM]  
(prow was stuck, rollback of the bump)
>
>Katharine [9:19 AM]  
I'm not convinced that prow is actually in a sufficiently healthy state to roll itself back.
>
>bentheelder [9:19 AM]  
we can manually run the yamls
>
>Katharine [9:19 AM]  
Suspect we will have to  
Trying to verify first
>
>Katharine [9:27 AM]  
further update: it looks like pods are landing in the wrong namespace  
which is why things are extremely broken now
>
>stevekuznetsov [9:27 AM]  
... why?
>
>Katharine [9:28 AM]  
Unclear  
but nothing can mount anything and is consequently getting stuck (edited)  
because they're in the wrong namespaces
>
>bentheelder [9:28 AM]  
I'd guess the client code change didn't go smooth for our instance.
>
>stevekuznetsov [9:29 AM]  
do you set PodNamespace or ProwJobNamespace in your config?  
are you seeing pods in the build cluster in the wrong namespace  
or prow infra service pods?
>
>Katharine [9:30 AM]  
prowjob pods
>
>stevekuznetsov [9:31 AM]  
pods created by `plank` to target the build cluster  
and fulfill `prowjobs`?
>
>Katharine [9:31 AM]  
At least the trusted ones, which was my immediate concern because it is why it can't roll itself back
>
>stevekuznetsov [9:31 AM]  
those are not on the build cluster  
but on the local cluster?
>
>Katharine [9:31 AM]  
Yes  
In their own namespace
>
>stevekuznetsov [9:31 AM]  
can you paste plank logs from your deploymetn  
at the start?
>
>Katharine [9:32 AM]  
but they are instead ending up in the same one as the service pods
>
>stevekuznetsov [9:32 AM]  
hrm
>
>Katharine [9:32 AM]  
gimme a bit  
trying to get it back up first
>
>stevekuznetsov [9:32 AM]  
ok  
i will look at the code
>
>Katharine [9:34 AM]  
okay, manual rollback worked
>
>stevekuznetsov [9:35 AM]  
how were you specifying which namespace would go with the trusted namespace?  
it's not clear to me how you schedule pods explicitly to something other than `config.PodNamespace`
>
>Katharine [9:37 AM]  
I think we're going to have to do something to recover all the jobs that got wedged. Or just tell people to retest I guess.
>
>Katharine [9:39 AM]  
the stuck jobs are recovering on their own, so that's good.
>
>stevekuznetsov [9:41 AM]  
yeah as long as the ProwJob record exists  
`plank` will recover it
>
>Katharine [9:41 AM]  
I thought so, but they were being slow and I was mildly concerned
>
>stevekuznetsov [9:41 AM]  
ack
>
>Katharine [10:01 AM]  
verified: for both the trusted cluster and the build cluster, all the plank-created pods got dropped in the `default` namespace instead of `test-pods`  
(they end up in the correct cluster though)
>
>stevekuznetsov [10:31 AM]  
what is your PodNamespace in the config?
>
>Katharine [10:32 AM]  
`test-pods`: https://github.com/kubernetes/test-infra/blob/master/prow/config.yaml#L41
>
>stevekuznetsov [10:32 AM]  
very interesting  
oh  
lmao  
:facepalm:  
https://github.com/kubernetes/test-infra/pull/11130  
now  
that will schedule to `test-pods`  
but  
did you have any clusters with more than one NS?  
or is it always `test-pods` across all clusters  
>
>Katharine [10:35 AM]  
I'm unsure. To my knowledge it's just `test-pods`, but I may be wrong.
>
>stevekuznetsov [10:36 AM]  
then that PR should fix you
>
>Katharine [11:09 AM]  
I am still attempting to figure out why the branchprotector failed to drop required statuses
>
>Katharine [11:17 AM]  
upon investigation, it looks like an old branchprotector ended up running again, though I'm not clear how or why.
>
>Katharine [11:27 AM]  
I've now re-updated the cronjob, suspended it, manually invoked branchprotector, and verified the correct version is running. Assuming it does something reasonable, we can try this exciting adventure again at 3pm.  
(which is roughly when it will finish)

[Katharine Berry]: https://github.com/Katharine
[a typo]: https://github.com/stevekuznetsov/test-infra/commit/d6078ec1b6ac6b79abaf7b1491a386ffb6cc93c2#diff-c3c219515b589f861f234b9c4074b769R122
[prow.k8s.io]: https://prow.k8s.io
[#11134]: https://github.com/kubernetes/test-infra/pull/11134