# March 26th, 2019 Boskos Outage

All times in PDT (UTC-7).

## Summary

The Boskos instance responsible for vending GCP projects to test jobs got stuck in a crash loop
due to a hidden bug that was unexpectedly deployed at night. Consequently, all jobs that needed
a project from boskos (including a required job for all kubernetes/kubernetes PRs) failed for
just over nine hours, while on-call was sleeping.

## Impact

All jobs depending on boskos failed between 2019-03-25 23:04 and 2019-03-26 08:12. As a result,
it was impossible to merge any PR in kubernetes/kubernetes during those nine hours, because
`pull-kubernetes-e2e-gce` would always fail. There were similar effects on other projects during
this time period. On `pull-kubernetes-e2e-gce` alone, 70 failures were caused by this outage.

## Root Cause

A [bug was introduced](https://github.com/kubernetes/test-infra/commit/02110870507abdeb93dfd6205c70f82145f8993c#r32917192)
in a commit that landed last December; once deployed, it would leave Boskos unable to start up.
Nobody actually attempted to deploy Boskos until last week (2019-03-20). However, while
the StatefulSet that runs Boskos was successfully updated, the actual pod was not replaced due to
the lack of any configuration to do so, and nobody noticed the lack of any functional change.
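
For context, whether a StatefulSet replaces its pods when its spec changes is controlled by
`spec.updateStrategy`: the behavior we saw (spec updated, running pod untouched) is what
`OnDelete` does, while `RollingUpdate` replaces pods automatically. The manifest below is only a
minimal sketch to illustrate the two modes; apart from the image tag and namespace that appear
elsewhere in this document, its names and fields are illustrative rather than the actual Boskos
configuration.

```yaml
# Minimal illustrative StatefulSet, not the real Boskos manifest.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: boskos
  namespace: test-pods
spec:
  serviceName: boskos
  replicas: 1
  selector:
    matchLabels:
      app: boskos
  updateStrategy:
    type: RollingUpdate  # pods are replaced automatically when the pod template changes
    # type: OnDelete     # spec updates take effect only once a pod is deleted or rescheduled
  template:
    metadata:
      labels:
        app: boskos
    spec:
      containers:
      - name: boskos
        image: gcr.io/k8s-testimages/boskos:v20180405-12e892d69  # tag taken from the incident logs
```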

At around 17:20, several hours before the outage, we started upgrading the Prow clusters from Kubernetes
1.11 to 1.13. The build cluster has 150 nodes, and each update takes five minutes, so the entire
process would take about ten hours to complete. Several hours later, nothing looked amiss, and we
went to bed.

At 23:04 the node running the only Boskos pod was drained to be replaced with a 1.13 node, and boskos
was moved to a new node. At this point it used the StatefulSet's spec, which specified a newer image
version with the bug. As a result, boskos was never able to come up on the new node.
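
The fatal log in the appendix shows what the crash looked like: on startup boskos tries to register
its `resources.boskos.k8s.io` CustomResourceDefinition, and the buggy code submitted a definition
with no version marked as the storage version, which the API server rejects. As a rough sketch only
(the group and plural come straight from the error message, while the kind and singular names are
guesses, and this is not the actual Boskos CRD), a valid definition of that era would look
something like:

```yaml
# Illustrative CRD only; kind/singular are assumptions.
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: resources.boskos.k8s.io
spec:
  group: boskos.k8s.io
  scope: Namespaced
  names:
    plural: resources
    singular: resource
    kind: Resource
  versions:
  - name: v1
    served: true
    storage: true  # exactly one version must be marked as the storage version
```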

## Detection

Users reported issues at 02:50 (via GitHub), 05:16 and 06:27 (both via Slack). While monitoring
[clearly shows the failure](http://velodrome.k8s.io/dashboard/db/boskos-dashboard?orgId=1&from=1553576353087&to=1553615983139),
no alerting caught it.
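
Improving this is tracked in the action items below. As a very rough illustration of the kind of
rule that could have caught a crash-looping boskos, here is a hedged sketch of a Prometheus alert
over kube-state-metrics; it assumes a Prometheus-based stack scraping the build cluster, which is
not necessarily how the monitoring linked above is set up, and the label values are illustrative.

```yaml
# Hedged sketch only; assumes kube-state-metrics is scraped by Prometheus.
groups:
- name: boskos.rules
  rules:
  - alert: BoskosCrashLooping
    expr: rate(kube_pod_container_status_restarts_total{namespace="test-pods", container="boskos"}[15m]) > 0
    for: 15m
    labels:
      severity: critical
    annotations:
      summary: "boskos has been restarting repeatedly in test-pods"
```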

## Resolution

Once the bug was found, a new image was built and deployed at 08:10. Boskos recovered immediately,
and the projects left hanging open were cleaned up over the next few minutes. Tests began passing
shortly thereafter.

## Lessons Learned

### What went well

- Once people were around to deal with it, the problem was resolved in fairly short order.

### What went poorly

- Beginning a ten-hour deploy process shortly after 5pm, when we only have coverage until 6pm, is probably inadvisable
- Deployments should be monitored to verify that they actually occurred (or, if that's not feasible, not performed at all)
- Boskos' StatefulSet is not configured to perform updates, leading to a surprising lack of action
- It appears that the change committed in December was never tested
- Unlike much of Prow, boskos deployments are handled manually, increasing the room for error

### Where we got lucky

We didn't, really.

## Action Items

- Update Boskos' configuration so deployments actually happen as expected ([#11956](https://github.com/kubernetes/test-infra/issues/11956))
- Automate Boskos deploys ([#11957](https://github.com/kubernetes/test-infra/issues/11957))
- Improve alerting around boskos status ([#11953](https://github.com/kubernetes/test-infra/issues/11953))

## Timeline

- [12-05 16:21] Crashloop bug preventing Boskos from updating is introduced
- [03-20 16:46] Boskos StatefulSet spec is updated but the pod is not updated
- [03-25 17:40] Kubernetes node upgrade is initiated on the build cluster
- [03-25 23:04] The node hosting boskos is drained. The replacement pod is stuck in a crash loop. (**outage starts**)
- [03-26 02:50] The outage is reported on GitHub by @mborsz
- [03-26 05:16] The outage is reported on Slack by @chenk008
- [03-26 06:27] The outage is reported on Slack again by @liggitt
- [03-26 07:16] @Katharine wakes up and notices the Slack reports
- [03-26 07:37] @liggitt finds the cause of the failure
- [03-26 08:10] @Katharine successfully deploys a fixed Boskos (**outage ends**)

## Appendix

### #sig-testing chat log

> liggitt [6:27 AM]  
> @Katharine is the "Something went wrong: failed to prepare test environment: --provider=gce boskos failed to acquire project: Post http://boskos.test-pods.svc.cluster.local./acquire?type=gce-project&state=free&dest=busy&owner=pull-kubernetes-e2e-gce" error related to the tot -> snowflake ID change or is that unrelated? (edited) 
> 
> liggitt [6:27 AM]  
> seeing that on all e2e-gce jobs
> 
> Katharine   [07:16 AM]  
> Haven't made the snowflake ID change yet so hope it's not preemptively related
>
>liggitt   [07:17 AM]  
>heh, ok
>
>liggitt   [07:17 AM]  
>"winter is coming"
>
>Katharine   [07:20 AM]  
>That is a lot of failures
>
>Katharine   [07:20 AM]  
>```Post http://boskos.test-pods.svc.cluster.local./acquire?type=gce-project&state=free&dest=busy&owner=pull-kubernetes-e2e-gce: dial tcp 10.63.250.132:80: i/o timeout```
>
>Katharine   [07:21 AM]  
>Looks like boskos died completely shortly before midnight PT
>
>liggitt   [07:23 AM]  
>ah
>
>Katharine   [07:24 AM]  
>Strongly suspect the k8s upgrade we did yesterday evening
>
>Katharine   [07:25 AM]  
>```{"error":"CustomResourceDefinition.apiextensions.k8s.io \"resources.boskos.k8s.io\" is invalid: [spec.versions: Invalid value: []apiextensions.CustomResourceDefinitionVersion(nil): must have exactly one version marked as storage version, status.storedVersions: Invalid value: []string(nil): must have at least one stored version]","level":"fatal","msg":"unable to create a CRD client","time":"2019-03-26T09:44:13Z"}```
>
>liggitt   [07:26 AM]  
>did the CRD manifest just change?
>
>Katharine   [3 hours ago]  
>Not that I know of but there have been a few boskos-adjacent changes lately. Plus a k8s upgrade from 1.11 to 1.13
>
>Katharine   [3 hours ago]  
>I've gotta go for a few minutes
>
>Katharine   [3 hours ago]  
>Also -> #testing-ops

### #testing-ops chat log

>Katharine [7:22 AM]  
>Boskos seems to have gone down  
Looks like it's complaining about CRDs  
Possibly maybe related to that k8s upgrade?  
Timing is off though  
```{"error":"CustomResourceDefinition.apiextensions.k8s.io \"resources.boskos.k8s.io\" is invalid: [spec.versions: Invalid value: []apiextensions.CustomResourceDefinitionVersion(nil): must have exactly one version marked as storage version, status.storedVersions: Invalid value: []string(nil): must have at least one stored version]","level":"fatal","msg":"unable to create a CRD client","time":"2019-03-26T09:44:13Z"}```  
Be right back
>
>liggitt [7:28 AM]  
joined #testing-ops along with Draven.
>
>liggitt [7:37 AM]  
looks like https://github.com/kubernetes/test-infra/commit/02110870507abdeb93dfd6205c70f82145f8993c#r32917192
>
>Katharine [7:39 AM]  
That took a remarkably long time to show up
>
>liggitt [7:40 AM]  
wouldn't have noticed until it got deployed on a new cluster with the CRD missing (edited) 
>
>Katharine [7:40 AM]  
yup
>
>liggitt [7:40 AM]  
clearly, we need CI CI :-D
>
>chenk008 [7:41 AM]  
joined #testing-ops.
>
>dims [7:41 AM]  
turtles all the way down :slightly_smiling_face:
>
>liggitt [7:49 AM]  
I opened https://github.com/kubernetes/test-infra/pull/11947  
I'm not sure what else is required to push a new version of boskos live, so feel free to close that if there's a more complete fix needed
>
>Katharine [7:52 AM]  
I'm working on it  
unfortunately I'm working on it from a car with a laptop at 5% battery and no physical gnubby so it's an adventure  
(TouchID saves the day, mostly)
>
>liggitt [7:53 AM]  
:car: :fearful: :battery:
>
>Katharine [7:57 AM]  
Made it to the office!
>
>Katharine [8:10 AM]  
we should have a boskos  
yup  
hopefully it recovers  
looks like the janitors are cleaning up the mess and we should be back soon enough  
...meanwhile I'm going to commit these changes.  
Thanks for finding the bug @liggitt!  
https://github.com/kubernetes/test-infra/pull/11949 could use a [retroactive] lgtm from anyone who feels like it
>
>dims [8:20 AM]  
:lgtm:
>
>Katharine   [8:21 AM]  
Well that was an uncomfortably exciting morning.

### #testing-ops postmortem thread

>fejta   [8:23 AM]  
Is this related to the cluster upgrade??
>
>Katharine   [8:23 AM]  
My best guess is yes
>
>Katharine   [8:24 AM]  
There was a boskos bug introduced in December that didn't matter until 11pm last night
>
>liggitt   [8:24 AM]  
affected creation of the boskos CRD in new clusters (edited)
>
>liggitt   [8:24 AM]  
and apparently there wasn't a new cluster between December and last night
>
>fejta   [8:40 AM]  
what do you mean by new cluster?
>
>fejta   [8:40 AM]  
I guess upgrading a cluster constitutes a new cluster?
>
>liggitt   [8:40 AM]  
a cluster which did not already have the boskos CRD installed (edited)
>
>fejta   [8:40 AM]  
I upgraded the service and build clusters yesterday, but didn't create any
>
>liggitt   [8:41 AM]  
actually, looks like it would have failed even on existing clusters
>
>liggitt   [8:41 AM]  
so we were using the pre-december version of boskos prior to yesterday?
>
>fejta   [8:41 AM]  
um, no?
>
>Katharine   [8:42 AM]  
No, we were using one from last week
>
>fejta   [8:42 AM]  
The only thing I did was click the upgrade cluster button on the GKE cloud console
>
>Katharine   [8:42 AM]  
And we were still using it until I changed it this morning
>
>fejta   [8:42 AM]  
(also the cluster's nodepools) (edited)
>
>Katharine   [8:42 AM]  
It looks like it got spun up on another node when the one it was on was drained and then crash looped forever
>
>fejta   [8:43 AM]  
maybe something changed between 1.11 and 1.13?
>
>fejta   [8:43 AM]  
in how a bad CRD is handled?
>
>liggitt   [8:44 AM]  
no... the thing it was trying to create would never have been valid
>
>liggitt   [8:44 AM]  
so https://github.com/kubernetes/test-infra/pull/11576/files#diff-7eea92bcdf0c3e4ba5ad151d80562646L89 bumped boskos last week
>
>liggitt   [8:44 AM]  
from a version that would have worked (pre-december) to a version that did not work
>
>liggitt   [8:44 AM]  
and that new version was live and effective?
>
>liggitt   [8:45 AM]  
(I don't know how those manifest changes get rolled out)
>
>fejta   [8:45 AM]  
ideally we should make them auto-deploy like the rest of prow on merge
>
>Katharine   [8:45 AM]  
In boskos' case, manually
>
>liggitt   [8:45 AM]  
is it possible an old working version (boskos:v20180405-12e892d69) was still running until the nodes got drained last night as part of the upgrade? (edited)
>
>liggitt   [8:45 AM]  
ah
>
>fejta   [8:46 AM]  
what would happen if we applied the change last week that was bad?
>
>Katharine   [8:46 AM]  
It's possible someone tried to bump it but it wasn't actually bumped and just kept running an old version
>
>Katharine   [8:46 AM]  
Then when the one pod it has was drained anyway it tried bringing up a new one
>
>fejta   [8:47 AM]  
Is the fact that boskos is a statefulset relevant?
>
>fejta   [8:47 AM]  
(why is it a statefulset? how does applying changes to a statefulset differ from a deployment?)
>
>Katharine   [9:09 AM]  
Checked the audit logs, pod was indeed still running `gcr.io/k8s-testimages/boskos:v20180405-12e892d69` until yesterday despite the statefulset being changed.
>
>fejta   [9:11 AM]  
IMO we should switch it to a deployment
>
>fejta   [9:11 AM]  
I don't think it is stateful so much as a deployment with a PVC
>
>Katharine   [9:12 AM]  
That is also what it looks like to me but Seb probably has the context there
>
>Katharine   [9:17 AM]  
Audit logs also indicate that we tried to update it last week, but nothing happened
>
>Katharine   [9:18 AM]  
so that completes the story