# May 30th, 2019 Prow Flakiness

Created by: [Katharine Berry](https://github.com/Katharine)

All times PDT.

## Summary

Increasing test load over time has led to increased use of I/O resources by the k8s-prow-build
cluster. In the run-up to the Kubernetes 1.15 code freeze, a spike in usage led to hitting hard
throttling limits, causing I/O operations to take so long that nodes experienced kernel panics
and test code wildly misbehaved.

## Impact

Many tests were failing in surprising ways, or being aborted partway through execution for no clear
reason. Artifacts were frequently not uploaded, causing Spyglass and Testgrid to be unable to see
that the job had ever finished running. The problem lasted for at least seven hours. Due to PRs
failing to merge as expected, the 1.15 code freeze slipped a day.

Due to an unrelated monitoring failure, it is difficult to tell exactly when the problem first
became substantial. At peak, 60 PRs were stuck in the merge queue, and no PRs at all had merged for many hours.

## Root Cause

Prow jobs are run in a 100-node cluster. Each node has eight CPUs, 52 GB of RAM, and 250 GB of disk
space. On GCP, IOPS are limited proportionally to disk size. With our 250 GB disks, that gave us
187.5 sustained read IOPS per instance. While the majority of our nodes are not touching the disk
at all most of the time, nodes under certain workloads would spike up to large multiples of that:

![](./images/2019-05-30/read-iops-12h.png)

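For reference, the limits above follow directly from how standard persistent disk I/O scales with provisioned size. A back-of-the-envelope sketch, assuming GCP's published pd-standard ratios at the time (roughly 0.75 sustained read IOPS and 1.5 sustained write IOPS per provisioned GB, which match the figures quoted in the appendix):

```
# Rough sketch only: the per-GB ratios are assumed from GCP's published
# pd-standard limits at the time and may differ today.
for size_gb in 250 500; do
  awk -v gb="$size_gb" 'BEGIN {
    printf "%d GB: %.1f read IOPS, %.1f write IOPS\n", gb, gb * 0.75, gb * 1.5
  }'
done
# 250 GB: 187.5 read IOPS, 375.0 write IOPS
# 500 GB: 375.0 read IOPS, 750.0 write IOPS
```

Doubling the disk size therefore doubles the per-instance I/O ceiling without otherwise changing the nodes, which is what the resolution below relies on.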
The affected nodes suffered assorted failure modes, which ranged from pods failing, to kubelet
crashes, to kernel panics. While only a minority of nodes were affected at any given time, the
large number of pods that must complete successfully for a PR to merge effectively brought us to
a standstill.

## Detection

Multiple users reported jobs that GitHub showed as having failed, but which, upon checking the
job status, appeared to still be running. This behaviour is caused by pods exiting without
successfully uploading `finished.json`, which most of our tooling uses to determine job state.

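To illustrate why these jobs looked permanently "running" (the bucket and path below are placeholders, and `gsutil` merely stands in for the real tooling): a job directory that has `started.json` but no `finished.json` is indistinguishable from a job that is still in progress.

```
# Hypothetical example: JOB_DIR is a placeholder, not a real job path.
JOB_DIR="gs://example-bucket/logs/some-job/12345"
if gsutil -q stat "${JOB_DIR}/started.json" && ! gsutil -q stat "${JOB_DIR}/finished.json"; then
  # Anything keying off finished.json treats this job as still running,
  # even if the pod already exited without uploading its artifacts.
  echo "job appears to still be running"
fi
```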
Due to an unrelated failure, most Prow/Tide monitoring data for the week leading up to this
incident was permanently lost, making it difficult to establish how long the cluster was faulty.

## Resolution

We requested and received quota for additional disk space and migrated the cluster from 250 GB
disks to 500 GB disks, with a corresponding increase in available IOPS. The cluster became
healthier once the migration was completed.

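For context, the mitigation boiled down to creating a node pool with larger boot disks and shifting capacity onto it. A rough sketch of the kind of command involved (the pool name and machine type are illustrative, not the exact invocation used; cluster and zone are taken from the appendix logs):

```
# Illustrative only: pool name and machine type are placeholders.
gcloud container node-pools create pool-500gb \
  --cluster=prow --zone=us-central1-f \
  --machine-type=n1-highmem-8 \
  --disk-size=500GB \
  --num-nodes=55
```

The remaining capacity was then gradually drained from the old 250 GB pool and moved over (see the `shift_nodepool_capacity.py` script linked in the appendix), which is why the migration ran into the night.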
## Lessons Learned

### What went well

Not much.

### What went poorly

* Prow metrics were not being recorded in the run-up to the incident, making it difficult to
  figure out what was going on.
* Alerting for these failures either does not exist or does not work.
* Once the cause had been determined and a fix devised, nothing was done for another two hours.
* The incident caused the 1.15 code freeze to slip.

### Where we got lucky

* Someone capable of recognising that IOPS were the problem happened to be around to take a look.

## Action items

* Create and/or fix monitoring so that the oncall gets paged when Tide stops merging PRs
  ([#12829](https://github.com/kubernetes/test-infra/issues/12829))
* Establish a process and guidelines for adding new members to the oncall rotation so they are not
  thrown in the deep end without any preparation or extra support.

## Timeline

* [???] IOPS reached a critical point such that cluster availability began degrading
* [13:26] @cblecker reports on Slack that something is wrong with Prow
* [14:11] @Katharine determines that many things are wrong with the Kubernetes nodes and the test pods
* [14:17] @BenTheElder loops in @Random-Liu
* [14:43] @Random-Liu determines the root cause is severe I/O throttling
* [15:02] The current oncaller is given directions on increasing capacity and requests additional disk quota
* [15:14] Additional disk quota is granted
* [17:10] @Katharine realises that capacity was not increased and brings up 55 nodes with 500 GB disks
* [17:41] @BenTheElder begins gradually migrating nodes to the 500 GB node pool
* [23:55] @BenTheElder completes the migration to the new node pool, all nodes now have 500 GB disks
* [07:00 the next day] Prow is once again merging at its usual rate

## Appendix

### #testing-ops chat log

>cblecker [1:26 PM]  
@test-infra-oncall there's something broken. I'm seeing failures being reported to GitHub, but Deck says jobs are still running. Seen this on a couple PRs now
>
>tony-yang [1:27 PM]  
Thanks, will take a look
>
>tony-yang [1:50 PM]  
@cblecker we don't see any significant red flag at the moment. Unfortunately, the job you provided is too old for us to obtain any useful information. It seems that there might be a transient issue with the job that uploads the finish state to Prow.  
If you have a more recent job that ended in similar state (within the last 15 min), we can take a look  
Thanks  
>
>cblecker [1:55 PM]  
@tony-yang sounds good. if I see it again, will let you know
>
>Katharine [1:57 PM]  
I can see a big burst of failures 1-3 hours ago  
Can't find anything more recent  
ooh I found one I think  
>    ```
>    State:          Terminated
>      Exit Code:    0
>      Started:      Mon, 01 Jan 0001 00:00:00 +0000
>      Finished:     Mon, 01 Jan 0001 00:00:00 +0000
>    ```
>what even is metadata (edited) 
>
>cblecker [2:07 PM]  
haha.. well that explains why Deck is all ¯\\\_(ツ)\_/¯
>
>Katharine [2:11 PM]  
upon further investigation, there are a bunch of pods and nodes in very questionable states  
Kubernetes is happening  
And it looks like pods are being evicted by something that is not us  
>
>cblecker [2:12 PM]  
oh that's always fun
>
>Katharine [2:13 PM]  
Also not super convinced that this one job pod I see that it has tried to kill 2,420 times over the last four days is doing what it's supposed to.  
>
>Katharine [2:46 PM]  
everything is on fiiiiire  
:fire:  
(we've managed to exhaust our IOPS quota... by a factor of two.)  
stuff should probably still be mostly working actually.  
but flakes abound
>
>tony-yang [2:47 PM]  
We noticed that we have used 2x disk IO, which caused the slow down we were experiencing earlier. We are actively working to on an upgrade plan and will notify the community when we will start the maintenance. Thanks
>
>cblecker [3:34 PM]  
Thank you both :pray:
>
>alejandrox1 [4:04 PM]  
hi all, is tide working?  
i’ve seen a couple prs that have been lgtm’d, approved, and all tests passed (even part of the v1.15 milestone) and they haven’t been merged in 12+ hrs
>
>liggitt [4:05 PM]  
Last merge was 5 hours ago. https://github.com/kubernetes/kubernetes/commits/master  
The queue is long
>
>liggitt [4:05 PM]  
https://prow.k8s.io/tide
>
>liggitt [4:05 PM]  
and CI has been very red, which means it drains slowly
>
>cjwagner [4:09 PM]  
Yeah AFAICT Tide is working properly, the problem is that the nodes are on fire. See the batch jobs: https://prow.k8s.io/?repo=kubernetes%2Fkubernetes&type=batch
>
>cblecker [5:07 PM]  
@tony-yang @Katharine can I get an update on the health of things? It looks still very unhealthy from the outside
>
>Katharine [5:07 PM]  
things are worse, tony disappeared somewhere, I'm working on it now
>
>cblecker [5:08 PM]  
okay thank you. good luck. let me know if there is anything help a non-googler can provide
>
>Katharine [5:10 PM]  
I'm bringing up 80 nodes with more disk space, which is what I can pull off given our quota headroom. hopefully this alleviates our current issue
>
>cjwagner [5:13 PM]  
Anything I can do to help Katharine? Is bumping disk capacity our only mitigation?
>
>Katharine [5:14 PM]  
It looks like most/all of our problems basically come back to I/O calls variously failing or being slow enough to look like they failed  
more disk capacity and, really, the associated I/O capacity should bring us back below GCP aggressively throttling us  
(Additionally we'll have almost twice as many nodes until we clean up, so that probably also helps)
>
>cjwagner [5:15 PM]  
We could set a global `max_concurrency` for plank to throttle the overall job load on the cluster.
>
>Katharine [5:17 PM]  
and by 80 I mean 55 because I forgot what the original quota request we made was for  
anyway, 55 now, then I'll start shifting over capacity from the existing nodepool until we get back up to 100  
(so we'll have 155 nodes in the interim)  
new nodes appear to be up
>
>cjwagner [5:19 PM]  
Cool. Lets hold off on a global throttle then and see if the additional nodes are sufficient.  
Oh, sweet. That was fast. (edited) 
>
>liggitt [5:20 PM]  
sometimes the cloud is cool
>
>cblecker [5:20 PM]  
don't lie. computers are awful.
>
>Katharine [5:21 PM]  
jobs are now being scheduled to them
>
>liggitt [5:21 PM]  
"I do exactly what you tell me to. You told me to burn my CPUs into a slag heap repeatedly running CI jobs. Yes sir!"
>
>Katharine [5:21 PM]  
I'll be back shortly  
for now, working theory is that the extra 55 nodes should reduce pressure such that everything returns to normal shortly without needing to actually migrate capacity to the new nodepool so we'll probably just leave it like this for a bit unless behaviour fails to improve
>
>tony-yang [5:26 PM]  
Thanks Katharine for helping out
>
>liggitt [5:28 PM]  
>>liggitt [5:26 PM]
shoutout to the #testing-ops folks who cheerfully keep our *insane* CI load moving forward during code freeze crunches  
Posted in #shoutouts
>
>bentheelder [5:39 PM]  
joined #testing-ops.
>
>bentheelder [5:41 PM]  
I'm migrating some capacity from the smaller-disk node pool to the bigger one. (draining the old nodes -> increasing the new pool), will be a lot slower but we'll start to standardize on the bigger disks.
>
>Thankfully most of the time we're already doing _okay_ on I/O for the lower capacity, but not with consistently comfortable enough overhead...  
This _should_ resolve a number of intermittent unexplained issues. Thanks @random-liu for root-causing this :slightly_smiling_face:
>
>Katharine [7:04 PM]  
![Throttled Disk Ops](./images/2019-05-30/throttled-disk-ops.png)  
>
>it looks like this did get a lot worse over the last day or two.  
(I like how that doesn't have a key. green is throttled writes, red is throttled reads.)
>
>liggitt [7:14 PM]  
is most of that on the scale jobs?  
or do we have any way to slice by job
>
>Katharine [8:08 PM]  
@liggitt I don't think we have any easy way to slice that by job - our best is probably nodes and times (edited) 
>
>cblecker [8:09 PM]  
The 5k node job is still pretty unhealthy: https://prow.k8s.io/?repo=kubernetes%2Fkubernetes&job=pull-kubernetes-kubemark-e2e-gce-big
>
>Katharine [8:10 PM]  
That is should only be counting the build cluster though, so unless the scale jobs are doing consistently high writes in that build for hours, it seems unlikely to me that it's the sole cause.  
Hm
>
>cblecker [8:11 PM]  
the ones that are showing as "pending" don't appear to be running if you look at some the logs  
and I think it's hit max concurrency as all the newer jobs are just sitting in triggers/aborted
>
>Katharine [8:14 PM]  
max_concurrency on that job is 12  
there are 12 running.  
so it's not sad, just behaving as expected  
the oldest running one started an hour ago and it takes 1-3 hours (edited)   
so none of this seems unusual  
at the earliest the current backlog on that job cannot be cleared for another probably six or seven hours  
>
>cblecker [8:17 PM]  
hmm.. I must have clicked the wrong link or something because I swear I saw logs that show it finished but was still in a pending state.. but I don't see it now (things look just as you say)  
thanks for checking
>
>bentheelder [11:55 PM]  
done migrating. we've standardized on the bigger disks now  

### GKE EngProd Chat Log

>Benjamin Elder [2:17 PM]  
>FYI poking node team about k8s-prow-builds issues, we have some unhappy nodes in the containerd pool. Looks like kubelet <-> CRI RPC issues (?)  
@Katharine Berry is also looking, we have pods that aren't being killed successfully and lots of testgrid entries with partial uploads. Getting spurious testgrid flake alerts due to jobs not uploading.
>
>Benjamin Elder [2:22 PM]  
>Lantao is looking at these two:
>```
>gke-prow-containerd-pool-99179761-02q6   NotReady   <none>   14d     v1.13.5-gke.10
>gke-prow-containerd-pool-99179761-313j   NotReady   <none>   14d     v1.13.5-gke.10
>```
>sounds like there's some kind of deadlock deleting pods sometimes  
@Tony Yang
>
>Cole Wagner [2:37 PM]  
The set of nodes in the NotReady state is changing.
>```
>$ kg nodes | grep -v "\bReady\b"
>NAME                                     STATUS     ROLES    AGE     VERSION
>gke-prow-containerd-pool-99179761-02q6   NotReady   <none>   14d     v1.13.5-gke.10
>gke-prow-containerd-pool-99179761-4377   NotReady   <none>   14d     v1.13.5-gke.10
>gke-prow-containerd-pool-99179761-t05g   NotReady   <none>   13d     v1.13.5-gke.10
>$ kg nodes | grep -v "\bReady\b"
>Cluster: gke_k8s-prow-builds_us-central1-f_prow
>NAME                                     STATUS     ROLES    AGE     VERSION
>gke-prow-containerd-pool-99179761-02q6   NotReady   <none>   14d     v1.13.5-gke.10
>gke-prow-containerd-pool-99179761-s4k7   NotReady   <none>   162d    v1.13.5-gke.10
>```
>
>Benjamin Elder [2:37 PM]  
yeah, talking to lantao offfline, TLDR:
>- we're seeing kernel panics (yay!)
>- we're getting absurd I/O throttling because we do too much disk I/O and GCE is throttling us (glares at bazel)
>- probably we can increase the disk sizes to avoid this
>
>Benjamin Elder [2:43 PM]  
Estimated performance
>
>| Operation type | Read | Write |
>| --- | --- | --- |
>| Sustained random IOPS limit | 187.50 | 375.00 |
>| Sustained throughput limit (MB/s) | 30.00 | 30.00 |
>
>whereas we're actually hitting 80MB/s and > 450 IOPS when we're getting throttled
>
>Katharine Berry [2:47 PM]  
other weirdness: pods that have been in the PodInitializing state for half an hour, kubelets that are in reboot loops
>
>Benjamin Elder [3:02 PM]  
@Tony Yang https://github.com/kubernetes/test-infra/blob/master/experiment/maintenance/shift_nodepool_capacity.py
>
>Sen Lu [3:10 PM]  
sounds like this gonna be one of those very exciting oncall rotation
>
>Benjamin Elder [3:11 PM]  
probably not, this issue has been going on for a long time  
this is what was previously written off as COS / containerd issues but is actually us using ridiculously high amounts of I/O vs our capacity.  
it sounds like some of the other prows have this issue too
>
>Sen Lu [3:14 PM]  
yes we've been making larger nodes for another prow as well  
didn't know it's hitting k8s-prow as well
>
>Benjamin Elder [9:03 PM]  
Katharine span up 55 nodes in a pool with 2x (500GB) disks, I'm waiting on the migrate script to move over the rest of the capacity still, we've crossed the majority of nodes being the 500GB type.  
more in #testing-ops on slack
>
>Katharine Berry [9:03 PM]  
Looking at prow, it looks good at a glance  
We should probably work on a postmortem tomorrow  
Especially given this caused code freeze to slip  
>
>Benjamin Elder [11:55 PM]  
migrating capacity is done