# February 8th, 2019 - Prow responding slow & Partial outage

Created By: [Sen Lu]  
Last Modified: 2019-02-08

## Summary

The [K8s Prow instance](https://prow.k8s.io) was scheduling test jobs onto the Prow service cluster,
which caused those jobs to never finish (they could not start because of missing credentials) and caused
the service cluster to run out of resources.

## Impact

Jobs scheduled onto the service cluster were unable to start due to missing volumes (between
1:46pm and 5:30pm).

All other Prow controllers became very slow and effectively unusable (~5 min delay for a response from hook)
between 4:06pm (first notice) and 5:30pm.


## Root Cause

Our build-cluster secret literally defines a "default" cluster alias that points to our build cluster.
The new client refactor does not respect that entry and overwrites it with its own default, the service
cluster, so Prow scheduled all workloads into the service cluster.

See https://github.com/kubernetes/test-infra/pull/11209 for more context.
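
As an illustration of what that literal alias looks like, here is a sketch of dumping the cluster mapping
from the `build-cluster` secret (the secret name appears in the chat log below, but the data key and the
exact YAML layout are assumptions and may not match the real secret):

```
$ kubectl --context=prow get secret build-cluster -o 'jsonpath={.data.cluster}' | base64 --decode
# hypothetical contents: a map of cluster alias -> credentials, where the
# literal "default" entry points at the *build* cluster. The refactored client
# overwrote this entry with the in-cluster (service cluster) client.
default:
  endpoint: https://<build-cluster-apiserver>
  clusterCaCertificate: <base64 CA>
  clientCertificate: <base64 client cert>
  clientKey: <base64 client key>
test-infra-trusted:
  endpoint: https://<trusted-cluster-apiserver>
  clusterCaCertificate: <base64 CA>
  clientCertificate: <base64 client cert>
  clientKey: <base64 client key>
```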

<!-- TODO(krzyzacy): better understand why this is not deterministic
  https://github.com/kubernetes/test-infra/pull/11263#issuecomment-463312588
 -->

## Detection

https://github.com/kubernetes/test-infra/pull/11200#issuecomment-461988355

When @krzyzacy was poking around a PR, suddenly cats stopped working! @krzyzacy checked
githubstatus.com, where everything looked fine, then poked around on Slack and found out that a [prow
node was running at > 100% CPU and Memory](https://kubernetes.slack.com/archives/C7J9RP96G/p1549671585603700).

@krzyzacy then poked the `test-pods` namespace and found ~600 pending pods, stuck because they could not
find the right secret.
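
A rough sketch of how to reproduce that count (the exact command used at the time was not recorded;
`prow` is the kubectl context for the service cluster, as in the chat log below):

```
# count pods in the service cluster's test-pods namespace that are stuck in Pending
$ kubectl --context=prow get pods -n test-pods --field-selector=status.phase=Pending --no-headers | wc -l
600
```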

## Resolution

We rolled Prow back to a version from before the problematic PR.
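
The roll-back had to be applied by hand (see the timeline below). A minimal sketch, assuming the
component manifests in `prow/cluster` had already been reverted to the earlier image tags (file names
are illustrative):

```
$ kubectl --context=prow apply -f prow/cluster/plank_deployment.yaml
$ kubectl --context=prow apply -f prow/cluster/hook_deployment.yaml
$ kubectl --context=prow apply -f prow/cluster/sinker_deployment.yaml
```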

## Lessons learned

### What went well

- People were still around, and we actively identified and mitigated the issue

### What went poorly

- It's AGAIN Friday afternoon

- The behavior is not deterministic

- We really need a real staging Prow :-) We usually rely on signals like "hey, the OpenShift Prow runs
  fine, so bumping the k8s Prow should be safe", but each Prow has a different setup (for example, how
  the build cluster is defined), so ideally the k8s Prow should have a very similar staging area.

### Where we got lucky

- It's Friday afternoon... but oncall noticed the problem first and we mitigated the issue before EoD

## Action items

- We need to be more careful when deploying sweeping Prow changes; more e2e tests or a real staging
  Prow would give us more confidence.

## Timeline

All times in PST, on February 8th 2019

* [13:20] Prow's auto-generated bump PR was merged (https://github.com/kubernetes/test-infra/pull/11172), triggering a deploy.
          Prow started to schedule workloads onto the service cluster.
* [16:04] @krzyzacy noticed Prow was responding very slowly and suspected a GitHub issue again, but GitHub reported all normal
* [16:19] @krzyzacy found that CPU/memory usage on the prow service nodes was abnormally high
* [16:21] @krzyzacy found hundreds of pending pods in the service cluster that should have been in our build cluster;
          the GKE console showed 225 unschedulable pods
* [16:31] Other people, like @cblecker, started noticing that Prow had network issues
* [16:40] @krzyzacy rolled Prow back to `v20190206-b01b288`, which is from before the Prow bump;
          in the meantime @Katharine cleaned up all the pods & prowjobs in the service cluster
* [16:44] @krzyzacy figured out we needed a manual deployment but was missing permission to do so; filed https://github.com/kubernetes/test-infra/issues/11206
* [16:50] @krzyzacy deployed the revert manually via kubectl apply
* [17:07] @krzyzacy found the issue persisted, and was confused
* [17:15] @stevekuznetsov asked `does your cluster yaml literally define a "default" cluster alias?` and figured out the issue
* [17:25] @stevekuznetsov sent out a fix as https://github.com/kubernetes/test-infra/pull/11209;
          @krzyzacy rolled Prow back to `v20190125-2aca69d`, which is from before the bug was introduced
* [17:59] @krzyzacy cleaned up all the misplaced pods and prowjobs again, and confirmed Prow was healthy
* [18:09] @krzyzacy notified the k-dev mailing list

    94  ## Appendix 
    95  
    96  ### #testing-ops chat logs
    97  
    98  > krzyzacy [1:46 PM]
    99  we are on `v20190208-c35c7e3` now
   100  `label-sync-1549660620-8nsx4             0/1       OOMKilled   0          29m` /shrug? :thisisfineparrot:
   101  
   102  > krzyzacy [4:06 PM]
   103  anyone seeing webhook issues? @stevekuznetsov @munnerz
   104  or actually - `@krzyzacy: https://thecatapi.com appears to be down` :disappointed:
   105  
   106  > Katharine [4:09 PM]
   107  I am generally having issues accessing prow.k8s.io (edited) 
   108  
   109  > krzyzacy [4:15 PM]
   110  ```
   111  time="2019-02-09T00:11:37Z" level=warning msg="Metric families 'name:\"go_memstats_sys_bytes\" help:\"Number of bytes obtained from system.\" type:GAUGE metric:<label:<name:\"instance\" value:\"hook-d5bd467f9-vr6hk\" > label:<name:\"job\" value:\"hook\" > gauge:<value:1.42801144e+08 > > ' and 'name:\"go_memstats_sys_bytes\" help:\"Number of bytes obtained by system. Sum of all system allocations.\" type:GAUGE metric:<label:<name:\"instance\" value:\"ghproxy-79c4f7695b-zmnhx\" > label:<name:\"job\" value:\"ghproxy\" > gauge:<value:5.33741372e+09 > > metric:<label:<name:\"instance\" value:\"plank-7d5d4bdb65-c2475\" > label:<name:\"job\" value:\"plank\" > gauge:<value:1.056411896e+09 > > metric:<label:<name:\"instance\" value:\"plank-7854db4c49-h9zpn\" > label:<name:\"job\" value:\"plank\" > gauge:<value:1.265917176e+09 > > ' are inconsistent, help and type of the latter will have priority. This is bad. Fix your pushed metrics!" source="diskmetricstore.go:114"
   112  ```
   113  
   114  what does that mean :confused: (edited) 
   115  
   116  > krzyzacy [4:19 PM]
   117  `gke-prow-default-pool-116f5d3e-bl6h   4022m        102%      13681Mi         110%` that's also worrisome...
   118  
   119  > krzyzacy [4:21 PM]
   120  there are hundreds of pods in the service cluster?
   121  
   122  > cblecker [4:31 PM]
   123  replied to a thread:
I'm seeing jobs that are failing on network stuff, failing to upload logs to GCS, and bot commands that aren't responding at all.
   125  
   126  > bentheelder [4:37 PM]
   127  days since prow k8s client bugs: 0
   128  
   129  > krzyzacy [4:40 PM]
   130  https://github.com/kubernetes/test-infra/pull/11205
   131  krzyzacy
   132  #11205 Revert "Bump prow from v20190206-b01b288 to v20190208-c35c7e3"
   133  
   134  > bentheelder [4:42 PM]
   135  also seeing errors doing /lgtm, assigning is failing
   136  
   137  > Katharine [4:42 PM]
   138  I think this is all just fallout from it being crazy overloaded (edited) 
   139  
   140  > Katharine [4:42 PM]
   141  110% is not a reasonable amount of scheduling
   142  
   143  > stevekuznetsov [4:54 PM]
   144  man
   145  deploy friday evening
   146  they said
   147  it will be fine they said
   148  
   149  > bentheelder [4:55 PM]
   150  [:fire: :fire_engine:  ongoing]
   151  
   152  > stevekuznetsov [5:00 PM]
   153  Question by the way @Katharine @bentheelder @krzyzacy are you using cpu/mem requests and limits? _very_ surprised that schduler was able to starve core prow services from doing things -- obviously scheduling to the service cluster is not what you wanted but also...
   154  
   155  > krzyzacy [5:01 PM]
   156  we have limits for build cluster!
   157  
   158  > stevekuznetsov [5:02 PM]
   159  Shouldn't the limits be on the Pods?
   160  Oh you're saying no prow service deployments have limits/requests
   161  
   162  > fejta [5:03 PM]
   163  So we're scheduling all jobs in the service cluster?
   164  
   165  > stevekuznetsov [5:03 PM]
   166  not anymore
   167  but that was the symptom seen
   168  @fejta Sen posted the startup logs for the "good" plank deployment here
   169  krzyzacy
   170  ```
   171  senlu@senlu:~/work/src/k8s.io/test-infra/prow/cluster$ kubectl logs plank-64c7c5cbd4-fx5g6
   172  {"component":"plank","level":"info","msg":"Loading cluster contexts...","time":"2019-02-09T00:50:01Z"}
   173  {"component":"plank","level":"info","msg":"* in-cluster","time":"2019-02-09T00:50:01Z"}
   174  {"component":"plank","level":"info","msg":"* test-infra-trusted","time":"2019-02-09T00:50:01Z"}
   175  {"component":"plank","level":"info","msg":"* default","time":"2019-02-09T00:50:01Z"}
   176  ```
   179  has not posted a "bad" one yet
   180  
> krzyzacy [5:04 PM]
   182  ```
   183  E  {"level":"info","component":"plank","msg":"Loading cluster contexts..."} 
   184  E  {"component":"plank","msg":"* in-cluster","level":"info"} 
   185  E  {"component":"plank","msg":"* k8s-tpu-operator","level":"info"} 
   186  E  {"component":"plank","msg":"* security","level":"info"} 
   187  E  {"component":"plank","msg":"* test-infra-trusted","level":"info"} 
   188  E  {"component":"plank","msg":"* default","level":"info"}
   189  ``` 
   190  is that enough?
   191  
   192  > stevekuznetsov [5:05 PM]
   193  is there a diff between the two?
   194  The "good" one seems to have
   195  
   196  ```
   197  {"component":"plank","level":"info","msg":"* gke_rules-k8s_us-central1-f_testing","time":"2019-02-09T00:50:01Z"}
   198  ```
   199  whereas the bad one does not?
   200  Which is the alias for the service cluster?
   201  
   202  > stevekuznetsov [5:06 PM]
   203  (why is the kubeconfig for the service cluster even in your clusters yaml given to plank to schedule pods????)
   204  
   205  > stevekuznetsov [5:06 PM]
   206  Also I need to jump but if you confirm the SHA bounds that @Katharine posted ... there is no suspicious code change in that deployment AFAICT
   207  
   208  ```
   209  $ git log --no-merges --pretty='%cd | %h | %<(20)%an | %s' --date=short b01b288..c35c7e3 -- prow/ ':!prow/config.yaml' ':!prow/plugins.yaml' ':!prow/cluster'
   210  2019-02-07 | 73ab4a815 | Kyle Weaver          | Deck: add missing org to GitHub links
   211  2019-02-07 | 4bef440ea | Katharine Berry      | Satisfy the TypeScript linter.
   212  2019-02-07 | fabe99ef2 | Katharine Berry      | Update prow/cmd/deck/static/prow/prow.ts
   213  2019-02-07 | f58a0539c | Matthias Bertschy    | Adding PluginHelp.Config for size
   214  2019-02-06 | acdf2fbb3 | Benjamin Elder       | register help for buildifier
   215  2019-02-06 | 9d5466153 | Mikhail Konovalov    | update broken link
   216  2019-01-26 | e9312fed6 | Alvaro Aleman        | Fix spyglass cell generation when using spyglass as primary viewer
   217  ```
   218  
   219  > Katharine [5:07 PM]
   220  We've tracked the start of bad scheduling to three minutes after the last deploy
   221  
   222  > krzyzacy [5:07 PM]
   223  I think the problem persists after I revert back to wednesday
   224  
   225  > stevekuznetsov [5:07 PM]
   226  Do you have API server audit logs for your secrets?
   227  Are you certain your secrets did not change?
   228  
   229  > Katharine [5:08 PM]
   230  (maybe the deployer itself is busted?)
   231  
   232  > stevekuznetsov [5:08 PM]
   233  oh okay so per @fejta the issue is an incorrect _namespace_ in the correct _cluster_
   234  
   235  > Katharine [5:09 PM]
   236  Incorrect jobs in the correct namespace in the wrong cluster
   237  
   238  > stevekuznetsov [5:09 PM]
   239  ah
   240  
   241  > Katharine [5:09 PM]
   242  There are jobs that should go there, but they are not these ones
   243  
   244  > stevekuznetsov [5:09 PM]
   245  can you confirm that your cluster yaml file has the correct alias mapping to the correct cluster
   246  were you also seeing trusted jobs scheduled to the normal build clsuter
   247  was it jobs _regardless of alias_ scheduling to the trusted cluster
   248  or was it only jobs for the normal build alias scheduling to the build cluster
   249  
   250  > Katharine [5:10 PM]
   251  I haven't checked, don't know if anyone else has. Will take a look at what's going on there
   252  
   253  > stevekuznetsov [5:10 PM]
   254  Is it still going on?
   255  (audit logs for the secret holding your cluster yaml would be huge)
   256  
   257  > Katharine [5:11 PM]
   258  apparently yes.
   259  
   260  > krzyzacy [5:12 PM]
   261  hummmm - the one from Wednesday seems to be good?
   262  (but not anymore?)
   263  
   264  > Katharine [5:13 PM]
   265  we were still creating pods in the build cluster after we started also creating them in the service cluster
   266  
   267  > krzyzacy [5:13 PM]
   268  ```
   269  senlu@senlu:~/work/src/k8s.io/test-infra/prow/cluster$ kubectl get secret
   270  NAME                     TYPE                                  DATA      AGE
   271  build-cluster            Opaque                                1         352d
   272  ```
   273  
   274  that didn't got a change
   275  
   276  > fejta [5:13 PM]
   277  I don't think it is that exciting, just a bad release
   278  
   279  > fejta [5:14 PM]
   280  `kubectl --context=prow get pods --namespace=test-pods -o yaml  | grep prow.k8s.io/job`
   281  basically just looks like everything is scheduling in the service cluster now
   282  ```
   283  $ kubectl --context=prow-builds get pods --namespace=test-pods -o yaml  | grep prow.k8s.io/job | wc -l
   284       532
   285  fejta-macbookpro:test-infra fejta$ kubectl --context=prow get pods --namespace=test-pods -o yaml  | grep prow.k8s.io/job | wc -l
   286      1308
   287  ```
   288  
   289  > stevekuznetsov [5:15 PM]
   290  ```
   291  kubectl --context=prow get pods --namespace=test-pods -o "jsonpath={.items[*].spec.job}" | tr ' ' $'\n' | sort | uniq -c | sort -nr
   292  ```
   293  hrm
   294  so it looks like the "default" cluster was used for normal jobs in the past
   295  does your cluster yaml literally define a "default" cluster alias?
   296  
   297  > fejta [5:17 PM]
   298  yes
   299  
   300  > stevekuznetsov [5:18 PM]
   301  I think I see the bug
   302  very interesting
   303  I think this is just not deterministic
   304  but is a bad commit from before
   305  always latent
   306  
   307  > bentheelder [5:19 PM]
   308  replied to a thread:
   309  A lot of these were already cleaned up. Everything in the wrong cluster means nothing running ...?
Plus prow is non-responsive / broken in general because the services cluster is overloaded.
   311  
   312  > krzyzacy [5:21 PM]
   313  hummm
   314  shall we roll back to last week?
   315  
   316  > fejta [5:21 PM]
   317  yes
   318  these client refactors are a dud :stuck_out_tongue:
   319  We really ought not to have refactored every binary at once, and gotten some signal that things work first. Oh well.
   320  We should consider reverting plank to use the old client until we can prove things work correctly for the other binaries.
   321  Not sure how much trouble that will be at this point.
   322  
   323  > stevekuznetsov [5:24 PM]
   324  you want to revert to at least `ca4be59af7dc673b3b5fb3a16711bee21b4e736a`
   325  
   326  > bentheelder [5:24 PM]
   327  aside: this failure mode is odd https://github.com/kubernetes/sig-release/pull/467#issuecomment-461994843 (edited) 
   328  
   329  > stevekuznetsov [5:24 PM]
   330  reverting commit to `plank` should be fine
   331  
   332  > krzyzacy [5:25 PM]
   333  https://github.com/kubernetes/test-infra/pull/11205
   334  krzyzacy
   335  #11205 Revert prow to v20190125-2aca69d
   336  This reverts commit e6a015b.
   337  
   338  /assign @BenTheElder @Katharine
   339  Assignees
   340  Katharine, BenTheElder
   341  Labels
   342  approved, area/config, area/prow, area/prow/bump, cncf-cla: yes, sig/testing, size/M
   344  I believe if we only revert plank, sinker will also be busted
   345  lol I start to type things like `kubectl commit --amend`
   346  
   347  > krzyzacy [5:59 PM]
   348  ```
   349  senlu@senlu:~/work/src/k8s.io/test-infra/prow$ kubectl top nodes
   350  NAME                                  CPU(cores)   CPU%      MEMORY(bytes)   MEMORY%   
   351  gke-prow-default-pool-116f5d3e-1t67   85m          2%        5228Mi          42%       
   352  gke-prow-default-pool-116f5d3e-2jr0   141m         3%        2948Mi          23%       
   353  gke-prow-default-pool-116f5d3e-3s34   212m         5%        4702Mi          37%       
   354  gke-prow-default-pool-116f5d3e-8q5f   351m         8%        3905Mi          31%       
   355  gke-prow-default-pool-116f5d3e-bl6h   2349m        59%       3519Mi          28%       
   356  gke-prow-default-pool-116f5d3e-csc4   252m         6%        3359Mi          27%       
   357  gke-prow-default-pool-116f5d3e-qdpu   70m          1%        2505Mi          20%       
   358  gke-prow-default-pool-116f5d3e-rdf6   1531m        39%       1721Mi          13%       
   359  gke-prow-ghproxy-2cc2c707-n3p7        76m          0%        2574Mi          9%        
   360  senlu@senlu:~/work/src/k8s.io/test-infra/prow$ kubectl get po -n=test-pods
   361  NAME                                   READY     STATUS    RESTARTS   AGE
   362  a17e4af9-2c0c-11e9-b21b-0a580a6c0715   2/2       Running   0          9m
   363  cm-acme-http-solver-qswmf              1/1       Running   0          9m
   364  cm-acme-http-solver-vfcsk              1/1       Running   0          9m
   365  ```
   366  
   367  things should be fine now - there's also a pending fix https://github.com/kubernetes/test-infra/pull/11209 - will send out an email to k-dev shortly.
   368  stevekuznetsov
   369  #11209 Don't overwrite a provided literal default client
   370  Signed-off-by: Steve Kuznetsov skuznets@redhat.com
   371  
   372  This fixes the issue we had today
   373  
   374  /assign @fejta @cjwagner  
   375  /cc @BenTheElder @krzyzacy
   376  Assignees
   377  fejta, cjwagner
   378  Labels
   379  approved, area/prow, cncf-cla: yes, sig/testing, size/XS
   381  
   382  ---