# February 8th, 2019 - Prow responding slowly & Partial outage

Created By: [Sen Lu]
Last Modified: 2019-02-08

## Summary

[The K8s Prow instance](prow.k8s.io) was scheduling test jobs onto the prow service cluster,
which caused those jobs to never finish (they could not start because of missing credentials)
and the service cluster to run out of resources.

## Impact

Jobs scheduled onto the service cluster were unable to start due to missing volumes. (Between
1:46pm - 5:30pm)

All other prow controllers became very slow and nearly unusable (~5min delay for a response from hook).
(Between 4:06pm (first notice) - 5:30pm)


## Root Cause

Our build-cluster secret literally defines a "default" cluster alias which points to our build cluster.
The new client refactor did not respect that entry: it overwrote the provided "default" alias with the
in-cluster client, so prow scheduled all workloads into the service cluster (which became the default
cluster alias).

See https://github.com/kubernetes/test-infra/pull/11209 for more context.


<!-- TODO(krzyzacy): better understand why this is not deterministic
https://github.com/kubernetes/test-infra/pull/11263#issuecomment-463312588
-->

## Detection

https://github.com/kubernetes/test-infra/pull/11200#issuecomment-461988355

While @krzyzacy was poking around a PR, suddenly cats stopped working! @krzyzacy checked githubstatus.com,
where everything seemed fine, then asked on slack and found out [a prow node
was running at > 100% CPU and Memory](https://kubernetes.slack.com/archives/C7J9RP96G/p1549671585603700).

@krzyzacy then poked the `test-pods` namespace and found roughly 600 pending pods stuck because they
could not find the right secret.

## Resolution

We rolled prow back to a version from before the problematic PR.
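The fix ([#11209](https://github.com/kubernetes/test-infra/pull/11209), "Don't overwrite a provided
literal default client") boils down to not clobbering an operator-supplied `default` entry when
building the alias-to-client map. A minimal Go sketch of the failure mode, using illustrative names
rather than the actual prow source:

```go
package main

import "fmt"

// Client stands in for a per-cluster kube client; only the target
// cluster name matters for this sketch.
type Client struct{ Cluster string }

// loadClients mirrors the buggy behavior: clients are built from the
// cluster secret, then the in-cluster (service cluster) client is
// unconditionally installed under "default", clobbering the operator's
// own "default" entry that pointed at the build cluster.
func loadClients(fromSecret map[string]Client, inCluster Client) map[string]Client {
	clients := map[string]Client{}
	for alias, c := range fromSecret {
		clients[alias] = c
	}
	clients["default"] = inCluster // bug: overwrites a provided literal "default"
	return clients
}

// fixedLoadClients mirrors the fix: only fall back to the in-cluster
// client when the operator did not define "default" themselves.
func fixedLoadClients(fromSecret map[string]Client, inCluster Client) map[string]Client {
	clients := map[string]Client{}
	for alias, c := range fromSecret {
		clients[alias] = c
	}
	if _, ok := clients["default"]; !ok {
		clients["default"] = inCluster
	}
	return clients
}

func main() {
	secret := map[string]Client{"default": {Cluster: "build"}}
	svc := Client{Cluster: "service"}
	fmt.Println(loadClients(secret, svc)["default"].Cluster)      // service (wrong: jobs land in the service cluster)
	fmt.Println(fixedLoadClients(secret, svc)["default"].Cluster) // build (right)
}
```

With the buggy loader, any job that does not name an explicit cluster resolves to `default` and
therefore runs in the service cluster, which matches the symptoms above (missing volumes, service
nodes at >100% CPU/memory).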
## Lessons learned

### What went well

- People were still around, and we actively identified and mitigated the issue

### What went poorly

- It's AGAIN a Friday afternoon

- The behavior is not deterministic

- We really need a real staging Prow :-) We usually depend on each other for signal, along the lines
  of "hey, openshift prow runs fine, so bumping k8s prow should be safe". However, each prow has a
  different setup (e.g. how the build cluster is defined), so ideally k8s prow should have a very
  similar staging area.

### Where we got lucky

- It's Friday afternoon... and oncall noticed the problem first, so we mitigated the issue before EoD

## Action items

- We need to be more careful when deploying sweeping prow changes: more e2e tests, or a real staging
  Prow, would give us more confidence.

## Timeline

All times in PST, on February 8th 2019

* [13:20] Prow's auto-generated bump PR was merged (https://github.com/kubernetes/test-infra/pull/11172), triggering a deploy,
  and Prow started to schedule workloads onto the service cluster.
* [16:04] @krzyzacy noticed Prow was responding very slowly and suspected a github issue again, but github reported all normal
* [16:19] @krzyzacy found that CPU/memory usage on the prow service nodes was abnormally high
* [16:21] @krzyzacy found hundreds of pending pods in the service cluster which should have lived in our build cluster;
  the GKE console showed 225 unschedulable pods
* [16:31] Other people, like @cblecker, started to notice that prow had network issues
* [16:40] @krzyzacy rolled Prow back to `v20190206-b01b288`, the version from before the prow bump;
  in the meantime @Katharine cleaned up all the pods & prowjobs in the service cluster
* [16:44] @krzyzacy figured out we needed a manual deployment, but was missing the permission to do so; filed https://github.com/kubernetes/test-infra/issues/11206
* [16:50] @krzyzacy deployed the revert manually via `kubectl apply`
* [17:07] @krzyzacy found the issue persisted, and was confused
* [17:15] @stevekuznetsov asked `does your cluster yaml literally define a "default" cluster alias?` and figured out the issue
* [17:25] @stevekuznetsov sent out a fix as https://github.com/kubernetes/test-infra/pull/11209;
  @krzyzacy rolled Prow back to `v20190125-2aca69d`, the version from before the bug was introduced
* [17:59] @krzyzacy cleaned up all the misplaced pods and prowjobs again, and confirmed Prow was healthy
* [18:09] @krzyzacy notified the k-dev mailing list

## Appendix

### #testing-ops chat logs

> krzyzacy [1:46 PM]
we are on `v20190208-c35c7e3` now
`label-sync-1549660620-8nsx4   0/1   OOMKilled   0   29m` /shrug? :thisisfineparrot:

> krzyzacy [4:06 PM]
anyone seeing webhook issues?
@stevekuznetsov @munnerz
or actually - `@krzyzacy: https://thecatapi.com appears to be down` :disappointed:

> Katharine [4:09 PM]
I am generally having issues accessing prow.k8s.io (edited)

> krzyzacy [4:15 PM]
```
time="2019-02-09T00:11:37Z" level=warning msg="Metric families 'name:\"go_memstats_sys_bytes\" help:\"Number of bytes obtained from system.\" type:GAUGE metric:<label:<name:\"instance\" value:\"hook-d5bd467f9-vr6hk\" > label:<name:\"job\" value:\"hook\" > gauge:<value:1.42801144e+08 > > ' and 'name:\"go_memstats_sys_bytes\" help:\"Number of bytes obtained by system. Sum of all system allocations.\" type:GAUGE metric:<label:<name:\"instance\" value:\"ghproxy-79c4f7695b-zmnhx\" > label:<name:\"job\" value:\"ghproxy\" > gauge:<value:5.33741372e+09 > > metric:<label:<name:\"instance\" value:\"plank-7d5d4bdb65-c2475\" > label:<name:\"job\" value:\"plank\" > gauge:<value:1.056411896e+09 > > metric:<label:<name:\"instance\" value:\"plank-7854db4c49-h9zpn\" > label:<name:\"job\" value:\"plank\" > gauge:<value:1.265917176e+09 > > ' are inconsistent, help and type of the latter will have priority. This is bad. Fix your pushed metrics!" source="diskmetricstore.go:114"
```

what does that mean :confused: (edited)

> krzyzacy [4:19 PM]
`gke-prow-default-pool-116f5d3e-bl6h   4022m   102%   13681Mi   110%` that's also worrisome...

> krzyzacy [4:21 PM]
there are hundreds of pods in the service cluster?
> cblecker [4:31 PM]
replied to a thread:
I'm seeing jobs that are failing on network stuff, failing to upload logs to GCS, and bot commands that aren't responding at all.

> bentheelder [4:37 PM]
days since prow k8s client bugs: 0

> krzyzacy [4:40 PM]
https://github.com/kubernetes/test-infra/pull/11205
#11205 Revert "Bump prow from v20190206-b01b288 to v20190208-c35c7e3"

> bentheelder [4:42 PM]
also seeing errors doing /lgtm, assigning is failing

> Katharine [4:42 PM]
I think this is all just fallout from it being crazy overloaded (edited)

> Katharine [4:42 PM]
110% is not a reasonable amount of scheduling

> stevekuznetsov [4:54 PM]
man
deploy friday evening
they said
it will be fine they said

> bentheelder [4:55 PM]
[:fire: :fire_engine: ongoing]

> stevekuznetsov [5:00 PM]
Question by the way @Katharine @bentheelder @krzyzacy are you using cpu/mem requests and limits? _very_ surprised that the scheduler was able to starve core prow services from doing things -- obviously scheduling to the service cluster is not what you wanted but also...

> krzyzacy [5:01 PM]
we have limits for build cluster!

> stevekuznetsov [5:02 PM]
Shouldn't the limits be on the Pods?
Oh you're saying no prow service deployments have limits/requests

> fejta [5:03 PM]
So we're scheduling all jobs in the service cluster?
> stevekuznetsov [5:03 PM]
not anymore
but that was the symptom seen
@fejta Sen posted the startup logs for the "good" plank deployment here:
```
senlu@senlu:~/work/src/k8s.io/test-infra/prow/cluster$ kubectl logs plank-64c7c5cbd4-fx5g6
{"component":"plank","level":"info","msg":"Loading cluster contexts...","time":"2019-02-09T00:50:01Z"}
{"component":"plank","level":"info","msg":"* in-cluster","time":"2019-02-09T00:50:01Z"}
{"component":"plank","level":"info","msg":"* test-infra-trusted","time":"2019-02-09T00:50:01Z"}
{"component":"plank","level":"info","msg":"* default","time":"2019-02-09T00:50:01Z"}
```
has not posted a "bad" one yet

> krzyzacy [5:04 PM]
```
E {"level":"info","component":"plank","msg":"Loading cluster contexts..."}
E {"component":"plank","msg":"* in-cluster","level":"info"}
E {"component":"plank","msg":"* k8s-tpu-operator","level":"info"}
E {"component":"plank","msg":"* security","level":"info"}
E {"component":"plank","msg":"* test-infra-trusted","level":"info"}
E {"component":"plank","msg":"* default","level":"info"}
```
is that enough?

> stevekuznetsov [5:05 PM]
is there a diff between the two?
The "good" one seems to have

```
{"component":"plank","level":"info","msg":"* gke_rules-k8s_us-central1-f_testing","time":"2019-02-09T00:50:01Z"}
```
whereas the bad one does not?
Which is the alias for the service cluster?

> stevekuznetsov [5:06 PM]
(why is the kubeconfig for the service cluster even in your clusters yaml given to plank to schedule pods????)

> stevekuznetsov [5:06 PM]
Also I need to jump but if you confirm the SHA bounds that @Katharine posted ...
there is no suspicious code change in that deployment AFAICT

```
$ git log --no-merges --pretty='%cd | %h | %<(20)%an | %s' --date=short b01b288..c35c7e3 -- prow/ ':!prow/config.yaml' ':!prow/plugins.yaml' ':!prow/cluster'
2019-02-07 | 73ab4a815 | Kyle Weaver          | Deck: add missing org to GitHub links
2019-02-07 | 4bef440ea | Katharine Berry      | Satisfy the TypeScript linter.
2019-02-07 | fabe99ef2 | Katharine Berry      | Update prow/cmd/deck/static/prow/prow.ts
2019-02-07 | f58a0539c | Matthias Bertschy    | Adding PluginHelp.Config for size
2019-02-06 | acdf2fbb3 | Benjamin Elder       | register help for buildifier
2019-02-06 | 9d5466153 | Mikhail Konovalov    | update broken link
2019-01-26 | e9312fed6 | Alvaro Aleman        | Fix spyglass cell generation when using spyglass as primary viewer
```

> Katharine [5:07 PM]
We've tracked the start of bad scheduling to three minutes after the last deploy

> krzyzacy [5:07 PM]
I think the problem persists after I revert back to wednesday

> stevekuznetsov [5:07 PM]
Do you have API server audit logs for your secrets?
Are you certain your secrets did not change?

> Katharine [5:08 PM]
(maybe the deployer itself is busted?)
> stevekuznetsov [5:08 PM]
oh okay so per @fejta the issue is an incorrect _namespace_ in the correct _cluster_

> Katharine [5:09 PM]
Incorrect jobs in the correct namespace in the wrong cluster

> stevekuznetsov [5:09 PM]
ah

> Katharine [5:09 PM]
There are jobs that should go there, but they are not these ones

> stevekuznetsov [5:09 PM]
can you confirm that your cluster yaml file has the correct alias mapping to the correct cluster
were you also seeing trusted jobs scheduled to the normal build cluster
was it jobs _regardless of alias_ scheduling to the trusted cluster
or was it only jobs for the normal build alias scheduling to the build cluster

> Katharine [5:10 PM]
I haven't checked, don't know if anyone else has. Will take a look at what's going on there

> stevekuznetsov [5:10 PM]
Is it still going on?
(audit logs for the secret holding your cluster yaml would be huge)

> Katharine [5:11 PM]
apparently yes.

> krzyzacy [5:12 PM]
hummmm - the one from Wednesday seems to be good?
(but not anymore?)
> Katharine [5:13 PM]
we were still creating pods in the build cluster after we started also creating them in the service cluster

> krzyzacy [5:13 PM]
```
senlu@senlu:~/work/src/k8s.io/test-infra/prow/cluster$ kubectl get secret
NAME            TYPE      DATA      AGE
build-cluster   Opaque    1         352d
```

that didn't get a change

> fejta [5:13 PM]
I don't think it is that exciting, just a bad release

> fejta [5:14 PM]
`kubectl --context=prow get pods --namespace=test-pods -o yaml | grep prow.k8s.io/job`
basically just looks like everything is scheduling in the service cluster now
```
$ kubectl --context=prow-builds get pods --namespace=test-pods -o yaml | grep prow.k8s.io/job | wc -l
532
fejta-macbookpro:test-infra fejta$ kubectl --context=prow get pods --namespace=test-pods -o yaml | grep prow.k8s.io/job | wc -l
1308
```

> stevekuznetsov [5:15 PM]
```
kubectl --context=prow get pods --namespace=test-pods -o "jsonpath={.items[*].spec.job}" | tr ' ' $'\n' | sort | uniq -c | sort -nr
```
hrm
so it looks like the "default" cluster was used for normal jobs in the past
does your cluster yaml literally define a "default" cluster alias?

> fejta [5:17 PM]
yes

> stevekuznetsov [5:18 PM]
I think I see the bug
very interesting
I think this is just not deterministic
but is a bad commit from before
always latent

> bentheelder [5:19 PM]
replied to a thread:
A lot of these were already cleaned up. Everything in the wrong cluster means nothing running ...?
Plus prow is non-responsive / broken in general because the services cluster is overloaded.

> krzyzacy [5:21 PM]
hummm
shall we roll back to last week?
> fejta [5:21 PM]
yes
these client refactors are a dud :stuck_out_tongue:
We really ought not to have refactored every binary at once, and gotten some signal that things work first. Oh well.
We should consider reverting plank to use the old client until we can prove things work correctly for the other binaries.
Not sure how much trouble that will be at this point.

> stevekuznetsov [5:24 PM]
you want to revert to at least `ca4be59af7dc673b3b5fb3a16711bee21b4e736a`

> bentheelder [5:24 PM]
aside: this failure mode is odd https://github.com/kubernetes/sig-release/pull/467#issuecomment-461994843 (edited)

> stevekuznetsov [5:24 PM]
reverting commit to `plank` should be fine

> krzyzacy [5:25 PM]
https://github.com/kubernetes/test-infra/pull/11205
#11205 Revert prow to v20190125-2aca69d
This reverts commit e6a015b.
/assign @BenTheElder @Katharine
I believe if we only revert plank, sinker will also be busted
lol I start to type things like `kubectl commit --amend`

> krzyzacy [5:59 PM]
```
senlu@senlu:~/work/src/k8s.io/test-infra/prow$ kubectl top nodes
NAME                                  CPU(cores)   CPU%      MEMORY(bytes)   MEMORY%
gke-prow-default-pool-116f5d3e-1t67   85m          2%        5228Mi          42%
gke-prow-default-pool-116f5d3e-2jr0   141m         3%        2948Mi          23%
gke-prow-default-pool-116f5d3e-3s34   212m         5%        4702Mi          37%
gke-prow-default-pool-116f5d3e-8q5f   351m         8%        3905Mi          31%
gke-prow-default-pool-116f5d3e-bl6h   2349m        59%       3519Mi          28%
gke-prow-default-pool-116f5d3e-csc4   252m         6%        3359Mi          27%
gke-prow-default-pool-116f5d3e-qdpu   70m          1%        2505Mi          20%
gke-prow-default-pool-116f5d3e-rdf6   1531m        39%       1721Mi          13%
gke-prow-ghproxy-2cc2c707-n3p7        76m          0%        2574Mi          9%

senlu@senlu:~/work/src/k8s.io/test-infra/prow$ kubectl get po -n=test-pods
NAME                                   READY     STATUS    RESTARTS   AGE
a17e4af9-2c0c-11e9-b21b-0a580a6c0715   2/2       Running   0          9m
cm-acme-http-solver-qswmf              1/1       Running   0          9m
cm-acme-http-solver-vfcsk              1/1       Running   0          9m
```

things should be fine now - there's also a pending fix https://github.com/kubernetes/test-infra/pull/11209 - will send out an email to k-dev shortly.
#11209 Don't overwrite a provided literal default client
Signed-off-by: Steve Kuznetsov skuznets@redhat.com

This fixes the issue we had today

/assign @fejta @cjwagner
/cc @BenTheElder @krzyzacy

---