# March 26th, 2019 Boskos Outage

All times in PDT (UTC-7).

## Summary

The Boskos instance responsible for vending GCP projects to test jobs got stuck in a crash loop
due to a hidden bug that was unexpectedly deployed at night. Consequently, all jobs that needed
a project from boskos (including a required job for all kubernetes/kubernetes PRs) failed for
just over nine hours, while on-call was sleeping.

## Impact

All jobs depending on boskos failed between 2019-03-25 23:04 and 2019-03-26 08:12. As a result
it was impossible to merge any PR in kubernetes/kubernetes during those nine hours, because
`pull-kubernetes-e2e-gce` would always fail. There were similar effects on other projects during
this time period. On `pull-kubernetes-e2e-gce` alone, 70 failures were caused by this outage.

## Root Cause

A [bug was introduced](https://github.com/kubernetes/test-infra/commit/02110870507abdeb93dfd6205c70f82145f8993c#r32917192)
in a commit that landed last December that, when deployed, would cause Boskos to be unable
to start up. Nobody actually attempted to deploy Boskos until last week (2019-03-20). However, while
the StatefulSet that runs Boskos was successfully updated, the actual pod was not replaced, because
the StatefulSet has no configuration to do so, and nobody noticed the lack of functional change.

At around 17:20, several hours before the outage, we started upgrading the Prow clusters from Kubernetes
1.11 to 1.13. The build cluster has 150 nodes, and each update takes five minutes, so the entire
process would take about ten hours to complete. Several hours later, nothing looked amiss, and we
went to bed.

At 23:04 the node running the only Boskos pod was drained to be replaced with a 1.13 node, and boskos
was moved to a new node. The replacement pod used the StatefulSet's updated spec, which specified a newer
image version containing the bug. As a result, boskos was never able to come up on the new node.

## Detection

Users reported issues at 02:50 (via GitHub) and at 05:16 and 06:27 (both via Slack). While monitoring
[clearly shows the failure](http://velodrome.k8s.io/dashboard/db/boskos-dashboard?orgId=1&from=1553576353087&to=1553615983139),
no alerting caught it.

## Resolution

Once the bug was found, a new image was built and deployed at 08:10. Boskos recovered immediately,
and the projects left hanging open were cleaned up over the next few minutes. Tests began passing
shortly thereafter.

## Lessons Learned

### What went well

- Once people were around to deal with it, the problem was resolved in fairly short order.

### What went poorly

- Beginning a ten-hour deploy process shortly after 5pm, when we only have coverage until 6pm, is probably inadvisable
- Deployments should be monitored to verify that a change actually rolled out (or, if that's not feasible, not performed at all)
- Boskos' StatefulSet is not configured to perform updates, leading to a surprising lack of action (see the sketch at the end of this section)
- It appears that the change committed in December was never tested
- Unlike much of Prow, boskos deployments are handled manually, increasing the room for error

### Where we got lucky

We didn't, really.
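The configuration gap described in the Root Cause is the StatefulSet's update strategy. Below is a minimal sketch, not the actual Boskos manifest (the names, labels, and image tag are either taken from this incident or invented for the example): in the `apps/v1` API, `updateStrategy.type: RollingUpdate` makes the controller replace pods whenever the pod template changes, while `OnDelete` (the default under the older `apps/v1beta1` API) leaves running pods untouched until something else deletes them, such as a node drain. Whether Boskos hit this via an explicit `OnDelete` or an older API default isn't established here.

```yaml
# Illustrative sketch only -- not the actual Boskos manifest.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: boskos
  namespace: test-pods
spec:
  replicas: 1
  serviceName: boskos
  selector:
    matchLabels:
      app: boskos
  updateStrategy:
    type: RollingUpdate  # replace pods automatically whenever the template below changes
  template:
    metadata:
      labels:
        app: boskos
    spec:
      containers:
      - name: boskos
        # Tag taken from this incident; with OnDelete, bumping it changes
        # nothing until the running pod is deleted (e.g. by a node drain).
        image: gcr.io/k8s-testimages/boskos:v20180405-12e892d69
```

Switching to a Deployment, as suggested in the postmortem thread in the appendix, avoids the problem in a different way, since a Deployment always rolls its pods when the template changes.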
## Action Items

- Update Boskos' configuration so deployments actually happen as expected ([#11956](https://github.com/kubernetes/test-infra/issues/11956))
- Automate Boskos deploys ([#11957](https://github.com/kubernetes/test-infra/issues/11957))
- Improve alerting around boskos status ([#11953](https://github.com/kubernetes/test-infra/issues/11953))

## Timeline

- [12-05 16:21] Crashloop bug preventing Boskos from starting is introduced
- [03-20 16:46] Boskos StatefulSet spec is updated but the pod is not replaced
- [03-25 17:40] Kubernetes node upgrade is initiated on the build cluster
- [03-25 23:04] The node hosting boskos is drained. The replacement pod is stuck in a crash loop. (**outage starts**)
- [03-26 02:50] The outage is reported on GitHub by @mborsz
- [03-26 05:16] The outage is reported on Slack by @chenk008
- [03-26 06:27] The outage is reported on Slack again by @liggitt
- [03-26 07:16] @Katharine wakes up and notices the Slack reports
- [03-26 07:37] @liggitt finds the cause of the failure
- [03-26 08:10] @Katharine successfully deploys a fixed Boskos (**outage ends**)

## Appendix

### #sig-testing chat log

> liggitt [6:27 AM]
> @Katharine is the "Something went wrong: failed to prepare test environment: --provider=gce boskos failed to acquire project: Post http://boskos.test-pods.svc.cluster.local./acquire?type=gce-project&state=free&dest=busy&owner=pull-kubernetes-e2e-gce" error related to the tot -> snowflake ID change or is that unrelated? (edited)
>
> liggitt [6:27 AM]
> seeing that on all e2e-gce jobs
>
> Katharine [7:16 AM]
> Haven't made the snowflake ID change yet so hope it's not preemptively related
>
> liggitt [7:17 AM]
> heh, ok
>
> liggitt [7:17 AM]
> "winter is coming"
>
> Katharine [7:20 AM]
> That is a lot of failures
>
> Katharine [7:20 AM]
> ```Post http://boskos.test-pods.svc.cluster.local./acquire?type=gce-project&state=free&dest=busy&owner=pull-kubernetes-e2e-gce: dial tcp 10.63.250.132:80: i/o timeout```
>
> Katharine [7:21 AM]
> Looks like boskos died completely shortly before midnight PT
>
> liggitt [7:23 AM]
> ah
>
> Katharine [7:24 AM]
> Strongly suspect the k8s upgrade we did yesterday evening
>
> Katharine [7:25 AM]
> ```{"error":"CustomResourceDefinition.apiextensions.k8s.io \"resources.boskos.k8s.io\" is invalid: [spec.versions: Invalid value: []apiextensions.CustomResourceDefinitionVersion(nil): must have exactly one version marked as storage version, status.storedVersions: Invalid value: []string(nil): must have at least one stored version]","level":"fatal","msg":"unable to create a CRD client","time":"2019-03-26T09:44:13Z"}```
>
> liggitt [7:26 AM]
> did the CRD manifest just change?
>
> Katharine [3 hours ago]
> Not that I know of but there have been a few boskos-adjacent changes lately. Plus a k8s upgrade from 1.11 to 1.13
>
> Katharine [3 hours ago]
> I've gotta go for a few minutes
>
> Katharine [3 hours ago]
> Also -> #testing-ops

### #testing-ops chat log

> Katharine [7:22 AM]
> Boskos seems to have gone down
> Looks like it's complaining about CRDs
> Possibly maybe related to that k8s upgrade?
> Timing is off though
> ```{"error":"CustomResourceDefinition.apiextensions.k8s.io \"resources.boskos.k8s.io\" is invalid: [spec.versions: Invalid value: []apiextensions.CustomResourceDefinitionVersion(nil): must have exactly one version marked as storage version, status.storedVersions: Invalid value: []string(nil): must have at least one stored version]","level":"fatal","msg":"unable to create a CRD client","time":"2019-03-26T09:44:13Z"}```
> Be right back
>
> liggitt [7:28 AM]
> joined #testing-ops along with Draven.
>
> liggitt [7:37 AM]
> looks like https://github.com/kubernetes/test-infra/commit/02110870507abdeb93dfd6205c70f82145f8993c#r32917192
>
> Katharine [7:39 AM]
> That took a remarkably long time to show up
>
> liggitt [7:40 AM]
> wouldn't have noticed until it got deployed on a new cluster with the CRD missing (edited)
>
> Katharine [7:40 AM]
> yup
>
> liggitt [7:40 AM]
> clearly, we need CI CI :-D
>
> chenk008 [7:41 AM]
> joined #testing-ops.
>
> dims [7:41 AM]
> turtles all the way down :slightly_smiling_face:
>
> liggitt [7:49 AM]
> I opened https://github.com/kubernetes/test-infra/pull/11947
> I'm not sure what else is required to push a new version of boskos live, so feel free to close that if there's a more complete fix needed
>
> Katharine [7:52 AM]
> I'm working on it
> unfortunately I'm working on it from a car with a laptop at 5% battery and no physical gnubby so it's an adventure
> (TouchID saves the day, mostly)
>
> liggitt [7:53 AM]
> :car: :fearful: :battery:
>
> Katharine [7:57 AM]
> Made it to the office!
>
> Katharine [8:10 AM]
> we should have a boskos
> yup
> hopefully it recovers
> looks like the janitors are cleaning up the mess and we should be back soon enough
> ...meanwhile I'm going to commit these changes.
> Thanks for finding the bug @liggitt!
> https://github.com/kubernetes/test-infra/pull/11949 could use a [retroactive] lgtm from anyone who feels like it
>
> dims [8:20 AM]
> :lgtm:
>
> Katharine [8:21 PM]
> Well that was an uncomfortably exciting morning.

### #testing-ops postmortem thread

> fejta [8:23 AM]
> Is this related to the cluster upgrade??
>
> Katharine [8:23 AM]
> My best guess is yes
>
> Katharine [8:24 AM]
> There was a boskos bug introduced in December that didn't matter until 11pm last night
>
> liggitt [8:24 AM]
> affected creation of the boskos CRD in new clusters (edited)
>
> liggitt [8:24 AM]
> and apparently there wasn't a new cluster between December and last night
>
> fejta [8:40 AM]
> what do you mean by new cluster?
>
> fejta [8:40 AM]
> I guess upgrading a cluster constitutes a new cluster?
>
> liggitt [8:40 AM]
> a cluster which did not already have the boskos CRD installed (edited)
>
> fejta [8:40 AM]
> I upgraded the service and build clusters yesterday, but didn't create any
>
> liggitt [8:41 AM]
> actually, looks like it would have failed even on existing clusters
>
> liggitt [8:41 AM]
> so we were using the pre-december version of boskos prior to yesterday?
>
> fejta [8:41 AM]
> um, no?
>
> Katharine [8:42 AM]
> No, we were using one from last week
>
> fejta [8:42 AM]
> The only thing I did was click the upgrade cluster button on the GKE cloud console
>
> Katharine [8:42 AM]
> And we were still using it until I changed it this morning
>
> fejta [8:42 AM]
> (also the cluster's nodepools) (edited)
>
> Katharine [8:42 AM]
> It looks like it got spun up on another node when the one it was on was drained and then crash looped forever
>
> fejta [8:43 AM]
> maybe something changed between 1.11 and 1.13?
>
> fejta [8:43 AM]
> in how a bad CRD is handled?
>
> liggitt [8:44 AM]
> no... the thing it was trying to create would never have been valid
>
> liggitt [8:44 AM]
> so https://github.com/kubernetes/test-infra/pull/11576/files#diff-7eea92bcdf0c3e4ba5ad151d80562646L89 bumped boskos last week
>
> liggitt [8:44 AM]
> from a version that would have worked (pre-december) to a version that did not work
>
> liggitt [8:44 AM]
> and that new version was live and effective?
>
> liggitt [8:45 AM]
> (I don't know how those manifest changes get rolled out)
>
> fejta [8:45 AM]
> ideally we should make them auto-deploy like the rest of prow on merge
>
> Katharine [8:45 AM]
> In boskos' case, manually
>
> liggitt [8:45 AM]
> is it possible an old working version (boskos:v20180405-12e892d69) was still running until the nodes got drained last night as part of the upgrade? (edited)
>
> liggitt [8:45 AM]
> ah
>
> fejta [8:46 AM]
> what would happen if we applied the change last week that was bad?
>
> Katharine [8:46 AM]
> It's possible someone tried to bump it but it wasn't actually bumped and just kept running an old version
>
> Katharine [8:46 AM]
> Then when the one pod it has was drained anyway it tried bringing up a new one
>
> fejta [8:47 AM]
> Is the fact that boskos is a statefulset relevant?
>
> fejta [8:47 AM]
> (why is it a statefulset? how does applying changes to a statefulset differ from a deployment?)
>
> Katharine [9:09 AM]
> Checked the audit logs, pod was indeed still running `gcr.io/k8s-testimages/boskos:v20180405-12e892d69` until yesterday despite the statefulset being changed.
>
> fejta [9:11 AM]
> IMO we should switch it to a deployment
>
> fejta [9:11 AM]
> I don't think it is stateful so much as a deployment with a PVC
>
> Katharine [9:12 AM]
> That is also what it looks like to me but Seb probably has the context there
>
> Katharine [9:17 AM]
> Audit logs also indicate that we tried to update it last week, but nothing happened
>
> Katharine [9:18 AM]
> so that completes the story
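
### Aside: the CRD validation error

The fatal log quoted twice in the chat logs above is the apiextensions API rejecting a CustomResourceDefinition that declares no versions: a CRD must list exactly one version marked as the storage version. For reference only, a minimal sketch of a definition that passes that validation follows; it is illustrative, not the actual Boskos CRD or a reconstruction of the offending commit, and the kind and version names are assumptions (the group and plural are taken from the error message).

```yaml
# Illustrative sketch only -- not the actual Boskos CRD.
apiVersion: apiextensions.k8s.io/v1beta1  # the CRD API available on these clusters
kind: CustomResourceDefinition
metadata:
  name: resources.boskos.k8s.io   # must be <plural>.<group>
spec:
  group: boskos.k8s.io
  scope: Namespaced
  names:
    plural: resources
    singular: resource
    kind: Resource                # assumed kind name
  versions:
  - name: v1                      # assumed version name
    served: true
    storage: true                 # exactly one entry must be the storage version
```

Failing this check is why Boskos logged "unable to create a CRD client" and crash looped on startup instead of serving.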