---
title: Machine health checking a.k.a node auto repair
authors:
  - "@enxebre"
  - "@bison"
  - "@benmoss"
reviewers:
  - "@detiber"
  - "@vincepri"
  - "@ncdc"
  - "@timothysc"
creation-date: 2019-10-30
last-updated: 2021-01-28
status: implementable
see-also:
replaces:
superseded-by:
---

# Title
- Machine health checking a.k.a node auto repair

## Table of Contents

<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->

- [Glossary](#glossary)
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals/Future Work](#non-goalsfuture-work)
- [Proposal](#proposal)
  - [Unhealthy criteria:](#unhealthy-criteria)
  - [Remediation:](#remediation)
    - [Conditions VS External Remediation](#conditions-vs-external-remediation)
  - [User Stories](#user-stories)
    - [Story 1](#story-1)
    - [Story 2](#story-2)
    - [Story 3 (external remediation)](#story-3-external-remediation)
    - [Story 4 (external remediation)](#story-4-external-remediation)
    - [Story 5 (external remediation)](#story-5-external-remediation)
  - [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints)
    - [MachineHealthCheck CRD:](#machinehealthcheck-crd)
    - [Machine conditions:](#machine-conditions)
    - [External Remediation](#external-remediation)
      - [Example CRs](#example-crs)
    - [MachineHealthCheck controller:](#machinehealthcheck-controller)
  - [Risks and Mitigations](#risks-and-mitigations)
    - [Contradictory signal](#contradictory-signal)
- [Alternatives](#alternatives)
- [Upgrade Strategy](#upgrade-strategy)
- [Additional Details](#additional-details)
  - [Test Plan [optional]](#test-plan-optional)
  - [Graduation Criteria [optional]](#graduation-criteria-optional)
  - [Version Skew Strategy [optional]](#version-skew-strategy-optional)
- [Implementation History](#implementation-history)

<!-- END doctoc generated TOC please keep comment here to allow auto update -->

## Glossary
Refer to the [Cluster API Book Glossary](https://cluster-api.sigs.k8s.io/reference/glossary.html).

## Summary
Enable opt-in automated health checking and remediation of unhealthy nodes backed by machines.

## Motivation
- Reduce the administrative overhead of running a cluster.
- Increase the ability to respond to machine failures and keep the cluster nodes healthy.

### Goals
- Enable automated remediation for groups of machines/nodes (e.g. a machineSet).
- Allow users to define different health criteria, based on node conditions, for different groups of nodes.
- Provide a means for the cluster administrator to configure thresholds for disabling automated remediation when multiple nodes are unhealthy at the same time.
- Facilitate rapid experimentation by making it possible to define customized remediation flows outside of the Machine Health Check and CAPI codebase.

### Non-Goals/Future Work
- Record a long-term stable history of all health-check failures or remediations.
- Provide a mechanism to guarantee that application quorum for N members is maintained at any time.
- Create an escalation path from failed external remediation attempts to machine deletion.
- Provide a finalizer-like pre-hook mechanism that allows customizing or blocking the remediation process before a machine is power-cycled or deleted. This concept may already be covered by a separate enhancement.

## Proposal
The machine health checker (MHC) is responsible for marking machines that back unhealthy nodes.
MHC requests a remediation in one of the following ways:
- Applying a condition which the owning controller consumes in order to remediate the machine (default).
- Creating a CR based on a template, which signals an external component to remediate the machine.

It provides a short-circuit mechanism that limits remediation when the number of unhealthy machines is not within `unhealthyRange`, or has reached the `maxUnhealthy` threshold, for a targeted group of machines, with `unhealthyRange` taking precedence.
This is similar to what the node lifecycle controller does to reduce the eviction rate as nodes become unhealthy in a given zone, e.g. when a large number of nodes in a single zone are down due to a networking issue.

The machine health checker is an integration point between node problem detection tooling, expressed as node conditions, and remediation, to achieve a node auto repairing feature.

### Unhealthy criteria:
A machine is unhealthy when:
- The referenced node meets the defined unhealthy node conditions criteria.
- The Machine has no nodeRef.
- The Machine has a nodeRef but the referenced node is not found.

If any of those criteria are met for longer than the given timeouts, and the number of unhealthy machines is either within the `unhealthyRange`, if specified, or has not reached the `maxUnhealthy` threshold, the machine will be marked as failing the health check.

Timeouts:
- For the node conditions, the timeouts are defined by the admin.
- For a machine with no nodeRef, an opinionated value could be assumed, e.g. 10 min.

### Remediation:
- Remediation is not an integral part or responsibility of MachineHealthCheck. This controller only functions as a means for others to act in the best way possible when a Machine is unhealthy.
#### Conditions VS External Remediation

Condition-based remediation offers no remediation other than deleting an unhealthy Machine and replacing it with a new one.

Environments consisting of hardware-based clusters are significantly slower to (re)provision unhealthy machines, so they need a remediation flow that includes at least one attempt at power-cycling unhealthy nodes.

Other environments and vendors also have specific remediation requirements, so there is a need for a generic mechanism for implementing custom remediation logic.

### User Stories

#### Story 1
As a user of a Workload Cluster, I only care about my app's availability, so I want my cluster infrastructure to be self-healing and the nodes to be remediated transparently in case of failures.

#### Story 2
As an operator of a Management Cluster, I want my machines to be self-healing and to be recreated, resulting in a new healthy node, when they match my unhealthy criteria.

#### Story 3 (external remediation)
As an admin of a hardware-based cluster, I would like unhealthy nodes to be power-cycled, so that I can recover from transient errors faster and begin application recovery sooner.

#### Story 4 (external remediation)
As an admin of a hardware-based cluster, I would like unhealthy nodes to be power-cycled, so that I can detect non-transient issues faster.

#### Story 5 (external remediation)
As an admin of a hardware-based cluster, I would like the system to keep attempting to power-cycle unhealthy nodes, so that they are automatically added back to the cluster when I fix the underlying problem.

### Implementation Details/Notes/Constraints

#### MachineHealthCheck CRD:
- Enable watching a group of machines (based on a label selector).
- Enable defining unhealthy node criteria (based on a list of node conditions).
- Enable setting a threshold of unhealthy nodes. If the current number is at or above this threshold, no further remediation will take place. This can be expressed as an int or as a percentage of the total targets in the pool.

E.g.:
- I want a machine to be remediated when its associated node has a `ready=false` or `ready=Unknown` condition for more than 5m.
- I want to disable auto-remediation if 40% or more of the matching machines are unhealthy.

```yaml
apiVersion: cluster.x-k8s.io/v1alpha3
kind: MachineHealthCheck
metadata:
  name: example
  namespace: machine-api
spec:
  selector:
    matchLabels:
      role: worker
  unhealthyConditions:
  - type: "Ready"
    status: "Unknown"
    timeout: "5m"
  - type: "Ready"
    status: "False"
    timeout: "5m"
  maxUnhealthy: "40%"
status:
  currentHealthy: 5
  expectedMachines: 5
```

#### Machine conditions:

```go
const ConditionHealthCheckSucceeded ConditionType = "HealthCheckSucceeded"
const ConditionOwnerRemediated ConditionType = "OwnerRemediated"
```

- Both of these conditions are applied by the MHC controller. HealthCheckSucceeded should only be updated by the MHC after running the health check.
- If a health check passes after it has failed, the conditions will not be updated. The challenges around this can be addressed when in-place remediation is needed.
- OwnerRemediated is set to False after a health check fails, but should be changed to True by the owning controller after remediation succeeds.
- If remediation fails, OwnerRemediated can be updated to a higher severity and its reason can be updated to aid in troubleshooting.

This is the default remediation strategy.

#### External Remediation

A generic mechanism for supporting externally provided custom remediation strategies.
We propose modifying the MachineHealthCheck CRD to support a remediationTemplate, an ObjectReference to a provider-specific template CRD.

If no value for remediationTemplate is defined for the MachineHealthCheck CR, the condition-based flow is preserved.

If a value for remediationTemplate is supplied and the Machine enters an unhealthy state, the template will be instantiated using existing CAPI functionality, with the same name and namespace as the target Machine, and the remediation flow will be passed to an External Remediation Controller (ERC) watching for that CR.

No further action (deletion or applying conditions) will be taken by the MachineHealthCheck controller until the Node becomes healthy, at which point it will locate and delete the instantiated remediation CR.

```go
type MachineHealthCheckSpec struct {
	...

	// +optional
	RemediationTemplate *ObjectReference `json:"remediationTemplate,omitempty"`
}
```

When a Machine enters an unhealthy state, the MHC will:
* Look up the referenced template
* Instantiate the template (for simplicity, we will refer to this as an External Machine Remediation CR, or EMR)
* Force the name and namespace to match the unhealthy Machine
* Save the new object in etcd

We use the same name and namespace for the External Machine Remediation CR to ensure uniqueness and to lessen the possibility of multiple parallel remediations of the same Machine.

The lifespan of the EMRs is that of the remediation process, and they are not intended to be a record of past events.
The EMR will also contain an ownerRef to the Machine, to ensure that it does not outlive the Machine it references.

The only signaling between the MHC and the external controller watching EMR CRs is the creation and deletion of the EMR itself.
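The instantiation steps above can be sketched as follows. This is a simplified, hypothetical illustration using plain maps in place of `unstructured.Unstructured` and the controller-runtime client; the convention of deriving the instantiated kind by dropping the `Template` suffix is an assumption borrowed from CAPI's template-cloning behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// instantiate clones the body of a remediation template (spec.template.spec)
// into a new EMR object, forcing the name and namespace to match the
// unhealthy Machine and attaching an ownerRef so the EMR cannot outlive it.
// Real code would operate on unstructured objects and persist via the API server.
func instantiate(template map[string]any, machineName, machineNS, machineUID string) map[string]any {
	spec := template["spec"].(map[string]any)["template"].(map[string]any)["spec"]
	return map[string]any{
		"apiVersion": template["apiVersion"],
		// Assumed convention: "Metal3RemediationTemplate" -> "Metal3Remediation".
		"kind": strings.TrimSuffix(template["kind"].(string), "Template"),
		"metadata": map[string]any{
			// Same name/namespace as the Machine, for uniqueness and to
			// avoid parallel remediations of the same Machine.
			"name":      machineName,
			"namespace": machineNS,
			"ownerReferences": []map[string]any{{
				"apiVersion": "cluster.x-k8s.io/v1alpha3",
				"kind":       "Machine",
				"name":       machineName,
				"uid":        machineUID,
			}},
		},
		"spec": spec,
	}
}

func main() {
	tmpl := map[string]any{
		"apiVersion": "remediation.metal3.io/v1alphaX",
		"kind":       "Metal3RemediationTemplate",
		"spec": map[string]any{
			"template": map[string]any{
				"spec": map[string]any{"strategy": "escalate"},
			},
		},
	}
	emr := instantiate(tmpl, "worker-0", "default", "abc-123")
	fmt.Println(emr["kind"]) // Metal3Remediation
}
```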
Any actions or changes that admins should be informed about should be emitted as events for consoles and UIs to consume if necessary.
These events are informational only and do not result in, or expect, any behaviour from the MHC, Node, or Machine.

When the external remediation controller detects the new EMR, it starts remediation and performs whatever actions it deems appropriate until the EMR is deleted by the MHC.
When and how to retry remediation, in the event that an EMR is not deleted after the ERC considers remediation complete, is a detail of the ERC.

The ERC may wish to register a finalizer on its CR to ensure it has an opportunity to perform any additional cleanup, in case the unhealthy state was transient and the Node returned to a healthy state before the full custom ERC flow completed.

##### Example CRs

MachineHealthCheck:
```yaml
kind: MachineHealthCheck
apiVersion: cluster.x-k8s.io/v1alphaX
metadata:
  name: REMEDIATION_GROUP
  namespace: NAMESPACE_OF_UNHEALTHY_MACHINE
spec:
  selector:
    matchLabels:
      ...
  remediationTemplate:
    kind: Metal3RemediationTemplate
    apiVersion: remediation.metal3.io/v1alphaX
    name: M3_REMEDIATION_GROUP
```

Metal3RemediationTemplate:
```yaml
kind: Metal3RemediationTemplate
apiVersion: remediation.metal3.io/v1alphaX
metadata:
  name: M3_REMEDIATION_GROUP
  namespace: NAMESPACE_OF_UNHEALTHY_MACHINE
spec:
  template:
    spec:
      strategy: escalate
      deleteAfterRetries: 10
      powerOnTimeoutSeconds: 600
      powerOffTimeoutSeconds: 120
```

Metal3Remediation:
```yaml
apiVersion: remediation.metal3.io/v1alphaX
kind: Metal3Remediation
metadata:
  name: NAME_OF_UNHEALTHY_MACHINE
  namespace: NAMESPACE_OF_UNHEALTHY_MACHINE
  finalizers:
  - remediation.metal3.io
  ownerReferences:
  - apiVersion: cluster.x-k8s.io/v1alphaX
    kind: Machine
    name: NAME_OF_UNHEALTHY_MACHINE
    uid: ...
spec:
  strategy: escalate
  deleteAfterRetries: 10
  powerOnTimeoutSeconds: 600
  powerOffTimeoutSeconds: 120
status:
  phase: power-off
  retryCount: 1
```

#### MachineHealthCheck controller:
Watch:
- Watch MachineHealthCheck resources.
- Watch machines and nodes with an event handler, e.g. controller-runtime's `EnqueueRequestsFromMapFunc`, which returns MachineHealthCheck resources.

Reconcile:
- Fetch all the machines targeted by the MachineHealthCheck and operate over machine/node targets. E.g.:

```go
type target struct {
	Machine capi.Machine
	Node    *corev1.Node
	MHC     capi.MachineHealthCheck
}
```

- Calculate the number of unhealthy targets.
- Compare the current number against `unhealthyRange`, if specified, and temporarily short-circuit remediation if it is not within the range.
- If `unhealthyRange` is not specified, compare against the `maxUnhealthy` threshold and temporarily short-circuit remediation if the threshold is met.
- Either mark unhealthy target machines with conditions or create an external remediation CR, as described above.

Out of band:
- The owning controller observes the OwnerRemediated=False condition and is responsible for remediating the machine.
- The owning controller performs any pre-deletion tasks required.
- The owning controller MUST delete the machine.
- The machine controller drains the unhealthy node.
- The machine controller provider deletes the instance.

### Risks and Mitigations

#### Contradictory signal

If a user configures health checks such that more than one check applies to a machine, these health checks can become a source of contradictory information. For instance, one health check could pass for a given machine while another fails. This results in undefined behavior, since the order in which the checks are run and the timing of the remediating controller determine which condition is used for remediation decisions.

There are safeguards that could be put in place to prevent this, but they are out of scope for this proposal. Documentation should be updated to warn users to ensure their health checks do not overlap.

## Alternatives

This proposal was originally adopted and implemented with the MHC performing the deletion itself. This was revised because it did not allow more complex controllers, for example the Kubeadm Control Plane, to account for the remediation strategies they require.

It was also adopted in a later iteration using annotations on the machine. However, it was realized that this would amount to a privilege escalation vector, as the `machines.edit` permission would grant effective access to `machines.delete`. By using conditions, this requires `machines/status.edit`, which is a less common privilege.

We considered baking this functionality into machineSets.
This was discarded, as controllers other than a machineSet could own the targeted machines, and in those cases users still want to benefit from automated node remediation.

We considered allowing machineSets to be targeted instead of using a label selector.
This was discarded for the reason above.
Also, there might be upper-level controllers doing things that the MHC does not need to account for, e.g. a machineDeployment flipping machineSets during a rolling update.
Treating machines and label selectors as the fundamental operational entities therefore results in a good and convenient level of flexibility and decoupling from other controllers.

We considered a stricter short-circuiting mechanism decoupled from the machine health checker, i.e. a machine disruption budget analogous to a pod disruption budget.
This was discarded because it added an unjustified level of complexity and additional API resources.
Instead, we opt for a simpler approach and will consider any RFE that requires additional complexity based on real-use feedback.

We considered using conditions instead of external remediation.
The existing design does not solve all the problems required for fully external remediation that does not necessarily delete the Machine at the end of remediation, but it can be modified and expanded.

We could give up on the idea of a shared MHC and instead focus on providing a common library for deciding whether a Node is healthy. This creates a situation where we could end up with baremetal, Master, and Ceph MHC variants that somehow need to duplicate or coordinate their activities.

## Upgrade Strategy
This is an opt-in feature supported by a new CRD and controller, so there is no need to handle upgrades for existing clusters.

The external remediation feature adds a new optional field to the MachineHealthCheck CRD, so existing CRs will still be valid and will function as they previously did.
The templates are an all-new, opt-in feature, so upgrades are not impacted in this case.

## Additional Details

### Test Plan [optional]
Extensive unit testing for all the cases supported when remediating.

e2e testing as part of the cluster-api e2e test suite.

To catch failures early, we could consider a test suite leveraging kubemark as a provider to simulate healthy/unhealthy nodes in a cloud-agnostic manner, without the need to bring up real instances.

[testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md

### Graduation Criteria [optional]
This proposes that the new CRD belong to the same API group as other cluster-api resources, e.g. machine, machineSet, and follow the same release cadence.

### Version Skew Strategy [optional]

## Implementation History
- [x] 10/30/2019: Opened proposal PR
- [x] 04/15/2020: Revised to move reconciliation to owner controller
- [x] 08/04/2020: Added external remediation

<!-- Links -->
[community meeting]: https://docs.google.com/document/d/1Ys-DOR5UsgbMEeciuG0HOgDQc8kZsaWIWJeKJ1-UfbY