# Configure a MachineHealthCheck

## Prerequisites

Before attempting to configure a MachineHealthCheck, you should have a working [management cluster] with at least one MachineDeployment or MachineSet deployed.

<aside class="note warning">

<h1> Important </h1>

Please note that MachineHealthChecks currently **only** support Machines that are owned by a MachineSet or a KubeadmControlPlane.
Please review the [Limitations and Caveats of a MachineHealthCheck](#limitations-and-caveats-of-a-machinehealthcheck)
at the bottom of this page for full details of MachineHealthCheck limitations.

</aside>

## What is a MachineHealthCheck?

A MachineHealthCheck is a resource within the Cluster API which allows users to define conditions under which Machines within a Cluster should be considered unhealthy.
A MachineHealthCheck is defined on a management cluster and scoped to a particular workload cluster.

When defining a MachineHealthCheck, users specify a timeout for each of the conditions that they define to check on the Machine's Node.
If any of these conditions are met for the duration of the timeout, the Machine will be remediated.
By default, the action of remediating a Machine should trigger a new Machine to be created to replace the failed one, but providers are allowed to plug in more sophisticated external remediation solutions.

## Creating a MachineHealthCheck

Use the following example as a basis for creating a MachineHealthCheck for worker nodes:

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: capi-quickstart-node-unhealthy-5m
spec:
  # clusterName is required to associate this MachineHealthCheck with a particular cluster
  clusterName: capi-quickstart
  # (Optional) maxUnhealthy prevents further remediation if the cluster is already partially unhealthy
  maxUnhealthy: 40%
  # (Optional) nodeStartupTimeout determines how long a MachineHealthCheck should wait for
  # a Node to join the cluster, before considering a Machine unhealthy.
  # Defaults to 10 minutes if not specified.
  # Set to 0 to disable the node startup timeout.
  # Disabling this timeout will prevent a Machine from being considered unhealthy when
  # the Node it created has not yet registered with the cluster. This can be useful when
  # Nodes take a long time to start up or when you only want condition based checks for
  # Machine health.
  nodeStartupTimeout: 10m
  # selector is used to determine which Machines should be health checked
  selector:
    matchLabels:
      nodepool: nodepool-0
  # Conditions to check on Nodes for matched Machines, if any condition is matched for the duration of its timeout, the Machine is considered unhealthy
  unhealthyConditions:
  - type: Ready
    status: Unknown
    timeout: 300s
  - type: Ready
    status: "False"
    timeout: 300s
```
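The `selector` above only matches Machines that carry the `nodepool: nodepool-0` label. With a MachineDeployment, such labels are usually applied through the Machine template so that every Machine it creates inherits them; the following fragment is a minimal sketch (the MachineDeployment name and the elided fields are illustrative):

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  # Illustrative name; use the name of your existing MachineDeployment.
  name: capi-quickstart-md-0
spec:
  clusterName: capi-quickstart
  # ... other MachineDeployment fields remain unchanged
  template:
    metadata:
      labels:
        # Machines created from this template inherit this label and are
        # therefore matched by the selector in the MachineHealthCheck above.
        nodepool: nodepool-0
```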
Use this example as the basis for defining a MachineHealthCheck for control plane nodes managed via
the KubeadmControlPlane:

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: capi-quickstart-kcp-unhealthy-5m
spec:
  clusterName: capi-quickstart
  maxUnhealthy: 100%
  selector:
    matchLabels:
      cluster.x-k8s.io/control-plane: ""
  unhealthyConditions:
  - type: Ready
    status: Unknown
    timeout: 300s
  - type: Ready
    status: "False"
    timeout: 300s
```

<aside class="note warning">

<h1> Important </h1>

If you are defining more than one `MachineHealthCheck` for the same Cluster, make sure that the selectors **do not overlap**
in order to prevent conflicts or unexpected behaviors when trying to remediate the same set of machines.

</aside>

## Controlling remediation retries

<aside class="note warning">

<h1> Important </h1>

This feature is only available for KubeadmControlPlane.

</aside>

KubeadmControlPlane allows you to control how remediation happens by defining an optional `remediationStrategy`;
this feature can be used to prevent unnecessary load on the infrastructure provider, e.g. in case of quota problems, or to allow the infrastructure provider to stabilize in case of temporary problems.

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: my-control-plane
spec:
  ...
  remediationStrategy:
    maxRetry: 5
    retryPeriod: 2m
    minHealthyPeriod: 2h
```

`maxRetry` is the maximum number of retries while attempting to remediate an unhealthy machine.
A retry happens when a machine that was created as a replacement for an unhealthy machine also fails.
For example, given a control plane with three machines M1, M2, M3:

- M1 becomes unhealthy; remediation happens, and M1-1 is created as a replacement.
- If M1-1 (replacement of M1) has problems while bootstrapping it will become unhealthy, and then be
  remediated. This operation is considered a retry - remediation-retry #1.
- If M1-2 (replacement of M1-1) becomes unhealthy, remediation-retry #2 will happen, etc.

A retry will only happen after the `retryPeriod` from the previous retry has elapsed. If `retryPeriod` is not set (default), a retry will happen immediately.

If a machine is marked as unhealthy after `minHealthyPeriod` (default 1h) has passed since the previous remediation, this is no longer considered a retry because the new issue is assumed to be unrelated to the previous one.

If `maxRetry` is not set (default), remediation will be retried infinitely.

<aside class="note">

<h1> Retry again once maxRetry is exhausted</h1>

If for some reason you want to remediate again once `maxRetry` is exhausted, there are two options:
- Temporarily increase `maxRetry` (recommended)
- Remove the `controlplane.cluster.x-k8s.io/remediation-for` annotation from the unhealthy machine or decrease `retryCount` in the annotation value.

</aside>
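For the second option above, the annotation lives on the unhealthy replacement machine and its value is a small JSON document that records the remediation chain. The sketch below is illustrative only: the machine names are placeholders and the exact layout of the annotation value may differ between Cluster API versions.

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Machine
metadata:
  # Illustrative name of the unhealthy replacement machine.
  name: my-control-plane-abcde
  annotations:
    # Decreasing retryCount (or removing this annotation entirely) lets KCP
    # remediate this machine again even though maxRetry has been reached.
    # The other fields in the value are illustrative assumptions.
    controlplane.cluster.x-k8s.io/remediation-for: '{"machine":"my-control-plane-fghij","timestamp":"2024-01-01T00:00:00Z","retryCount":5}'
```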
## Remediation Short-Circuiting

To ensure that MachineHealthChecks only remediate Machines when the cluster is healthy,
short-circuiting is implemented to prevent further remediation via the `maxUnhealthy` and `unhealthyRange` fields within the MachineHealthCheck spec.

### Max Unhealthy

If the user defines a value for the `maxUnhealthy` field (either an absolute number or a percentage of the total Machines checked by this MachineHealthCheck),
before remediating any Machines, the MachineHealthCheck will compare the value of `maxUnhealthy` with the number of Machines it has determined to be unhealthy.
If the number of unhealthy Machines exceeds the limit set by `maxUnhealthy`, remediation will **not** be performed.

<aside class="note warning">

<h1> Warning </h1>

The default value for `maxUnhealthy` is `100%`.
This means the short circuiting mechanism is **disabled by default** and Machines will be remediated no matter the state of the cluster.

</aside>

#### With an Absolute Value

If `maxUnhealthy` is set to `2`:
- If 2 or fewer nodes are unhealthy, remediation will be performed
- If 3 or more nodes are unhealthy, remediation will not be performed

These values are independent of how many Machines are being checked by the MachineHealthCheck.

#### With Percentages

If `maxUnhealthy` is set to `40%` and there are 25 Machines being checked:
- If 10 or fewer nodes are unhealthy, remediation will be performed
- If 11 or more nodes are unhealthy, remediation will not be performed

If `maxUnhealthy` is set to `40%` and there are 6 Machines being checked:
- If 2 or fewer nodes are unhealthy, remediation will be performed
- If 3 or more nodes are unhealthy, remediation will not be performed

Note, when the percentage is not a whole number, the allowed number is rounded down.

### Unhealthy Range

If the user defines a value for the `unhealthyRange` field (bracketed values that specify a start and an end value), before remediating any Machines,
the MachineHealthCheck will check if the number of Machines it has determined to be unhealthy is within the range specified by `unhealthyRange`.
If it is not within the range set by `unhealthyRange`, remediation will **not** be performed.

<aside class="note warning">

<h1> Important </h1>

If both `maxUnhealthy` and `unhealthyRange` are specified, `unhealthyRange` takes precedence.

</aside>

#### With a range of values

If `unhealthyRange` is set to `[3-5]` and there are 10 Machines being checked:
- If 2 or fewer nodes are unhealthy, remediation will not be performed.
- If 6 or more nodes are unhealthy, remediation will not be performed.
- In all other cases, remediation will be performed.

Note, the above example uses 10 machines as the sample set, but this works the same way for any other number.
This is useful for dynamically scaling clusters where the number of machines keeps changing frequently.
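For reference, `unhealthyRange` is expressed as a quoted `"[<min>-<max>]"` string on the MachineHealthCheck spec, in place of (or alongside) `maxUnhealthy`. The following is a minimal sketch that reuses the names from the examples earlier on this page; the MachineHealthCheck name itself is illustrative:

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  # Illustrative name.
  name: capi-quickstart-unhealthy-range
spec:
  clusterName: capi-quickstart
  # Remediation is only performed while the number of unhealthy Machines
  # is between 3 and 5 (inclusive).
  unhealthyRange: "[3-5]"
  selector:
    matchLabels:
      nodepool: nodepool-0
  unhealthyConditions:
  - type: Ready
    status: Unknown
    timeout: 300s
  - type: Ready
    status: "False"
    timeout: 300s
```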
## Skipping Remediation

There are scenarios where remediation for a machine may be undesirable (e.g. during cluster migration using `clusterctl move`). For such cases, MachineHealthCheck provides two mechanisms to skip machines for remediation.

Implicit skipping when the resource is paused (using the `cluster.x-k8s.io/paused` annotation):
- When a cluster is paused, none of the machines in that cluster are considered for remediation.
- When a machine is paused, only that machine is not considered for remediation.
- A cluster or a machine is usually paused automatically by Cluster API when it detects a migration.

Explicit skipping using the `cluster.x-k8s.io/skip-remediation` annotation:
- Users can also skip any machine for remediation by setting the `cluster.x-k8s.io/skip-remediation` annotation on that machine.

## Limitations and Caveats of a MachineHealthCheck

Before deploying a MachineHealthCheck, please familiarise yourself with the following limitations and caveats:

- Only Machines owned by a MachineSet or a KubeadmControlPlane can be remediated by a MachineHealthCheck (since a MachineDeployment uses a MachineSet, this includes Machines that are part of a MachineDeployment)
- Machines managed by a KubeadmControlPlane are remediated according to [the delete-and-recreate guidelines described in the KubeadmControlPlane proposal](https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20191017-kubeadm-based-control-plane.md#remediation-using-delete-and-recreate)
- The following rules should be satisfied in order to start remediation of a control plane machine:
  - One of the following applies:
    - The cluster MUST not be initialized yet (the failure happens before KCP reaches the initialized state)
    - The cluster MUST have at least two control plane machines, because this is the smallest cluster size that can be remediated.
  - Previous remediation (delete and re-create) MUST have been completed. This rule prevents KCP from remediating more machines while the replacement for the previous machine is not yet created.
  - The cluster MUST have no machines with a deletion timestamp. This rule prevents KCP from taking actions while the cluster is in a transitional state.
  - Remediation MUST preserve etcd quorum. This rule ensures that we will not remove a member that would result in etcd losing a majority of members and thus become unable to field new requests (note: this rule applies only to a control plane that is already initialized and uses managed etcd)
- If the Node for a Machine is removed from the cluster, a MachineHealthCheck will consider this Machine unhealthy and remediate it immediately
- If no Node joins the cluster for a Machine after the `NodeStartupTimeout`, the Machine will be remediated
- If a Machine fails for any reason (if the FailureReason is set), the Machine will be remediated immediately

<!-- links -->
[management cluster]: ../../reference/glossary.md#management-cluster