# Configure a MachineHealthCheck

## Prerequisites

Before attempting to configure a MachineHealthCheck, you should have a working [management cluster] with at least one MachineDeployment or MachineSet deployed.

<aside class="note warning">

<h1> Important </h1>

Please note that MachineHealthChecks currently **only** support Machines that are owned by a MachineSet or a KubeadmControlPlane.
Please review the [Limitations and Caveats of a MachineHealthCheck](#limitations-and-caveats-of-a-machinehealthcheck)
at the bottom of this page for full details of MachineHealthCheck limitations.

</aside>

## What is a MachineHealthCheck?

A MachineHealthCheck is a resource within the Cluster API which allows users to define conditions under which Machines within a Cluster should be considered unhealthy.
A MachineHealthCheck is defined on a management cluster and scoped to a particular workload cluster.

When defining a MachineHealthCheck, users specify a timeout for each of the conditions that they define to check on the Machine's Node.
If any of these conditions are met for the duration of the timeout, the Machine will be remediated.
By default, the action of remediating a Machine should trigger a new Machine to be created to replace the failed one, but providers are allowed to plug in more sophisticated external remediation solutions.

## Creating a MachineHealthCheck

Use the following example as a basis for creating a MachineHealthCheck for worker nodes:

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: capi-quickstart-node-unhealthy-5m
spec:
  # clusterName is required to associate this MachineHealthCheck with a particular cluster
  clusterName: capi-quickstart
  # (Optional) maxUnhealthy prevents further remediation if the cluster is already partially unhealthy
  maxUnhealthy: 40%
  # (Optional) nodeStartupTimeout determines how long a MachineHealthCheck should wait for
  # a Node to join the cluster, before considering a Machine unhealthy.
  # Defaults to 10 minutes if not specified.
  # Set to 0 to disable the node startup timeout.
  # Disabling this timeout will prevent a Machine from being considered unhealthy when
  # the Node it created has not yet registered with the cluster. This can be useful when
  # Nodes take a long time to start up or when you only want condition based checks for
  # Machine health.
  nodeStartupTimeout: 10m
  # selector is used to determine which Machines should be health checked
  selector:
    matchLabels:
      nodepool: nodepool-0
  # Conditions to check on Nodes for matched Machines; if any condition is matched for the duration of its timeout, the Machine is considered unhealthy
  unhealthyConditions:
  - type: Ready
    status: Unknown
    timeout: 300s
  - type: Ready
    status: "False"
    timeout: 300s
```

Use this example as the basis for defining a MachineHealthCheck for control plane nodes managed via
the KubeadmControlPlane:

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: capi-quickstart-kcp-unhealthy-5m
spec:
  clusterName: capi-quickstart
  maxUnhealthy: 100%
  selector:
    matchLabels:
      cluster.x-k8s.io/control-plane: ""
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 300s
    - type: Ready
      status: "False"
      timeout: 300s
```

<aside class="note warning">

<h1> Important </h1>

If you are defining more than one `MachineHealthCheck` for the same Cluster, make sure that the selectors **do not overlap**
in order to prevent conflicts or unexpected behaviors when trying to remediate the same set of machines.

</aside>
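
As a sketch of what non-overlapping selectors can look like (the names and the `nodepool-1` label value are illustrative, not taken from this page), two MachineHealthChecks for the same Cluster can each target a distinct set of Machines:

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: nodepool-0-unhealthy-5m
spec:
  clusterName: capi-quickstart
  selector:
    matchLabels:
      nodepool: nodepool-0
  unhealthyConditions:
  - type: Ready
    status: Unknown
    timeout: 300s
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: nodepool-1-unhealthy-5m
spec:
  clusterName: capi-quickstart
  selector:
    matchLabels:
      # A different label value, so the two selectors never match the same Machine.
      nodepool: nodepool-1
  unhealthyConditions:
  - type: Ready
    status: Unknown
    timeout: 300s
```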

## Controlling remediation retries

<aside class="note warning">

<h1> Important </h1>

This feature is only available for KubeadmControlPlane.

</aside>

KubeadmControlPlane allows you to control how remediation happens by defining an optional `remediationStrategy`;
this feature can be used to prevent unnecessary load on the infrastructure provider, e.g. in case of quota problems, or to allow the infrastructure provider to stabilize in case of temporary problems.

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: my-control-plane
spec:
  ...
  remediationStrategy:
    maxRetry: 5
    retryPeriod: 2m
    minHealthyPeriod: 2h
```

`maxRetry` is the maximum number of retries while attempting to remediate an unhealthy machine.
A retry happens when a machine that was created as a replacement for an unhealthy machine also fails.
For example, given a control plane with three machines M1, M2, M3:

- M1 becomes unhealthy; remediation happens, and M1-1 is created as a replacement.
- If M1-1 (the replacement of M1) has problems while bootstrapping, it will become unhealthy and then be
  remediated. This operation is considered a retry - remediation-retry #1.
- If M1-2 (replacement of M1-1) becomes unhealthy, remediation-retry #2 will happen, etc.

A retry will only happen after the `retryPeriod` from the previous retry has elapsed. If `retryPeriod` is not set (default), a retry will happen immediately.

If a machine is marked as unhealthy after `minHealthyPeriod` (default 1h) has passed since the previous remediation, this is no longer considered a retry because the new issue is assumed to be unrelated to the previous one.

If `maxRetry` is not set (default), remediation will be retried infinitely.

<aside class="note">

<h1> Retry again once maxRetry is exhausted</h1>

If for some reason you want to remediate again once `maxRetry` is exhausted, there are two options:
- Temporarily increase `maxRetry` (recommended)
- Remove the `controlplane.cluster.x-k8s.io/remediation-for` annotation from the unhealthy machine or decrease `retryCount` in the annotation value.

</aside>

## Remediation Short-Circuiting

To ensure that MachineHealthChecks only remediate Machines when the cluster is healthy,
short-circuiting is implemented to prevent further remediation via the `maxUnhealthy` and `unhealthyRange` fields within the MachineHealthCheck spec.

### Max Unhealthy

If the user defines a value for the `maxUnhealthy` field (either an absolute number or a percentage of the total Machines checked by this MachineHealthCheck),
before remediating any Machines, the MachineHealthCheck will compare the value of `maxUnhealthy` with the number of Machines it has determined to be unhealthy.
If the number of unhealthy Machines exceeds the limit set by `maxUnhealthy`, remediation will **not** be performed.

<aside class="note warning">

<h1> Warning </h1>

The default value for `maxUnhealthy` is `100%`.
This means the short-circuiting mechanism is **disabled by default**, and Machines will be remediated no matter the state of the cluster.

</aside>

#### With an Absolute Value

If `maxUnhealthy` is set to `2`:
- If 2 or fewer nodes are unhealthy, remediation will be performed
- If 3 or more nodes are unhealthy, remediation will not be performed

These values are independent of how many Machines are being checked by the MachineHealthCheck.
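
As a minimal sketch (reusing the illustrative names from the worker example above), an absolute value is set directly in the spec:

```yaml
spec:
  clusterName: capi-quickstart
  # Remediation is short-circuited as soon as more than 2 Machines matched by
  # this MachineHealthCheck are unhealthy at the same time.
  maxUnhealthy: 2
  selector:
    matchLabels:
      nodepool: nodepool-0
```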

#### With Percentages

If `maxUnhealthy` is set to `40%` and there are 25 Machines being checked:
- If 10 or fewer nodes are unhealthy, remediation will be performed
- If 11 or more nodes are unhealthy, remediation will not be performed

If `maxUnhealthy` is set to `40%` and there are 6 Machines being checked:
- If 2 or fewer nodes are unhealthy, remediation will be performed
- If 3 or more nodes are unhealthy, remediation will not be performed

Note that when the percentage does not resolve to a whole number of Machines, the allowed number is rounded down (for example, 40% of 6 is 2.4, so at most 2 unhealthy Machines are tolerated).

### Unhealthy Range

If the user defines a value for the `unhealthyRange` field (bracketed values that specify a start and an end value), before remediating any Machines,
the MachineHealthCheck will check if the number of Machines it has determined to be unhealthy is within the range specified by `unhealthyRange`.
If it is not within the range set by `unhealthyRange`, remediation will **not** be performed.

<aside class="note warning">

<h1> Important </h1>

If both `maxUnhealthy` and `unhealthyRange` are specified, `unhealthyRange` takes precedence.

</aside>

#### With a range of values

If `unhealthyRange` is set to `[3-5]` and there are 10 Machines being checked:
- If 2 or fewer nodes are unhealthy, remediation will not be performed.
- If 6 or more nodes are unhealthy, remediation will not be performed.
- In all other cases, remediation will be performed.

Note that the above example uses 10 machines as the sample set, but it works the same way for any other number.
This is useful for dynamically scaling clusters where the number of machines keeps changing frequently.
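
As a minimal sketch (again reusing the illustrative names from the worker example above), `unhealthyRange` is set as a quoted string in the spec:

```yaml
spec:
  clusterName: capi-quickstart
  # Remediation only proceeds while the number of unhealthy Machines matched by
  # this MachineHealthCheck is between 3 and 5 (inclusive).
  unhealthyRange: "[3-5]"
  selector:
    matchLabels:
      nodepool: nodepool-0
```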

## Skipping Remediation

There are scenarios where remediation for a machine may be undesirable (e.g. during cluster migration using `clusterctl move`). For such cases, MachineHealthCheck provides two mechanisms to skip machines for remediation.

Implicit skipping when the resource is paused (using `cluster.x-k8s.io/paused` annotation):
- When a cluster is paused, none of the machines in that cluster are considered for remediation.
- When a machine is paused, only that machine is not considered for remediation.
- A cluster or a machine is usually paused automatically by Cluster API when it detects a migration.
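
For example, a Machine can also be paused by hand with the annotation below (a sketch; the Machine name is illustrative and the annotation value can be left empty):

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Machine
metadata:
  # Illustrative Machine name
  name: capi-quickstart-md-0-abcde
  annotations:
    # While this annotation is present, the Machine is paused and therefore
    # not considered for remediation.
    cluster.x-k8s.io/paused: ""
```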

Explicit skipping using `cluster.x-k8s.io/skip-remediation` annotation:
- Users can also skip any machine for remediation by setting the `cluster.x-k8s.io/skip-remediation` annotation on that machine.
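
As a sketch (the Machine name is illustrative and the annotation value can be left empty):

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Machine
metadata:
  # Illustrative Machine name
  name: capi-quickstart-md-0-fghij
  annotations:
    # Machines carrying this annotation are skipped by MachineHealthCheck remediation.
    cluster.x-k8s.io/skip-remediation: ""
```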

## Limitations and Caveats of a MachineHealthCheck

Before deploying a MachineHealthCheck, please familiarise yourself with the following limitations and caveats:

- Only Machines owned by a MachineSet or a KubeadmControlPlane can be remediated by a MachineHealthCheck (since a MachineDeployment uses a MachineSet, this includes Machines that are part of a MachineDeployment)
- Machines managed by a KubeadmControlPlane are remediated according to [the delete-and-recreate guidelines described in the KubeadmControlPlane proposal](https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20191017-kubeadm-based-control-plane.md#remediation-using-delete-and-recreate)
  - The following rules should be satisfied in order to start remediation of a control plane machine:
    - One of the following applies:
        - The cluster MUST not be initialized yet (the failure happens before KCP reaches the initialized state)
        - The cluster MUST have at least two control plane machines, because this is the smallest cluster size that can be remediated.
    - Previous remediation (delete and re-create) MUST have been completed. This rule prevents KCP from remediating more machines while the replacement for the previous machine has not yet been created.
    - The cluster MUST have no machines with a deletion timestamp. This rule prevents KCP from taking actions while the cluster is in a transitional state.
    - Remediation MUST preserve etcd quorum. This rule ensures that we will not remove a member that would result in etcd losing a majority of members and thus becoming unable to field new requests (note: this rule applies only to control planes that are already initialized and use managed etcd)
- If the Node for a Machine is removed from the cluster, a MachineHealthCheck will consider this Machine unhealthy and remediate it immediately
- If no Node joins the cluster for a Machine after the `NodeStartupTimeout`, the Machine will be remediated
- If a Machine fails for any reason (if the FailureReason is set), the Machine will be remediated immediately

<!-- links -->
[management cluster]: ../../reference/glossary.md#management-cluster