
---
title: Machine health checking a.k.a node auto repair
authors:
  - "@enxebre"
  - "@bison"
  - "@benmoss"
reviewers:
  - "@detiber"
  - "@vincepri"
  - "@ncdc"
  - "@timothysc"
creation-date: 2019-10-30
last-updated: 2021-01-28
status: implementable
see-also:
replaces:
superseded-by:
---

# Title
- Machine health checking a.k.a node auto repair

## Table of Contents

<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->

- [Glossary](#glossary)
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals/Future Work](#non-goalsfuture-work)
- [Proposal](#proposal)
  - [Unhealthy criteria:](#unhealthy-criteria)
  - [Remediation:](#remediation)
    - [Conditions VS External Remediation](#conditions-vs-external-remediation)
  - [User Stories](#user-stories)
    - [Story 1](#story-1)
    - [Story 2](#story-2)
    - [Story 3 (external remediation)](#story-3-external-remediation)
    - [Story 4 (external remediation)](#story-4-external-remediation)
    - [Story 5 (external remediation)](#story-5-external-remediation)
  - [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints)
    - [MachineHealthCheck CRD:](#machinehealthcheck-crd)
    - [Machine conditions:](#machine-conditions)
    - [External Remediation](#external-remediation)
      - [Example CRs](#example-crs)
    - [MachineHealthCheck controller:](#machinehealthcheck-controller)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Contradictory signal](#contradictory-signal)
- [Alternatives](#alternatives)
- [Upgrade Strategy](#upgrade-strategy)
- [Additional Details](#additional-details)
  - [Test Plan [optional]](#test-plan-optional)
  - [Graduation Criteria [optional]](#graduation-criteria-optional)
  - [Version Skew Strategy [optional]](#version-skew-strategy-optional)
- [Implementation History](#implementation-history)

<!-- END doctoc generated TOC please keep comment here to allow auto update -->

## Glossary
Refer to the [Cluster API Book Glossary](https://cluster-api.sigs.k8s.io/reference/glossary.html).

## Summary
Enable opt-in automated health checking and remediation of unhealthy nodes backed by machines.

## Motivation
- Reduce the administrative overhead of running a cluster.
- Increase the ability to respond to machine failures and keep the cluster's nodes healthy.

### Goals
- Enable automated remediation for groups of machines/nodes (e.g. a machineSet).
- Allow users to define different health criteria based on node conditions for different groups of nodes.
- Provide a means for the cluster administrator to configure thresholds for disabling automated remediation when multiple nodes are unhealthy at the same time.
- Facilitate rapid experimentation by creating the ability to define customized remediation flows outside of the Machine Health Check and CAPI codebase.

### Non-Goals/Future Work
- Record long-term stable history of all health-check failures or remediations.
- Provide a mechanism to guarantee that application quorum for N members is maintained at any time.
- Create an escalation path from failed external remediation attempts to machine deletion.
- Provide a finalizer-like pre-hook mechanism for allowing customization or blocking of the remediation process prior to power-cycling a machine or deleting it. This concept may already be covered as part of a separate enhancement.

## Proposal
The machine health checker (MHC) is responsible for flagging the machines that back unhealthy nodes.

MHC requests a remediation in one of the following ways:
- Applying a condition which the owning controller consumes to remediate the machine (default).
- Creating a CR based on a template which signals an external component to remediate the machine.

It provides a short-circuit mechanism that limits remediation when the number of unhealthy machines in a targeted group is not within the `unhealthyRange`, or has reached the `maxUnhealthy` threshold, with `unhealthyRange` taking precedence.
This is similar to what the node lifecycle controller does to reduce the eviction rate as nodes become unhealthy in a given zone, e.g. when a large number of nodes in a single zone are down due to a networking issue.
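
For illustration, a sketch of how the two thresholds could be set on a single MachineHealthCheck. The `unhealthyRange` interval syntax shown (`"[min-max]"`) and its availability in this API version are assumptions to verify against the CRD reference for your release:

```yaml
apiVersion: cluster.x-k8s.io/v1alpha3
kind: MachineHealthCheck
metadata:
  name: example-short-circuit
spec:
  selector:
    matchLabels:
      role: worker
  # Remediation proceeds only while the number of unhealthy machines stays
  # within this closed interval; it takes precedence over maxUnhealthy.
  unhealthyRange: "[0-2]"
  # Fallback threshold used when unhealthyRange is not specified.
  maxUnhealthy: "40%"
  unhealthyConditions:
  - type: "Ready"
    status: "Unknown"
    timeout: "5m"
```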

The machine health checker is an integration point between node problem detection tooling, expressed as node conditions, and remediation, to achieve a node auto repairing feature.

### Unhealthy criteria:
A machine is unhealthy when:
- The referenced node meets the defined unhealthy node conditions criteria.
- The Machine has no nodeRef.
- The Machine has a nodeRef but the referenced node is not found.

If any of those criteria are met for longer than the given timeouts, and the number of unhealthy machines is either within the `unhealthyRange` (if specified) or has not reached the `maxUnhealthy` threshold, the machine will be marked as failing the health check.

Timeouts:
- For the node conditions, the timeouts are defined by the admin.
- For a machine with no nodeRef, an opinionated value could be assumed, e.g. 10 min.

### Remediation:
- Remediation is not an integral part or responsibility of the MachineHealthCheck. This controller only signals that a Machine is unhealthy so that others can act on it in the best way possible.

#### Conditions VS External Remediation

Condition-based remediation offers no remediation other than deleting an unhealthy Machine and replacing it with a new one.

Environments consisting of hardware-based clusters are significantly slower to (re)provision unhealthy machines, so they need a remediation flow that includes at least one attempt at power-cycling unhealthy nodes.

Other environments and vendors also have specific remediation requirements, so there is a need to provide a generic mechanism for implementing custom remediation logic.

### User Stories

#### Story 1
As a user of a Workload Cluster, I only care about my app's availability, so I want my cluster infrastructure to be self-healing and the nodes to be remediated transparently in the case of failures.

#### Story 2
As an operator of a Management Cluster, I want my machines to be self-healing and to be recreated, resulting in a new healthy node whenever my unhealthy criteria are matched.

#### Story 3 (external remediation)
As an admin of a hardware-based cluster, I would like unhealthy nodes to be power-cycled, so that I can recover from transient errors faster and begin application recovery sooner.

#### Story 4 (external remediation)
As an admin of a hardware-based cluster, I would like unhealthy nodes to be power-cycled, so that I can detect non-transient issues faster.

#### Story 5 (external remediation)
As an admin of a hardware-based cluster, I would like the system to keep attempting to power-cycle unhealthy nodes, so that they are automatically added back to the cluster when I fix the underlying problem.

### Implementation Details/Notes/Constraints

#### MachineHealthCheck CRD:
- Enable watching a group of machines (based on a label selector).
- Enable defining unhealthy node criteria (based on a list of node conditions).
- Enable setting a threshold of unhealthy nodes. If the current number is at or above this threshold, no further remediation will take place. This can be expressed as an int or as a percentage of the total targets in the pool.

E.g.:
- I want a machine to be remediated when its associated node has a `ready=false` or `ready=Unknown` condition for more than 5m.
- I want to disable auto-remediation if 40% or more of the matching machines are unhealthy.

```yaml
apiVersion: cluster.x-k8s.io/v1alpha3
kind: MachineHealthCheck
metadata:
  name: example
  namespace: machine-api
spec:
  selector:
    matchLabels:
      role: worker
  unhealthyConditions:
  - type:    "Ready"
    status:  "Unknown"
    timeout: "5m"
  - type:    "Ready"
    status:  "False"
    timeout: "5m"
  maxUnhealthy: "40%"
status:
  currentHealthy: 5
  expectedMachines: 5
```

#### Machine conditions:

```go
const ConditionHealthCheckSucceeded ConditionType = "HealthCheckSucceeded"
const ConditionOwnerRemediated ConditionType = "OwnerRemediated"
```

- Both of these conditions are applied by the MHC controller. HealthCheckSucceeded should only be updated by the MHC after running the health check.
- If a health check passes after it has failed, the conditions will not be updated. When in-place remediation is needed, we can address the challenges around this.
- OwnerRemediated is set to False after a health check fails, but should be changed to True by the owning controller after remediation succeeds.
- If remediation fails, OwnerRemediated can be updated to a higher severity and its reason can be updated to aid in troubleshooting.

This is the default remediation strategy.
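
A minimal sketch of the condition handshake between the MHC and an owning controller, using plain structs in place of the real CAPI types (all names besides the two condition types are illustrative assumptions):

```go
package main

import "fmt"

// Simplified stand-ins for the CAPI condition machinery.
type ConditionType string
type Condition struct {
	Type   ConditionType
	Status string // "True" or "False"
}

const (
	ConditionHealthCheckSucceeded ConditionType = "HealthCheckSucceeded"
	ConditionOwnerRemediated      ConditionType = "OwnerRemediated"
)

type Machine struct {
	Name       string
	Conditions map[ConditionType]Condition
}

func setCondition(m *Machine, t ConditionType, status string) {
	if m.Conditions == nil {
		m.Conditions = map[ConditionType]Condition{}
	}
	m.Conditions[t] = Condition{Type: t, Status: status}
}

// markUnhealthy is what the MHC does after a failed health check:
// OwnerRemediated=False asks the owning controller to remediate.
func markUnhealthy(m *Machine) {
	setCondition(m, ConditionHealthCheckSucceeded, "False")
	setCondition(m, ConditionOwnerRemediated, "False")
}

// ownerReconcile is what an owning controller (e.g. a machineSet) might do:
// notice OwnerRemediated=False, remediate (delete/replace), then mark success.
func ownerReconcile(m *Machine) (remediated bool) {
	if c, ok := m.Conditions[ConditionOwnerRemediated]; ok && c.Status == "False" {
		// ... pre-deletion tasks and machine deletion would happen here ...
		setCondition(m, ConditionOwnerRemediated, "True")
		return true
	}
	return false
}

func main() {
	m := &Machine{Name: "worker-0"}
	markUnhealthy(m)
	// The owner acts because OwnerRemediated is False.
	fmt.Println(ownerReconcile(m))
}
```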

#### External Remediation

A generic mechanism for supporting externally provided custom remediation strategies.

We propose modifying the MachineHealthCheck CRD to support a remediationTemplate, an ObjectReference to a provider-specific template CRD.

If no value for remediationTemplate is defined for the MachineHealthCheck CR, the condition-based flow is preserved.

If a value for remediationTemplate is supplied and the Machine enters an unhealthy state, the template will be instantiated using existing CAPI functionality, with the same name and namespace as the target Machine, and the remediation flow will be passed to an External Remediation Controller (ERC) watching for that CR.

No further action (deletion or applying conditions) will be taken by the MachineHealthCheck controller until the Node becomes healthy, at which point it will locate and delete the instantiated remediation CR.

```go
type MachineHealthCheckSpec struct {
    ...

    // +optional
    RemediationTemplate *ObjectReference `json:"remediationTemplate,omitempty"`
}
```

When a Machine enters an unhealthy state, the MHC will:
* Look up the referenced template
* Instantiate the template (for simplicity, we will refer to this as an External Machine Remediation CR, or EMR)
* Force the name and namespace to match the unhealthy Machine
* Save the new object in etcd

We use the same name and namespace for the External Machine Remediation CR to ensure uniqueness and lessen the possibility of multiple parallel remediations of the same Machine.

The lifespan of the EMRs is that of the remediation process; they are not intended to be a record of past events.
The EMR will also contain an ownerRef to the Machine, to ensure that it does not outlive the Machine it references.

The only signaling between the MHC and the external controller watching EMR CRs is the creation and deletion of the EMR itself.
Any actions or changes that admins should be informed about should be emitted as events for consoles and UIs to consume if necessary.
These events are informational only and do not result in or expect any behaviour from the MHC, Node, or Machine.

When the external remediation controller detects the new EMR, it starts remediation and performs whatever actions it deems appropriate until the EMR is deleted by the MHC.
It is a detail of the ERC when and how to retry remediation in the event that an EMR is not deleted after the ERC considers remediation complete.

The ERC may wish to register a finalizer on its CR to ensure it has an opportunity to perform any additional cleanup in case the unhealthy state was transient and the Node returned to a healthy state prior to the completion of the full custom ERC flow.

##### Example CRs

MachineHealthCheck:
```yaml
kind: MachineHealthCheck
apiVersion: cluster.x-k8s.io/v1alphaX
metadata:
  name: REMEDIATION_GROUP
  namespace: NAMESPACE_OF_UNHEALTHY_MACHINE
spec:
  selector:
    matchLabels:
      ...
  remediationTemplate:
    kind: Metal3RemediationTemplate
    apiVersion: remediation.metal3.io/v1alphaX
    name: M3_REMEDIATION_GROUP
```

Metal3RemediationTemplate:
```yaml
kind: Metal3RemediationTemplate
apiVersion: remediation.metal3.io/v1alphaX
metadata:
  name: M3_REMEDIATION_GROUP
  namespace: NAMESPACE_OF_UNHEALTHY_MACHINE
spec:
  template:
    spec:
      strategy:               escalate
      deleteAfterRetries:     10
      powerOnTimeoutSeconds:  600
      powerOffTimeoutSeconds: 120
```

Metal3Remediation:
```yaml
apiVersion: remediation.metal3.io/v1alphaX
kind: Metal3Remediation
metadata:
  name: NAME_OF_UNHEALTHY_MACHINE
  namespace: NAMESPACE_OF_UNHEALTHY_MACHINE
  finalizers:
  - remediation.metal3.io
  ownerReferences:
  - apiVersion: cluster.x-k8s.io/v1alphaX
    kind: Machine
    name: NAME_OF_UNHEALTHY_MACHINE
    uid: ...
spec:
  strategy:               escalate
  deleteAfterRetries:     10
  powerOnTimeoutSeconds:  600
  powerOffTimeoutSeconds: 120
status:
  phase: power-off
  retryCount: 1
```

#### MachineHealthCheck controller:
Watch:
- Watch machineHealthCheck resources.
- Watch machines and nodes with an event handler, e.g. controller-runtime's `EnqueueRequestsFromMapFunc`, which returns machineHealthCheck resources.

Reconcile:
- Fetch all the machines targeted by the MachineHealthCheck and operate over machine/node targets. E.g.:
```go
type target struct {
  Machine capi.Machine
  Node    *corev1.Node
  MHC     capi.MachineHealthCheck
}
```

- Calculate the number of unhealthy targets.
- Compare the current number against `unhealthyRange`, if specified, and temporarily short-circuit remediation if it's not within the range.
- If `unhealthyRange` is not specified, compare against the `maxUnhealthy` threshold and temporarily short-circuit remediation if the threshold is met.
- Either mark unhealthy target machines with conditions or create an external remediation CR, as described above.
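
The short-circuit decision above can be sketched as a self-contained helper (the `"[min-max]"` range syntax and the integer-percentage rounding here are assumptions for illustration, not the exact controller code):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// remediationAllowed decides whether the MHC may remediate, given the total
// and unhealthy target counts, an optional unhealthyRange such as "[0-2]",
// and a maxUnhealthy value that is an int ("2") or a percentage ("40%").
func remediationAllowed(total, unhealthy int, unhealthyRange, maxUnhealthy string) (bool, error) {
	// unhealthyRange takes precedence when specified.
	if unhealthyRange != "" {
		bounds := strings.Split(strings.Trim(unhealthyRange, "[]"), "-")
		if len(bounds) != 2 {
			return false, fmt.Errorf("invalid unhealthyRange %q", unhealthyRange)
		}
		lo, err := strconv.Atoi(bounds[0])
		if err != nil {
			return false, err
		}
		hi, err := strconv.Atoi(bounds[1])
		if err != nil {
			return false, err
		}
		return unhealthy >= lo && unhealthy <= hi, nil
	}
	// Otherwise fall back to the maxUnhealthy threshold.
	threshold := 0
	if strings.HasSuffix(maxUnhealthy, "%") {
		pct, err := strconv.Atoi(strings.TrimSuffix(maxUnhealthy, "%"))
		if err != nil {
			return false, err
		}
		threshold = total * pct / 100
	} else {
		n, err := strconv.Atoi(maxUnhealthy)
		if err != nil {
			return false, err
		}
		threshold = n
	}
	// Short-circuit once the unhealthy count is at or above the threshold.
	return unhealthy < threshold, nil
}

func main() {
	ok, _ := remediationAllowed(5, 1, "", "40%") // 1 < 2 (40% of 5)
	fmt.Println(ok)
}
```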

Out of band:
- The owning controller observes the OwnerRemediated=False condition and is responsible for remediating the machine.
- The owning controller performs any pre-deletion tasks required.
- The owning controller MUST delete the machine.
- The machine controller drains the unhealthy node.
- The machine controller provider deletes the instance.

![Machine health check](./images/machine-health-check/mhc.svg)

### Risks and Mitigations

## Contradictory signal

If a user configures healthchecks such that more than one check applies to machines, these healthchecks can become a source of contradictory information. For instance, one healthcheck could pass for a given machine while another fails. This results in undefined behavior, since the order in which the checks run and the timing of the remediating controller determine which condition is used for remediation decisions.

There are safeguards that could be put in place to prevent this, but they are out of scope for this proposal. Documentation should be updated to warn users to ensure their healthchecks do not overlap.

## Alternatives

This proposal was originally adopted and implemented with the MHC doing the deletion itself.
This was revised since it did not allow more complex controllers, for example the Kubeadm Control Plane, to account for the remediation strategies they require.

It was also adopted in a later iteration using annotations on the machine.
However, it was realized that this would amount to a privilege escalation vector, as the `machines.edit` permission would grant effective access to `machines.delete`.
By using conditions, this requires `machines/status.edit`, which is a less common privilege.

We considered baking this functionality into machineSets.
This was discarded because controllers other than a machineSet could own the targeted machines, and in those cases users still want to benefit from automated node remediation.

We considered allowing MachineHealthChecks to target machineSets instead of using a label selector.
This was discarded for the reason above.
Also, there might be upper-level controllers doing things that the MHC does not need to account for, e.g. a machineDeployment flipping machineSets for a rolling update.
Therefore, treating machines and label selectors as the fundamental operational entities results in a good and convenient level of flexibility and decoupling from other controllers.

We considered a stricter short-circuiting mechanism decoupled from the machine health checker, i.e. a machine disruption budget analogous to a pod disruption budget.
This was discarded because it added an unjustified level of complexity and additional API resources.
Instead, we opt for a simpler approach and will consider RFEs that require additional complexity based on feedback from real use.

We considered using conditions instead of external remediation.
The existing design does not solve all the problems required for full external remediation that does not necessarily delete the Machine at the end of remediation, but it can be modified and expanded.

We could give up on the idea of a shared MHC and instead focus on providing a common library for deciding if a Node is healthy. This creates a situation where we could end up with baremetal, Master, and Ceph MHC variants that somehow need to duplicate or coordinate their activities.

## Upgrade Strategy
This is an opt-in feature supported by a new CRD and controller, so there is no need to handle upgrades for existing clusters.

The external remediation feature adds a new optional field to the MachineHealthCheck CRD, so existing CRs remain valid and function as they previously did.
The templates are all new and the feature is opt-in, so upgrades have no impact in this case.

## Additional Details

### Test Plan [optional]
Extensive unit testing for all the cases supported when remediating.

e2e testing as part of the cluster-api e2e test suite.

For fail-early testing, we could consider a test suite leveraging kubemark as a provider to simulate healthy/unhealthy nodes in a cloud agnostic manner, without the need to bring up a real instance.

[testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md

### Graduation Criteria [optional]
This proposes that the new CRD belong to the same API group as other cluster-api resources, e.g. machine, machineSet, and follow the same release cadence.

### Version Skew Strategy [optional]

## Implementation History
- [x] 10/30/2019: Opened proposal PR
- [x] 04/15/2020: Revised to move reconciliation to owner controller
- [x] 08/04/2020: Added external remediation

<!-- Links -->
[community meeting]: https://docs.google.com/document/d/1Ys-DOR5UsgbMEeciuG0HOgDQc8kZsaWIWJeKJ1-UfbY