---
title: Machine Deletion Phase Hooks
authors:
  - "@michaelgugino"
reviewers:
  - "@enxebre"
  - "@vincepri"
  - "@detiber"
  - "@ncdc"
creation-date: 2020-06-02
last-updated: 2020-08-07
status: implemented
---

# Machine Deletion Phase Hooks

## Table of Contents

<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->

- [Glossary](#glossary)
  - [lifecycle hook](#lifecycle-hook)
  - [deletion phase](#deletion-phase)
  - [Hook Implementing Controller (HIC)](#hook-implementing-controller-hic)
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals/Future Work](#non-goalsfuture-work)
- [Proposal](#proposal)
  - [User Stories](#user-stories)
    - [Story 1](#story-1)
    - [Story 2](#story-2)
  - [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints)
    - [Lifecycle Points](#lifecycle-points)
      - [pre-drain](#pre-drain)
      - [pre-terminate](#pre-terminate)
    - [Annotation Form](#annotation-form)
      - [lifecycle-point](#lifecycle-point)
      - [hook-name](#hook-name)
      - [owner (Optional)](#owner-optional)
      - [Annotation Examples](#annotation-examples)
    - [Changes to machine-controller](#changes-to-machine-controller)
      - [Reconciliation](#reconciliation)
      - [Hook failure](#hook-failure)
      - [Hook ordering](#hook-ordering)
    - [Hook Implementing Controller Design](#hook-implementing-controller-design)
      - [Hook Implementing Controllers must](#hook-implementing-controllers-must)
      - [Hook Implementing Controllers may](#hook-implementing-controllers-may)
    - [Determining when to take action](#determining-when-to-take-action)
      - [Failure Mode](#failure-mode)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Alternatives](#alternatives)
  - [Custom Machine Controller](#custom-machine-controller)
  - [Finalizers](#finalizers)
  - [Status Field](#status-field)
  - [Spec Field](#spec-field)
  - [CRDs](#crds)
- [Upgrade Strategy](#upgrade-strategy)
- [Additional Details](#additional-details)

<!-- END doctoc generated TOC please keep comment here to allow auto update -->

## Glossary

Refer to the [Cluster API Book Glossary](https://cluster-api.sigs.k8s.io/reference/glossary.html).

### lifecycle hook
A specific point in a machine's reconciliation lifecycle where execution of
normal machine-controller behavior is paused or modified.

### deletion phase
The period during which a machine has been marked for deletion but is still
present in the API.  Various actions happen during this phase, such as draining
a node, deleting an instance from a cloud provider, and deleting a node object.

### Hook Implementing Controller (HIC)
A Hook Implementing Controller is a controller, other than the
machine-controller, that adds, removes, and/or responds to a particular
lifecycle hook.  Each lifecycle hook should have a single HIC, but an HIC
may manage more than one hook.

## Summary

This proposal defines a set of annotations that can be applied to a machine and
that affect the linear progress of the machine's lifecycle after it has been
marked for deletion.  These annotations are optional and may be applied during
machine creation, or at any time afterwards by a user, another controller, or
another application.

## Motivation

Allow custom and third-party components to easily interact with a machine or
related resources while that machine's reconciliation is temporarily paused.
This pause in reconciliation allows these custom components to take action
after a machine has been marked for deletion, but before the machine is
drained and/or its associated instance is terminated.

### Goals

- Define an initial set of hook points for the deletion phase.
- Define an initial set and form of related annotations.
- Define basic expectations for a controller or process that responds to a
  lifecycle hook.


### Non-Goals/Future Work

- Create an exhaustive list of hooks; we can add more over time.
- Create new machine phases.
- Create a mechanism to signal what lifecycle point a machine is at currently.
- Dictate implementation of controllers that respond to the hooks.
- Implement ordering in the machine-controller.
- Require anyone to use these hooks for normal machine operations; they are
  strictly optional and for custom integrations only.


## Proposal

- Utilize annotations to implement lifecycle hooks.
- Each lifecycle point can have zero or more hooks.
- Hooks do not enforce ordering.
- Hooks found during machine reconciliation effectively pause reconciliation
  until all hooks for that lifecycle point are removed from the machine's
  annotations.


### User Stories
#### Story 1
(pre-terminate) As an operator, I would like to have the ability to perform
custom actions between the time a machine is marked deleted in the API and
the time the machine is deleted from the cloud.

For example, when replacing a control plane machine, ensure a new control
plane machine has been successfully created and joined to the cluster before
removing the instance of the deleted machine. This might be useful in case
there are disruptions during replacement and we need the disk of the existing
instance to perform some disaster recovery operation.  This will also prevent
prolonged periods of having one fewer control plane host in the event the
replacement instance does not come up in a timely manner.

#### Story 2
(pre-drain) As an operator, I want the ability to utilize my own draining
controller instead of the logic built into the machine-controller.  This will
give me more flexibility and control over the lifecycle of workloads on each
node.

### Implementation Details/Notes/Constraints

For each defined lifecycle point, one or more hooks may be applied as
annotations to the machine object.  These annotations pause reconciliation of
the machine object until all hooks for that lifecycle point are resolved.  The
hooks should be managed by a Hook Implementing Controller or other external
application, or manually created and removed by an administrator.

#### Lifecycle Points
##### pre-drain
`pre-drain.delete.hook.machine.cluster.x-k8s.io`

Hooks defined at this point prevent the machine-controller from draining a
node after the machine object has been marked for deletion, until the hooks
are removed.

##### pre-terminate
`pre-terminate.delete.hook.machine.cluster.x-k8s.io`

Hooks defined at this point prevent the machine-controller from
removing/terminating the instance in the cloud provider until the hooks are
removed.

"pre-terminate" has been chosen over "pre-delete" because "terminate" is more
easily associated with an instance being removed from the cloud or
infrastructure, whereas "delete" is ambiguous as to the actual state of the
machine in its lifecycle.


#### Annotation Form
```
<lifecycle-point>.delete.hook.machine.cluster.x-k8s.io/<hook-name>: <owner/creator>
```

##### lifecycle-point
The point in the machine's reconciliation lifecycle at which the annotation
takes effect and pauses the machine-controller.

##### hook-name
Each hook should have a unique, descriptive name that states in 1-3 words the
intent of or reason for the hook.  Each hook name should be managed by a single
entity.

##### owner (Optional)
Some information about who created or is otherwise in charge of managing the
annotation.  This might be a controller name, or a username to indicate that an
administrator applied the hook directly.

##### Annotation Examples

These examples are all hypothetical and illustrate the form annotations should
take.  The names of each hook and the respective controllers are fictional.

- `pre-drain.delete.hook.machine.cluster.x-k8s.io/migrate-important-app: my-app-migration-controller`
- `pre-terminate.delete.hook.machine.cluster.x-k8s.io/backup-files: my-backup-controller`
- `pre-terminate.delete.hook.machine.cluster.x-k8s.io/wait-for-storage-detach: my-custom-storage-detach-controller`

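To make the annotation form concrete, the following is a minimal sketch of how a key could be split into its lifecycle point and hook name. The helper `parseHookKey` is hypothetical and not part of Cluster API; it assumes the key prefixes listed under Lifecycle Points.

```go
package main

import (
	"fmt"
	"strings"
)

// hookDomain is the fixed part of every deletion hook key, as listed under
// "Lifecycle Points" in this proposal.
const hookDomain = ".delete.hook.machine.cluster.x-k8s.io"

// parseHookKey is a hypothetical helper that splits an annotation key of the
// form <lifecycle-point>.delete.hook.machine.cluster.x-k8s.io/<hook-name>.
// ok is false when the key is not a deletion hook annotation.
func parseHookKey(key string) (point, name string, ok bool) {
	slash := strings.Index(key, "/")
	if slash < 0 {
		return "", "", false
	}
	prefix, hookName := key[:slash], key[slash+1:]
	if !strings.HasSuffix(prefix, hookDomain) || hookName == "" {
		return "", "", false
	}
	lifecyclePoint := strings.TrimSuffix(prefix, hookDomain)
	if lifecyclePoint == "" {
		return "", "", false
	}
	return lifecyclePoint, hookName, true
}

func main() {
	key := "pre-terminate.delete.hook.machine.cluster.x-k8s.io/backup-files"
	point, name, ok := parseHookKey(key)
	fmt.Println(point, name, ok) // prints: pre-terminate backup-files true
}
```

The annotation value (the optional owner) plays no part in the parsing; it is informational only.
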
#### Changes to machine-controller
The machine-controller should check for the existence of one or more hooks at
specific points (lifecycle-points) during reconciliation.  If a hook matching
the lifecycle-point is discovered, the machine-controller should stop
reconciling the machine.

An example of where the pre-drain lifecycle-point might be implemented:
https://github.com/kubernetes-sigs/cluster-api/blob/30c377c0964efc789ab2f3f7361eb323003a7759/controllers/machine_controller.go#L270

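The check described above could look roughly like the sketch below. It is an illustration only, not the machine-controller's actual code: `blockingHooks` is a hypothetical helper, and annotations are modeled as a plain `map[string]string`.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// preDrainPrefix is the pre-drain lifecycle point from this proposal.
const preDrainPrefix = "pre-drain.delete.hook.machine.cluster.x-k8s.io"

// blockingHooks returns, in sorted order, every annotation key that belongs
// to the given lifecycle point. An empty result means the controller may
// proceed past that point.
func blockingHooks(annotations map[string]string, prefix string) []string {
	var hooks []string
	for key := range annotations {
		if strings.HasPrefix(key, prefix+"/") {
			hooks = append(hooks, key)
		}
	}
	sort.Strings(hooks)
	return hooks
}

func main() {
	annotations := map[string]string{
		"pre-drain.delete.hook.machine.cluster.x-k8s.io/migrate-important-app": "my-app-migration-controller",
		"pre-terminate.delete.hook.machine.cluster.x-k8s.io/backup-files":      "my-backup-controller",
	}
	if hooks := blockingHooks(annotations, preDrainPrefix); len(hooks) > 0 {
		fmt.Println("pre-drain blocked by:", hooks)
		return
	}
	fmt.Println("no pre-drain hooks, draining may proceed")
}
```

Returning the matching keys rather than a boolean makes it easy to log which hooks are holding a machine, which helps with troubleshooting.
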
##### Reconciliation
When a Hook Implementing Controller updates the machine, reconciliation is
triggered and the machine continues reconciling as normal, unless another
hook for the same lifecycle-point is still present.  There is no need to 'fail'
the reconciliation to enforce requeuing.

When all hooks for a given lifecycle-point are removed, reconciliation
continues as normal.

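The pause-and-resume behavior can be sketched as a single reconciliation pass that returns early, without an error, whenever a hook is present. The `machine` type and the `drained`/`terminated` flags below are simplified stand-ins chosen for illustration, not the real Machine API.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	preDrainPrefix     = "pre-drain.delete.hook.machine.cluster.x-k8s.io/"
	preTerminatePrefix = "pre-terminate.delete.hook.machine.cluster.x-k8s.io/"
)

// machine is a simplified stand-in for the Machine object.
type machine struct {
	annotations map[string]string
	drained     bool
	terminated  bool
}

// hasHook reports whether any annotation key starts with the given prefix.
func hasHook(m *machine, prefix string) bool {
	for key := range m.annotations {
		if strings.HasPrefix(key, prefix) {
			return true
		}
	}
	return false
}

// reconcileDelete runs one pass for a machine marked for deletion. When a
// hook is present it simply returns; the next update to the machine (for
// example, a hook being removed) triggers another pass.
func reconcileDelete(m *machine) string {
	if !m.drained {
		if hasHook(m, preDrainPrefix) {
			return "paused at pre-drain"
		}
		m.drained = true // placeholder for draining the node
	}
	if hasHook(m, preTerminatePrefix) {
		return "paused at pre-terminate"
	}
	m.terminated = true // placeholder for terminating the instance
	return "terminated"
}

func main() {
	m := &machine{annotations: map[string]string{
		preDrainPrefix + "migrate-important-app": "my-app-migration-controller",
		preTerminatePrefix + "backup-files":      "my-backup-controller",
	}}
	fmt.Println(reconcileDelete(m)) // paused at pre-drain
	delete(m.annotations, preDrainPrefix+"migrate-important-app")
	fmt.Println(reconcileDelete(m)) // paused at pre-terminate
	delete(m.annotations, preTerminatePrefix+"backup-files")
	fmt.Println(reconcileDelete(m)) // terminated
}
```
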
##### Hook failure
The machine-controller should not time out or otherwise consider the lifecycle
hook as 'failed'.  Only the Hook Implementing Controller may decide to remove a
particular lifecycle hook to allow the machine-controller to progress past the
corresponding lifecycle-point.

##### Hook ordering
The machine-controller will not attempt to enforce any ordering of hooks, and
no ordering should be expected of it.

Hook Implementing Controllers may choose to provide a mechanism to allow
ordering amongst themselves via whatever means HICs determine.  Examples could
be using CRDs external to the machine-api, gRPC communications, or
additional annotations on the machine or other objects.

#### Hook Implementing Controller Design
A Hook Implementing Controller is the component that manages a particular
lifecycle hook.

##### Hook Implementing Controllers must
* Watch machine objects and determine when an appropriate action must be taken.
* After completing the desired hook action, remove the hook annotation.

##### Hook Implementing Controllers may
* Watch machine objects and add a hook annotation as desired by the cluster
  administrator.
* Coordinate with other Hook Implementing Controllers through any means
  possible, such as using common annotations, CRDs, etc. For example, one hook
  controller could set an annotation indicating it has finished its work, and
  another hook controller could wait for the presence of the annotation before
  proceeding.

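The annotation-based coordination in the last bullet might be sketched as follows. The marker key `example.com/backup-files-done` and both functions are invented for illustration; the proposal does not prescribe any particular marker.

```go
package main

import "fmt"

const (
	backupHook = "pre-terminate.delete.hook.machine.cluster.x-k8s.io/backup-files"
	detachHook = "pre-terminate.delete.hook.machine.cluster.x-k8s.io/wait-for-storage-detach"
	// backupDoneKey is a hypothetical marker annotation agreed on by the
	// two controllers.
	backupDoneKey = "example.com/backup-files-done"
)

// finishBackup is what the backup controller does when its work is complete:
// it leaves a marker for other controllers and removes its own hook.
func finishBackup(annotations map[string]string) {
	annotations[backupDoneKey] = "true"
	delete(annotations, backupHook)
}

// tryDetach is the storage controller's step. It proceeds only once the
// backup marker is present, and reports whether it removed its hook.
func tryDetach(annotations map[string]string) bool {
	if annotations[backupDoneKey] != "true" {
		return false // backup not finished yet; keep waiting
	}
	// Detaching storage would happen here.
	delete(annotations, detachHook)
	return true
}

func main() {
	annotations := map[string]string{
		backupHook: "my-backup-controller",
		detachHook: "my-custom-storage-detach-controller",
	}
	fmt.Println(tryDetach(annotations)) // false: backup has not finished
	finishBackup(annotations)
	fmt.Println(tryDetach(annotations)) // true: marker present, hook removed
}
```
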
#### Determining when to take action

A Hook Implementing Controller should watch machines and determine the best
time to take action.

For example, if an HIC manages a lifecycle hook at the pre-drain lifecycle-point,
then that controller should take action immediately after a machine has a
DeletionTimestamp or enters the "Deleting" phase.

Fine-tuned coordination is not possible at this time; e.g., it's not
possible to execute a pre-terminate hook only after a node has been drained.
This is reserved for future work.

##### Failure Mode
It is entirely up to the Hook Implementing Controller to determine when it is
prudent to remove a particular lifecycle hook. Some controllers may want to
'give up' after a certain time period, and others may want to block indefinitely.
Cluster operators should consider the characteristics of each controller before
utilizing them in their clusters.


### Risks and Mitigations

* Annotation keys must conform to length limits: https://kubernetes.io/docs/concepts/overview/working-with-objects/annotations/#syntax-and-character-set
* Requires well-behaved controllers and admins to keep things running
  smoothly.  It would be easy to disrupt machines with poor configuration.
* Troubleshooting problems may increase in complexity, but this is
  mitigated mostly by the fact that these hooks are opt-in.  Operators
  will or should know they are consuming these hooks, but a future proliferation
  of the cluster-api could result in these components being bundled as a
  complete solution that operators just consume.  To this end, we should
  update any troubleshooting guides to check these hook points where possible.

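For the first risk, a hook name can be checked before it is applied. The sketch below assumes the limits given in the Kubernetes annotation syntax linked above (a prefix of at most 253 characters and a name segment of at most 63 characters that begins and ends with an alphanumeric character); `validateHookKey` is a hypothetical helper and does not verify that the prefix is a valid DNS subdomain.

```go
package main

import (
	"fmt"
	"regexp"
)

const (
	maxPrefixLen = 253 // limit on the part before the slash
	maxNameLen   = 63  // limit on the part after the slash
)

// nameRE matches a name segment: alphanumerics at both ends, with '-', '_'
// and '.' allowed in between.
var nameRE = regexp.MustCompile(`^[A-Za-z0-9]([-A-Za-z0-9_.]*[A-Za-z0-9])?$`)

// validateHookKey checks the length and character rules for a hook
// annotation key made of prefix + "/" + name.
func validateHookKey(prefix, name string) error {
	if len(prefix) == 0 || len(prefix) > maxPrefixLen {
		return fmt.Errorf("prefix must be 1-%d characters, got %d", maxPrefixLen, len(prefix))
	}
	if len(name) == 0 || len(name) > maxNameLen {
		return fmt.Errorf("hook name must be 1-%d characters, got %d", maxNameLen, len(name))
	}
	if !nameRE.MatchString(name) {
		return fmt.Errorf("hook name %q has invalid characters", name)
	}
	return nil
}

func main() {
	prefix := "pre-terminate.delete.hook.machine.cluster.x-k8s.io"
	fmt.Println(validateHookKey(prefix, "backup-files"))
	fmt.Println(validateHookKey(prefix, "-starts-with-dash"))
}
```

Because the lifecycle-point prefixes are short, the 63-character name segment is the limit hook authors are most likely to hit.
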

## Alternatives

### Custom Machine Controller
Require advanced users to fork and customize the machine-controller.  This can
already be done if someone chooses, so it is not much of a solution.

### Finalizers
We could define additional finalizers, but finalizers really only imply the
deletion lifecycle point.  A misbehaving controller that accidentally removes
finalizers could have undesirable effects.

### Status Field
A status field is harder for users to modify, and hooks could not easily be set
during machine creation.  How would a user remove a hook if the controller that
is supposed to remove it is misbehaving?  We'd probably need an annotation like
'skip-hook-xyz' or similar, and that seems redundant compared to just using
annotations in the first place.

### Spec Field
We probably don't want other controllers dynamically adding and removing spec
fields on an object.  It's not very declarative to utilize spec fields in that
way.

### CRDs
It seems we'd need to sync information to and from a CR.  There are different
approaches to CRDs (1-to-1 mapping of machine to CR, match labels,
present/absent vs. status fields), each of which has its own drawbacks and is
more complex to define and configure.


## Upgrade Strategy

Nothing defined here should directly impact upgrades, other than defining hooks that impact creation/deletion of a machine, generally.

## Additional Details

Fine-tuned timing of hooks is not possible at this time.

In the future, it may be possible to implement this timing via additional
machine phases, "sub-phases", or some other mechanism
that might be appropriate.  As stated in the non-goals, that is
not in scope at this time, and could be future work.  This is currently
being discussed in [issue 3365].

<!-- Links -->
[community meeting]: https://docs.google.com/document/d/1Ys-DOR5UsgbMEeciuG0HOgDQc8kZsaWIWJeKJ1-UfbY
[issue 3365]: https://github.com/kubernetes-sigs/cluster-api/issues/3365
   314  [community meeting]: https://docs.google.com/document/d/1Ys-DOR5UsgbMEeciuG0HOgDQc8kZsaWIWJeKJ1-UfbY
   315  [issue 3365]: https://github.com/kubernetes-sigs/cluster-api/issues/3365