---
title: Machine Deletion Phase Hooks
authors:
  - "@michaelgugino"
reviewers:
  - "@enxebre"
  - "@vincepri"
  - "@detiber"
  - "@ncdc"
creation-date: 2020-06-02
last-updated: 2020-08-07
status: implemented
---

# Machine Deletion Phase Hooks

## Table of Contents

<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->

- [Glossary](#glossary)
  - [lifecycle hook](#lifecycle-hook)
  - [deletion phase](#deletion-phase)
  - [Hook Implementing Controller (HIC)](#hook-implementing-controller-hic)
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals/Future Work](#non-goalsfuture-work)
- [Proposal](#proposal)
  - [User Stories](#user-stories)
    - [Story 1](#story-1)
    - [Story 2](#story-2)
  - [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints)
    - [Lifecycle Points](#lifecycle-points)
      - [pre-drain](#pre-drain)
      - [pre-terminate](#pre-terminate)
    - [Annotation Form](#annotation-form)
      - [lifecycle-point](#lifecycle-point)
      - [hook-name](#hook-name)
      - [owner (Optional)](#owner-optional)
      - [Annotation Examples](#annotation-examples)
    - [Changes to machine-controller](#changes-to-machine-controller)
      - [Reconciliation](#reconciliation)
      - [Hook failure](#hook-failure)
      - [Hook ordering](#hook-ordering)
    - [Hook Implementing Controller Design](#hook-implementing-controller-design)
      - [Hook Implementing Controllers must](#hook-implementing-controllers-must)
      - [Hook Implementing Controllers may](#hook-implementing-controllers-may)
    - [Determining when to take action](#determining-when-to-take-action)
      - [Failure Mode](#failure-mode)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Alternatives](#alternatives)
  - [Custom Machine Controller](#custom-machine-controller)
  - [Finalizers](#finalizers)
  - [Status Field](#status-field)
  - [Spec Field](#spec-field)
  - [CRDs](#crds)
- [Upgrade Strategy](#upgrade-strategy)
- [Additional Details](#additional-details)

<!-- END doctoc generated TOC please keep comment here to allow auto update -->

## Glossary

Refer to the [Cluster API Book Glossary](https://cluster-api.sigs.k8s.io/reference/glossary.html).

### lifecycle hook
A specific point in a machine's reconciliation lifecycle where execution of
normal machine-controller behavior is paused or modified.

### deletion phase
Describes when a machine has been marked for deletion but is still present
in the API. Various actions happen during this phase, such as draining a node,
deleting an instance from a cloud provider, and deleting a node object.

### Hook Implementing Controller (HIC)
The Hook Implementing Controller describes a controller, other than the
machine-controller, that adds, removes, and/or responds to a particular
lifecycle hook. Each lifecycle hook should have a single HIC, but an HIC
can optionally manage one or more hooks.

## Summary

Defines a set of annotations that can be applied to a machine which affect the
linear progress of a machine’s lifecycle after a machine has been marked for
deletion. These annotations are optional and may be applied during machine
creation, sometime after machine creation by a user, or sometime after machine
creation by another controller or application.

## Motivation

Allow custom and 3rd party components to easily interact with a machine or
related resources while that machine's reconciliation is temporarily paused.
This pause in reconciliation will allow these custom components to take action
after a machine has been marked for deletion, but prior to the machine being
drained and/or associated instance terminated.

### Goals

- Define an initial set of hook points for the deletion phase.
- Define an initial set and form of related annotations.
- Define basic expectations for a controller or process that responds to a
  lifecycle hook.


### Non-Goals/Future Work

- Create an exhaustive list of hooks; we can add more over time.
- Create new machine phases.
- Create a mechanism to signal what lifecycle point a machine is at currently.
- Dictate implementation of controllers that respond to the hooks.
- Implement ordering in the machine-controller.
- Require anyone to use these hooks for normal machine operations; these are
  strictly optional and for custom integrations only.


## Proposal

- Utilize annotations to implement lifecycle hooks.
- Each lifecycle point can have 0 or more hooks.
- Hooks do not enforce ordering.
- Hooks found during machine reconciliation effectively pause reconciliation
  until all hooks for that lifecycle point are removed from a machine's annotations.


### User Stories
#### Story 1
(pre-terminate) As an operator, I would like to have the ability to perform
different actions between the time a machine is marked deleted in the API and
the time the machine is deleted from the cloud.

For example, when replacing a control plane machine, ensure a new control
plane machine has been successfully created and joined to the cluster before
removing the instance of the deleted machine. This might be useful in case
there are disruptions during replacement and we need the disk of the existing
instance to perform some disaster recovery operation. This will also prevent
prolonged periods of having one fewer control plane host in the event the
replacement instance does not come up in a timely manner.

#### Story 2
(pre-drain) As an operator, I want the ability to utilize my own draining
controller instead of the logic built into the machine-controller. This will
allow me better flexibility and control over the lifecycle of workloads on each
node.

### Implementation Details/Notes/Constraints

For each defined lifecycle point, one or more hooks may be applied as an
annotation to the machine object. These annotations will pause reconciliation
of a machine object until all hooks are resolved for that lifecycle point. The
hooks should be managed by a Hook Implementing Controller or other external
application, or manually created and removed by an administrator.

#### Lifecycle Points
##### pre-drain
`pre-drain.delete.hook.machine.cluster.x-k8s.io`

Hooks defined at this point will prevent the machine-controller from draining
a node after the machine object has been marked for deletion until the hooks
are removed.

##### pre-terminate
`pre-terminate.delete.hook.machine.cluster.x-k8s.io`

Hooks defined at this point will prevent the machine-controller from
removing/terminating the instance in the cloud provider until the hooks are
removed.

"pre-terminate" has been chosen over "pre-delete" because "terminate" is more
easily associated with an instance being removed from the cloud or
infrastructure, whereas "delete" is ambiguous as to the actual state of the
machine in its lifecycle.


#### Annotation Form
```
<lifecycle-point>.delete.hook.machine.cluster.x-k8s.io/<hook-name>: <owner/creator>
```

##### lifecycle-point
This is the point in the machine's reconciliation lifecycle at which the
annotation takes effect and pauses the machine-controller.

##### hook-name
Each hook should have a unique and descriptive name that describes in 1-3 words what the intent/reason for the hook is.
Each hook name should be unique and managed by a single entity.

##### owner (Optional)
Some information about who created or is otherwise in charge of managing the
annotation. This might be a controller or a username to indicate an
administrator applied the hook directly.

##### Annotation Examples

These examples are all hypothetical to illustrate what form annotations should
take. The names of each hook and the respective controllers are fictional.

`pre-drain.delete.hook.machine.cluster.x-k8s.io/migrate-important-app: my-app-migration-controller`

`pre-terminate.delete.hook.machine.cluster.x-k8s.io/backup-files: my-backup-controller`

`pre-terminate.delete.hook.machine.cluster.x-k8s.io/wait-for-storage-detach: my-custom-storage-detach-controller`

#### Changes to machine-controller
The machine-controller should check for the existence of 1 or more hooks at
specific points (lifecycle-points) during reconciliation. If a hook matching
the lifecycle-point is discovered, the machine-controller should stop
reconciling the machine.

An example of where the pre-drain lifecycle-point might be implemented:
https://github.com/kubernetes-sigs/cluster-api/blob/30c377c0964efc789ab2f3f7361eb323003a7759/controllers/machine_controller.go#L270

##### Reconciliation
When a Hook Implementing Controller updates the machine, reconciliation will be
triggered, and the machine will continue reconciling as normal, unless another
hook is still present; there is no need to 'fail' the reconciliation to
enforce requeuing.

When all hooks for a given lifecycle-point are removed, reconciliation
will continue as normal.

##### Hook failure
The machine-controller should not time out or otherwise consider the lifecycle
hook as 'failed.'
Only the Hook Implementing Controller may decide to remove a
particular lifecycle hook to allow the machine-controller to progress past the
corresponding lifecycle-point.

##### Hook ordering
The machine-controller will not attempt to enforce any ordering of hooks. No
ordering should be expected by the machine-controller.

Hook Implementing Controllers may choose to provide a mechanism to allow
ordering amongst themselves via whatever means HICs determine. Examples could
be using CRDs external to the machine-api, gRPC communications, or
additional annotations on the machine or other objects.

#### Hook Implementing Controller Design
A Hook Implementing Controller is the component that manages a particular
lifecycle hook.

##### Hook Implementing Controllers must
* Watch machine objects and determine when an appropriate action must be taken.
* After completing the desired hook action, remove the hook annotation.

##### Hook Implementing Controllers may
* Watch machine objects and add a hook annotation as desired by the cluster
  administrator.
* Coordinate with other Hook Implementing Controllers through any means
  possible, such as using common annotations, CRDs, etc. For example, one hook
  controller could set an annotation indicating it has finished its work, and
  another hook controller could wait for the presence of the annotation before
  proceeding.

#### Determining when to take action

A Hook Implementing Controller should watch machines and determine the
best time to take action.

For example, if an HIC manages a lifecycle hook at the pre-drain lifecycle-point,
then that controller should take action immediately after a machine has a
DeletionTimestamp or enters the "Deleting" phase.
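
The pre-drain example above can be sketched in Go as follows. This is a minimal illustration, not the machine-controller's or any real HIC's implementation: the `machine` struct is a hypothetical stand-in for a Machine object's metadata, and `shouldAct`, `releaseHook`, `blocked`, and `markDeleted` are names invented for this sketch. Only the annotation prefix is taken from the Lifecycle Points section.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// preDrainPrefix is the pre-drain lifecycle-point key from this proposal.
const preDrainPrefix = "pre-drain.delete.hook.machine.cluster.x-k8s.io"

// machine is a hypothetical stand-in for the metadata of a Machine object.
type machine struct {
	Annotations       map[string]string
	DeletionTimestamp *time.Time
}

// markDeleted simulates the API server marking the machine for deletion.
func markDeleted(m *machine) {
	now := time.Now()
	m.DeletionTimestamp = &now
}

// shouldAct reports whether the HIC that owns hookName should run its
// pre-drain action: the machine is marked for deletion and the hook
// annotation is still present.
func shouldAct(m machine, hookName string) bool {
	if m.DeletionTimestamp == nil {
		return false
	}
	_, present := m.Annotations[preDrainPrefix+"/"+hookName]
	return present
}

// releaseHook removes the hook annotation once the HIC has finished its work.
func releaseHook(m *machine, hookName string) {
	delete(m.Annotations, preDrainPrefix+"/"+hookName)
}

// blocked mirrors the machine-controller side: reconciliation stays paused
// while any annotation for the given lifecycle-point remains.
func blocked(m machine, prefix string) bool {
	for key := range m.Annotations {
		if strings.HasPrefix(key, prefix+"/") {
			return true
		}
	}
	return false
}

func main() {
	m := machine{Annotations: map[string]string{
		preDrainPrefix + "/migrate-important-app": "my-app-migration-controller",
	}}
	fmt.Println(shouldAct(m, "migrate-important-app")) // not yet marked for deletion

	markDeleted(&m)
	fmt.Println(shouldAct(m, "migrate-important-app"), blocked(m, preDrainPrefix))

	releaseHook(&m, "migrate-important-app")
	fmt.Println(blocked(m, preDrainPrefix)) // drain may now proceed
}
```

In a real controller the annotation removal would be an update or patch against the API server, and that update is what triggers the machine-controller to reconcile again.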

Fine-tuned coordination is not possible at this time; e.g., it's not
possible to execute a pre-terminate hook only after a node has been drained.
This is reserved for future work.

##### Failure Mode
It is entirely up to the Hook Implementing Controller to determine when it is
prudent to remove a particular lifecycle hook. Some controllers may want to
'give up' after a certain time period, and others may want to block indefinitely.
Cluster operators should consider the characteristics of each controller before
utilizing them in their clusters.


### Risks and Mitigations

* Annotation keys must conform to length limits: https://kubernetes.io/docs/concepts/overview/working-with-objects/annotations/#syntax-and-character-set
* Requires well-behaved controllers and admins to keep things running
  smoothly. It would be easy to disrupt machines with poor configuration.
* Troubleshooting problems may increase in complexity, but this is
  mitigated mostly by the fact that these hooks are opt-in. Operators
  will or should know they are consuming these hooks, but a future proliferation
  of the cluster-api could result in these components being bundled as a
  complete solution that operators just consume. To this end, we should
  update any troubleshooting guides to check these hook points where possible.


## Alternatives

### Custom Machine Controller
Require advanced users to fork and customize. This can already be done if
someone chooses, so it is not much of a solution.

### Finalizers
We define additional finalizers, but this really only implies the deletion
lifecycle point. A misbehaving controller that accidentally removes finalizers
could have undesirable effects.

### Status Field
Harder for users to modify or set hooks during machine creation. How would a user remove a hook if a controller that is supposed to remove it is misbehaving?
We’d probably need an annotation like ‘skip-hook-xyz’ or similar, and that
seems redundant to just using annotations in the first place.

### Spec Field
We probably don’t want other controllers dynamically adding and removing spec
fields on an object. It’s not very declarative to utilize spec fields in that
way.

### CRDs
Seems like we’d need to sync information to and from a CR. There are different
approaches to CRDs (1-to-1 mapping machine to CR, match labels, present/absent
vs status fields) that each have their own drawbacks and are more complex to
define and configure.


## Upgrade Strategy

Nothing defined here should directly impact upgrades other than defining hooks
that impact creation/deletion of a machine, generally.

## Additional Details

Fine-tuned timing of hooks is not possible at this time.

In the future, it is possible to implement this timing via additional
machine phases, or possible "sub-phases" or some other mechanism
that might be appropriate. As stated in the non-goals, that is
not in scope at this time, and could be future work. This is currently
being discussed in [issue 3365].

<!-- Links -->
[community meeting]: https://docs.google.com/document/d/1Ys-DOR5UsgbMEeciuG0HOgDQc8kZsaWIWJeKJ1-UfbY
[issue 3365]: https://github.com/kubernetes-sigs/cluster-api/issues/3365