---
title: Runtime Hooks for Add-on Management
authors:
  - "@killianmuldoon"
  - "@ykakarap"
reviewers:
  - "@vincepri"
  - "@CecileRobertMichon"
  - "@enxebre"
  - "@fabriziopandini"
  - "@sbueringer"
creation-date: 2022-04-14
last-updated: 2022-04-14
status: implementable
replaces:
see-also:
superseded-by:
---

# Runtime hooks for Add-on Management

## Table of Contents

<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->

- [Glossary](#glossary)
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories](#user-stories)
  - [Runtime hook definitions](#runtime-hook-definitions)
  - [Runtime Extensions developer guide](#runtime-extensions-developer-guide)
  - [Security Model](#security-model)
  - [Risks and Mitigations](#risks-and-mitigations)
    - [Runtime Extension blocks Cluster lifecycle indefinitely](#runtime-extension-blocks-cluster-lifecycle-indefinitely)
- [Alternatives](#alternatives)
  - [External components watching CAPI resources without hooks](#external-components-watching-capi-resources-without-hooks)
  - [OpenAPI spec implementation alternatives](#openapi-spec-implementation-alternatives)
    - [Adding Cluster info to request vs providing only the Cluster name](#adding-cluster-info-to-request-vs-providing-only-the-cluster-name)
    - [Embedding CAPI types in request vs using runtime.RawExtension](#embedding-capi-types-in-request-vs-using-runtimerawextension)
- [Upgrade strategy](#upgrade-strategy)
    - [Cluster API version upgrade](#cluster-api-version-upgrade)
    - [Kubernetes version upgrade](#kubernetes-version-upgrade)
- [Additional Details](#additional-details)
  - [Test Plan](#test-plan)
  - [Graduation Criteria](#graduation-criteria)
  - [Version Skew Strategy](#version-skew-strategy)
- [Implementation History](#implementation-history)

<!-- END doctoc generated TOC please keep comment here to allow auto update -->

## Glossary

Refer to the [Cluster API Book Glossary](https://cluster-api.sigs.k8s.io/reference/glossary.html).

* Add-on: an application that extends the functionality of Kubernetes.

## Summary

This proposal introduces a set of Runtime Hooks that provide the foundation for implementing add-on orchestration solutions on top of Cluster API.
Because the hooks defined in this proposal model common events in the Cluster lifecycle, they could also serve use cases beyond add-on
management; to keep the scope focused and actionable, this document does not explore those options.

## Motivation

Cluster Resource Set (CRS) is the current add-on management tool packaged with Cluster API, but many users rely on their own package management tool, such as helm, kapp, ArgoCD or flux, because those tools have a broad range of capabilities that are not currently available in CRS.

The ability to orchestrate add-ons in line with events in the cluster lifecycle is becoming a requirement for many CAPI users, but making this possible requires a mechanism to plug into the cluster lifecycle.
This proposal introduces a set of Runtime Hooks designed to meet the need for add-on management orchestration, including:

* Operations for installing add-ons during the cluster provisioning workflow
* Operations for upgrading add-ons during the cluster upgrade workflow
* Operations for handling add-ons during the cluster deletion workflow

Runtime Hooks let the Cluster lifecycle trigger these processes based on the current state of the cluster: a process can start once a given state has been reached, and the lifecycle can wait for it to complete before moving on with Cluster-wide operations such as Cluster creation or deletion.

Once these hooks are in place, it will be possible to build a comprehensive add-on orchestration solution on top of Cluster API that can leverage external tools such as helm, kapp, ArgoCD, flux or eventually CRS as an alternative.

### Goals

* Identify a set of Runtime Hooks that enable management of the entire add-on lifecycle
* Define the OpenAPI specification of these Runtime Hooks
* Document when the corresponding Runtime Extensions are called
* Provide guidelines for developers implementing a corresponding Runtime Extension

### Non-Goals

* Defining all possible Runtime Hooks in Cluster API; this proposal defines only a subset of hooks required for add-on orchestration.
* Defining a full add-on management solution or detailed steps for solving add-on related problems like CPI migration from in-tree to out-of-tree; this proposal focuses only on providing the foundational capabilities to do so.

## Proposal

This proposal adds a set of Runtime Hooks specifically designed for Clusters created from a ClusterClass, which therefore leverage the idea of managed topology.

The main reason for this choice is that a managed topology has a set of capabilities required to implement lifecycle hooks:

* A managed topology has the overarching view of the entire Cluster (vs other components in CAPI which are limited to controlling single resources or a subset of them).
* A managed topology already has the capability to control all resources in a Cluster, thus making it possible to orchestrate lifecycle workflows such as upgrades.

In practice, four of the six lifecycle hooks introduced by this proposal (BeforeClusterCreate, BeforeClusterUpgrade, AfterControlPlaneUpgrade, and AfterClusterUpgrade)
cannot be implemented outside of the topology controller, because no other component offers a viable point at which to call them.

Also, working in the topology controller makes it possible to implement hooks that block the cluster from transitioning from one state to another, a capability
required to properly orchestrate the add-on lifecycle.

### User Stories

These user stories are based on a concrete example of an add-on, a metrics database, to illustrate the use of each Runtime Hook.

As a developer of an add-on orchestration solution:

* **Before a Cluster is Created** I want to automatically check if enough disk space is available for allocation to the cluster for persistent storage of collected metrics values.
* **After the Control Plane is Initialized** I want to automatically install a metrics database and associated add-ons in the workload cluster.
* **Before the Cluster is Upgraded** I want to install a new version of the metrics database with a new version of the custom metrics apiservice to interact directly with the Kubernetes apiserver.
* **After the Control Plane is Upgraded** I want to automatically check that the new version of the custom metrics apiservice is working and correctly fulfilled by my metrics database.
* **After the Cluster is Upgraded** I want to install new versions of metrics collectors to each upgraded node in the cluster.
* **Before the Cluster is Deleted** I want to automatically back up persistent volumes used by the metrics database.

### Runtime hook definitions

Below is a description of the Runtime Hooks introduced by this proposal.

The remainder of this section has been moved to the Cluster API [book](../../docs/book/src/tasks/experimental-features/runtime-sdk/implement-lifecycle-hooks.md#definitions)
to avoid duplication.

### Runtime Extensions developer guide

This section has been moved to the Cluster API [book](../../docs/book/src/tasks/experimental-features/runtime-sdk/implement-lifecycle-hooks.md#guidelines)
to avoid duplication.

### Security Model

For the general Runtime Extension security model please refer to the [developer guide in the Runtime SDK proposal](https://github.com/kubernetes-sigs/cluster-api/blob/75b39db545ae439f4f6203b5e07496d3b0a6aa75/docs/proposals/20220221-runtime-SDK.md#security-model).

### Risks and Mitigations

#### Runtime Extension blocks Cluster lifecycle indefinitely

The Cluster lifecycle can be blocked indefinitely when a Runtime Extension either blocks or fails indefinitely. Mitigations:

* Surface errors from the Runtime Extension that is blocking reconciliation in Conditions, so that the user can take the necessary action.
* A Runtime Extension should be extensively unit and e2e tested to ensure it behaves as expected.
* Users should be able to manually intervene and unblock the reconciliation.

As future work, we will explore more options, such as circuit breakers and timeouts, to unblock reconciliation.

## Alternatives

Alternatives to Runtime Hooks for comprehensive add-on management in Cluster API include:

### External components watching CAPI resources without hooks

This is the current pattern used by Cluster Resource Set. The implementers can only react to changes on CAPI resources, e.g. a Cluster being created, but they have no control over the cluster lifecycle.

This and similar solutions based on scripting or GitOps approaches are considered inadequate for comprehensive, integrated add-on management, as they have limited insight into the state of the Cluster and cannot easily influence reconciliation based on add-on state.

More details about why watching Cluster API resources without hooks is not considered a valid alternative can be found in the [Cluster Addon Proposal](https://docs.google.com/document/d/1TdbfXC2_Hhg0mH7-7hXcT1Gg8h6oXKrKbnJqbpFFvjw/edit).

### OpenAPI spec implementation alternatives

For the implementation details of the OpenAPI spec, we considered the following alternative approaches:

#### Adding Cluster info to request vs providing only the Cluster name

In the proposed OpenAPI specification, the request type carries a full Cluster object. We considered an alternative of only including the Cluster name and namespace to reduce the size of the message. It was rejected based on the assumption that most extensions will require at least some additional information from the Cluster. Sending the full object reduces calls to the API server.

#### Embedding CAPI types in request vs using runtime.RawExtension

In the proposed OpenAPI specification we are including the Cluster API object in requests. We considered using runtime.RawExtension in order to avoid having to bump the version of lifecycle hooks when bumping the version of the CAPI types. It was rejected because sending another version of the CAPI type via runtime.RawExtension would always be a breaking change.
Embedding the type directly makes the version of the API used explicit.

## Upgrade strategy

#### Cluster API version upgrade

This proposal does not affect the Cluster API upgrade strategy.

If a new Cluster API version introduces a new Lifecycle Hook version, Runtime Extensions should be adapted to avoid issues when older Lifecycle Hook versions are eventually removed. For details about the deprecation rules please refer to the [Runtime SDK](https://github.com/kubernetes-sigs/cluster-api/blob/75b39db545ae439f4f6203b5e07496d3b0a6aa75/docs/proposals/20220221-runtime-SDK.md#runtime-sdk-rules-1).

#### Kubernetes version upgrade

This proposal does not affect the Cluster API cluster upgrade strategy.

However, Runtime Extensions will be able to tap into the upgrade process at defined stages.

## Additional Details

### Test Plan

While in the alpha phase, the Runtime Hooks are expected to have unit and integration tests covering the topology reconciliation with calls to Runtime Extensions.

As adoption of this feature increases, we expect E2E test coverage for topology reconciliation with a Runtime Extension generating Runtime Hook Responses.

### Graduation Criteria

The main criterion for graduating this feature is adoption; further detail about graduation criteria will be added in future iterations of this document.

### Version Skew Strategy

See [upgrade strategy](#upgrade-strategy).

## Implementation History

* [x] 2022-03-29: Compiled a [CAEP Google Doc](https://docs.google.com/document/d/1vMwzGBi6XbIwKzP5aA7Mj9UdhAWKYqI-QmdDtcxWnA4)
* [x] 2022-04-04: Opened corresponding [issue](https://github.com/kubernetes-sigs/cluster-api/issues/6374)
* [x] 2022-04-06: Presented proposal at a [community meeting]
* [x] 2022-04-14: Opened proposal PR

<!-- Links -->
[community meeting]: https://docs.google.com/document/d/1ushaVqAKYnZ2VN_aa3GyKlS4kEd6bSug13xaXOakAQI/edit#heading=h.pxsq37pzkbdq