sigs.k8s.io/cluster-api@v1.7.1/docs/proposals/20220414-runtime-hooks.md (about)

     1  ---
     2  title: Runtime Hooks for Add-on Management
     3  authors:
     4    - "@killianmuldoon"
     5    - "@ykakarap"
     6  reviewers:
     7    - "@vincepri"
     8    - "@CecileRobertMichon"
     9    - "@enxebre"
    10    - "@fabriziopandini"
    11    - "@sbueringer"
    12  creation-date: 2022-04-14
    13  last-updated: 2022-04-14
    14  status: implementable
    15  replaces:
    16  see-also:
    17  superseded-by:
    18  ---
    19  
    20  # Runtime Hooks for Add-on Management
    21  
    22  ## Table of Contents
    23  
    24  <!-- START doctoc generated TOC please keep comment here to allow auto update -->
    25  <!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
    26  
    27  - [Glossary](#glossary)
    28  - [Summary](#summary)
    29  - [Motivation](#motivation)
    30    - [Goals](#goals)
    31    - [Non-Goals](#non-goals)
    32  - [Proposal](#proposal)
    33    - [User Stories](#user-stories)
    34    - [Runtime hook definitions](#runtime-hook-definitions)
    35    - [Runtime Extensions developer guide](#runtime-extensions-developer-guide)
    36    - [Security Model](#security-model)
    37    - [Risks and Mitigations](#risks-and-mitigations)
    38      - [Runtime Extension blocks Cluster lifecycle indefinitely](#runtime-extension-blocks-cluster-lifecycle-indefinitely)
    39  - [Alternatives](#alternatives)
    40    - [External components watching CAPI resources without hooks](#external-components-watching-capi-resources-without-hooks)
    41    - [OpenAPI spec implementation alternatives](#openapi-spec-implementation-alternatives)
    42      - [Adding Cluster info to request vs providing only the Cluster name](#adding-cluster-info-to-request-vs-providing-only-the-cluster-name)
    43      - [Embedding CAPI types in request vs using runtime.RawExtension](#embedding-capi-types-in-request-vs-using-runtimerawextension)
    44  - [Upgrade strategy](#upgrade-strategy)
    45      - [Cluster API version upgrade](#cluster-api-version-upgrade)
    46      - [Kubernetes version upgrade](#kubernetes-version-upgrade)
    47  - [Additional Details](#additional-details)
    48    - [Test Plan](#test-plan)
    49    - [Graduation Criteria](#graduation-criteria)
    50    - [Version Skew Strategy](#version-skew-strategy)
    51  - [Implementation History](#implementation-history)
    52  
    53  <!-- END doctoc generated TOC please keep comment here to allow auto update -->
    54  
    55  ## Glossary
    56  
    57  Refer to the [Cluster API Book Glossary](https://cluster-api.sigs.k8s.io/reference/glossary.html).
    58  
    59  * Add-on: an application that extends the functionality of Kubernetes.
    60  
    61  
    62  ## Summary
    63  
    64  This proposal introduces a set of Runtime Hooks designed to provide the foundation for implementing add-on orchestration solutions on top of Cluster API.
    65  Because the hooks defined in this proposal model common events in the Cluster lifecycle, they could also be used for use cases beyond add-on
    66  management; to keep the scope focused and actionable, this document does not explore those options.
    67  
    68  ## Motivation
    69  
    70  Cluster Resource Set (CRS) is the current add-on management tool packaged with Cluster API, but many users rely on their own package management tool like helm, kapp, ArgoCD or flux, because those tools have a broad range of capabilities that are not currently available in CRS.
    71  
    72  The ability to orchestrate add-ons in line with events in the cluster lifecycle is becoming a requirement for many CAPI users, but in order to make this possible a mechanism to plug into the cluster lifecycle is required. This proposal introduces a set of Runtime Hooks designed to meet the need for add-on management orchestration including:
    73  
    74  * Operations for installing add-ons during the cluster provisioning workflow
    75  * Operations for upgrading add-ons during the cluster upgrade workflow
    76  * Operations for handling add-ons during the cluster deletion workflow
    77  
    78  Runtime Hooks enable the Cluster lifecycle to trigger these processes based on the current state of the cluster, either starting them after some state has been reached or waiting for them to complete before moving on with Cluster-wide operations such as Cluster creation or deletion.
    79  
    80  Once these hooks are in place, it will be possible to build a comprehensive add-on orchestration solution on top of Cluster API that can leverage external tools such as helm, kapp, ArgoCD, flux or eventually CRS as an alternative.
    81  
    82  ### Goals
    83  
    84  * Identify a set of Runtime Hooks that enable management of the entire add-on lifecycle 
    85  * Define the OpenAPI specification of these Runtime Hooks
    86  * Document when the corresponding Runtime Extensions are called
    87  * Provide guidelines for developers implementing a corresponding Runtime Extension
    88  
    89  ### Non-Goals
    90  
    91  * Define all possible Runtime Hooks in Cluster API; this proposal defines only the subset of hooks required for add-on orchestration.
    92  * Define a full add-on management solution or detailed steps for solving add-on related problems such as CPI migration from in-tree to out-of-tree; this proposal focuses only on providing the foundational capabilities to do so.
    93  
    94  ## Proposal
    95  
    96  This proposal adds a set of Runtime Hooks specifically designed for Clusters created from a ClusterClass, which therefore leverage the idea of managed topology.
    97  
    98  The main reason for this choice is that a managed topology has a set of capabilities required to implement lifecycle hooks:
    99  
   100  * A managed topology has the overarching view of the entire Cluster (vs other components in CAPI which are limited to controlling single resources or a subset of them).
   101  * A managed topology already has the capability to control all resources in a Cluster, thus making it possible to orchestrate lifecycle workflows such as upgrades.
   102  
   103  In practice, of the six lifecycle hooks introduced by this proposal, four cannot be implemented outside of the topology controller
   104  because there is no viable alternative point from which to call them (BeforeClusterCreate, BeforeClusterUpgrade, AfterControlPlaneUpgrade, and AfterClusterUpgrade).
   105  
   106  Also, by working in the topology controller it is possible to implement hooks that can block the Cluster from transitioning from one state to another, which is a capability
   107  required to properly orchestrate the add-on lifecycle.
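The blocking behavior described above can be sketched as follows. This is a minimal illustration, not the API defined by this proposal: the type `HookResponse`, its fields, and the helper `canProceed` are hypothetical names chosen for the example, and the actual hook definitions live in the Cluster API book.

```go
package main

import "fmt"

// HookResponse is a hypothetical, simplified response returned by a Runtime Extension.
type HookResponse struct {
	Status            string // "Success" or "Failure"
	Message           string
	RetryAfterSeconds int32 // > 0 asks the caller to hold the transition and call again later
}

// canProceed reports whether the lifecycle may move to the next state:
// only when every extension succeeded and none asked for a retry.
func canProceed(responses []HookResponse) (bool, string) {
	for _, r := range responses {
		if r.Status != "Success" {
			return false, "extension failed: " + r.Message
		}
		if r.RetryAfterSeconds > 0 {
			return false, fmt.Sprintf("blocked, retry in %ds: %s", r.RetryAfterSeconds, r.Message)
		}
	}
	return true, ""
}

func main() {
	responses := []HookResponse{
		{Status: "Success"},
		{Status: "Success", Message: "metrics database backup in progress", RetryAfterSeconds: 30},
	}
	ok, reason := canProceed(responses)
	fmt.Println(ok, reason)
}
```

In this sketch a single extension asking for a retry is enough to hold the transition, which mirrors the idea that the Cluster waits until all add-on operations tied to a hook have completed.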
   108  
   109  ### User Stories
   110  
   111  These user stories are based on a concrete example of an add-on, a metrics database, to illustrate the use of each Runtime Hook.
   112  
   113  As a developer of an add-on orchestration solution:
   114  
   115  * **Before a Cluster is Created** I want to automatically check if enough disk space is available for allocation to the cluster for persistent storage of collected metrics values.
   116  * **After the Control Plane is Initialized** I want to automatically install a metrics database and associated add-ons in the workload cluster.
   117  * **Before the Cluster is Upgraded** I want to install a new version of the metrics database with a new version of the custom metrics apiservice to interact directly with the Kubernetes apiserver.
   118  * **After the Control Plane is Upgraded** I want to automatically check that the new version of the custom metrics apiservice is working and correctly fulfilled by my metrics database.
   119  * **After the Cluster is Upgraded** I want to install new versions of metrics collectors to each upgraded node in the cluster.
   120  * **Before the Cluster is Deleted** I want to automatically back up persistent volumes used by the metrics database.
   121  
   122  ### Runtime hook definitions
   123  
   124  Below is a description of the Runtime Hooks introduced by this proposal.
   125  
   126  ![runtime-hooks](images/runtime-hooks/runtime-hooks.png)
   127  
   128  The remainder of this section has been moved to the Cluster API [book](../../docs/book/src/tasks/experimental-features/runtime-sdk/implement-lifecycle-hooks.md#definitions)
   129  to avoid duplication.
   130  
   131  ### Runtime Extensions developer guide
   132  
   133  This section has been moved to the Cluster API [book](../../docs/book/src/tasks/experimental-features/runtime-sdk/implement-lifecycle-hooks.md#guidelines)
   134  to avoid duplication.
   135  
   136  ### Security Model
   137  
   138  For the general Runtime Extension security model please refer to the [developer guide in the Runtime SDK proposal](https://github.com/kubernetes-sigs/cluster-api/blob/75b39db545ae439f4f6203b5e07496d3b0a6aa75/docs/proposals/20220221-runtime-SDK.md#security-model).
   139  
   140  ### Risks and Mitigations
   141  
   142  #### Runtime Extension blocks Cluster lifecycle indefinitely
   143  
   144  The Cluster lifecycle can be blocked indefinitely when a Runtime Extension either blocks or fails indefinitely. Mitigations:
   145  
   146  * Surface errors from the Runtime Extension that is blocking reconciliation in Conditions, so that the user can take the necessary action.
   147  * A Runtime Extension should be extensively unit and e2e tested to ensure it behaves as expected.
   148  * Users should be able to manually intervene and unblock the reconciliation.
   149  
   150  As future work, we will explore more options, such as circuit breakers and timeouts, to unblock reconciliation.
   151  
   152  
   153  ## Alternatives
   154  
   155  Alternatives to Runtime Hooks for comprehensive add-on management in Cluster API include:
   156  
   157  ### External components watching CAPI resources without hooks
   158  
   159  This is the current pattern used by Cluster Resource Set. The implementers can only react to changes on CAPI resources, e.g. a Cluster being created, but they have no control over the cluster lifecycle.
   160  
   161  This and similar solutions based on scripting or GitOps approaches are considered inadequate for comprehensive, integrated add-on management as they have limited insight into the state of the Cluster and cannot easily influence reconciliation based on add-on state.
   162  
   163  More details about why watching Cluster API resources without hooks is not considered a valid alternative can be found in the [Cluster Addon Proposal](https://docs.google.com/document/d/1TdbfXC2_Hhg0mH7-7hXcT1Gg8h6oXKrKbnJqbpFFvjw/edit).
   164  
   165  ### OpenAPI spec implementation alternatives
   166  
   167  For the implementation details of the OpenAPI spec, we considered the following alternative approaches:
   168  #### Adding Cluster info to request vs providing only the Cluster name
   169  
   170  In the proposed OpenAPI specification, the request type contains a full Cluster object. We considered an alternative of only including the Cluster name and namespace to reduce the size of the message. It was rejected based on the assumption that most extensions will require at least some additional information from the Cluster. Sending the full object reduces calls to the API server.
   171  
   172  #### Embedding CAPI types in request vs using runtime.RawExtension
   173  
   174  In the proposed OpenAPI specification we include the Cluster API object in requests. We considered using runtime.RawExtension in order to avoid having to bump the version of lifecycle hooks when bumping the version of the CAPI types. It was rejected as sending another version of the CAPI type via runtime.RawExtension would always be a breaking change. Embedding the type directly makes the version of the API used explicit.
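One way such a break might show up is sketched below. The example uses `json.RawMessage` from the Go standard library as a stand-in for runtime.RawExtension, and the type `ClusterV1`, its fields, and the `example/v2` payload are all hypothetical; the point is only that an untyped payload lets a schema change reach the extension without any change in the hook's own version.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ClusterV1 and its field names are hypothetical; they only illustrate version skew.
type ClusterV1 struct {
	APIVersion string `json:"apiVersion"`
	Version    string `json:"version"`
}

// RawRequest leaves the Cluster untyped; json.RawMessage stands in for runtime.RawExtension.
type RawRequest struct {
	Cluster json.RawMessage `json:"cluster"`
}

// decodeCluster is what an extension written against ClusterV1 would do with a raw payload.
func decodeCluster(payload []byte) (ClusterV1, error) {
	var req RawRequest
	if err := json.Unmarshal(payload, &req); err != nil {
		return ClusterV1{}, err
	}
	var c ClusterV1
	err := json.Unmarshal(req.Cluster, &c)
	return c, err
}

func main() {
	// A sender that moved to a newer (hypothetical) schema where the field was renamed.
	newer := []byte(`{"cluster":{"apiVersion":"example/v2","kubernetesVersion":"v1.29.0"}}`)

	c, err := decodeCluster(newer)
	// Decoding succeeds, but the field the extension relies on is silently empty.
	fmt.Printf("err=%v apiVersion=%q version=%q\n", err, c.APIVersion, c.Version)
}
```

With an embedded, typed object the same change would require a new hook version, so the mismatch is visible in the API rather than discovered at runtime.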
   175  
   176  ## Upgrade strategy
   177  
   178  #### Cluster API version upgrade
   179  
   180  This proposal does not affect the Cluster API upgrade strategy.
   181  
   182  If a new Cluster API version introduces a new Lifecycle Hook version, Runtime Extensions should be adapted to avoid issues when older Lifecycle Hook versions are eventually removed. For details about the deprecation rules please refer to the [Runtime SDK](https://github.com/kubernetes-sigs/cluster-api/blob/75b39db545ae439f4f6203b5e07496d3b0a6aa75/docs/proposals/20220221-runtime-SDK.md#runtime-sdk-rules-1).
   183  
   184  
   185  #### Kubernetes version upgrade
   186  
   187  This proposal does not affect the Cluster API cluster upgrade strategy.
   188  
   189  However, Runtime Extensions will be able to tap into the upgrade process at defined stages.
   190  
   191  ## Additional Details
   192  
   193  ### Test Plan
   194  
   195  While in the alpha phase, the Runtime Hooks are expected to have unit and integration tests covering the topology reconciliation with calls to Runtime Extensions.
   196  
   197  With the increasing adoption of this feature we expect E2E test coverage for topology reconciliation with a Runtime Extension generating Runtime Hook Responses.
   198  
   199  ### Graduation Criteria
   200  
   201  The main criterion for graduating this feature is adoption; further detail about graduation criteria will be added in future iterations of this document.
   202  
   203  ### Version Skew Strategy
   204  
   205  See [upgrade strategy](#upgrade-strategy).
   206  
   207  ## Implementation History
   208  
   209  * [x] 2022-03-29: Compiled a [CAEP Google Doc](https://docs.google.com/document/d/1vMwzGBi6XbIwKzP5aA7Mj9UdhAWKYqI-QmdDtcxWnA4)
   210  * [x] 2022-04-04: Opened corresponding [issue](https://github.com/kubernetes-sigs/cluster-api/issues/6374)
   211  * [x] 2022-04-06: Presented proposal at a [community meeting]
   212  * [x] 2022-04-14: Opened proposal PR
   213  
   214  <!-- Links -->
   215  [community meeting]: https://docs.google.com/document/d/1ushaVqAKYnZ2VN_aa3GyKlS4kEd6bSug13xaXOakAQI/edit#heading=h.pxsq37pzkbdq