---
title: Managed Kubernetes in CAPI
authors:
  - "@pydctw"
  - "@richardcase"
reviewers:
  - "@alexeldeib"
  - "@CecileRobertMichon"
  - "@enxebre"
  - "@fabriziopandini"
  - "@jackfrancis"
  - "@joekr"
  - "@sbueringer"
  - "@shyamradhakrishnan"
  - "@yastij"
creation-date: 2022-07-25
last-updated: 2023-06-15
status: implementable
see-also: ./20230407-flexible-managed-k8s-endpoints.md
replaces:
superseded-by:
---
    23  
    24  # Managed Kubernetes in CAPI
    25  
    26  ## Table of Contents
    27  
    28  <!-- START doctoc generated TOC please keep comment here to allow auto update -->
    29  <!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
    30  
    31  - [Glossary](#glossary)
    32  - [Summary](#summary)
    33  - [Motivation](#motivation)
    34    - [Goals](#goals)
    35    - [Non-Goals/Future Work](#non-goalsfuture-work)
    36  - [Proposal](#proposal)
    37    - [Personas](#personas)
    38      - [Cluster Service Provider](#cluster-service-provider)
    39      - [Cluster Service Consumer](#cluster-service-consumer)
    40      - [Cluster Admin](#cluster-admin)
    41      - [Cluster User](#cluster-user)
    42    - [User Stories](#user-stories)
    43      - [Story 1](#story-1)
    44      - [Story 2](#story-2)
    45      - [Story 3](#story-3)
    46      - [Story 4](#story-4)
    47      - [Story 5](#story-5)
    48      - [Story 6](#story-6)
    49    - [Current State of Managed Kubernetes in CAPI](#current-state-of-managed-kubernetes-in-capi)
    50      - [EKS in CAPA](#eks-in-capa)
    51      - [AKS in CAPZ](#aks-in-capz)
    52      - [OKE in CAPOCI](#oke-in-capoci)
    53      - [GKE in CAPG](#gke-in-capg)
    54    - [Managed Kubernetes API Design Approaches](#managed-kubernetes-api-design-approaches)
    55      - [Option 1: Two kinds with a ControlPlane and a pass-through InfraCluster](#option-1-two-kinds-with-a-controlplane-and-a-pass-through-infracluster)
    56      - [Option 2: Just a ControlPlane kind and no InfraCluster](#option-2-just-a-controlplane-kind-and-no-infracluster)
    57      - [Option 3: Two kinds with a Managed Control Plane and Managed Infra Cluster with Better Separation of Responsibilities](#option-3-two-kinds-with-a-managed-control-plane-and-managed-infra-cluster-with-better-separation-of-responsibilities)
    58  - [Recommendations](#recommendations)
    59    - [Vanilla Managed Kubernetes (i.e. without any additional infrastructure)](#vanilla-managed-kubernetes-ie-without-any-additional-infrastructure)
    60    - [Existing Managed Kubernetes Implementations](#existing-managed-kubernetes-implementations)
    61    - [Additional notes on option 3](#additional-notes-on-option-3)
    62    - [Managed Node Groups for Worker Nodes](#managed-node-groups-for-worker-nodes)
    63    - [Provider Implementers Documentation](#provider-implementers-documentation)
    64  - [Other Considerations for CAPI](#other-considerations-for-capi)
    65    - [ClusterClass support for MachinePool](#clusterclass-support-for-machinepool)
    66    - [clusterctl integration](#clusterctl-integration)
    67    - [Add-ons management](#add-ons-management)
    68  - [Alternatives](#alternatives)
    69    - [Alternative 1: Single kind for Control Plane and Infrastructure](#alternative-1-single-kind-for-control-plane-and-infrastructure)
    70      - [Background: Why did EKS in CAPA choose this option?](#background-why-did-eks-in-capa-choose-this-option)
    71    - [Alternative 2: Two kinds with a Managed Control Plane and Shared Infra Cluster with Better Separation of Responsibilities](#alternative-2-two-kinds-with-a-managed-control-plane-and-shared-infra-cluster-with-better-separation-of-responsibilities)
    72  - [Upgrade Strategy](#upgrade-strategy)
    73  - [Implementation History](#implementation-history)
    74  
    75  <!-- END doctoc generated TOC please keep comment here to allow auto update -->
    76  
    77  ## Glossary
    78  
    79  - **Managed Kubernetes** - a Kubernetes service offered/hosted by a service provider where the control plane is run & managed by the service provider. As a cluster service consumer, you don’t have to worry about managing/operating the control plane machines. Additionally, the managed Kubernetes service may extend to cover running managed worker nodes. Examples are EKS in AWS and AKS in Azure. This is different from a traditional implementation in Cluster API, where the control plane and worker nodes are deployed and managed by the cluster admin.
    80  - **Unmanaged Kubernetes** - a Kubernetes cluster where a cluster admin is responsible for provisioning and operating the control plane and worker nodes. In Cluster API this traditionally means a Kubeadm bootstrapped cluster on infrastructure machines (virtual or physical).
    81  - **Managed Worker Node** - an individual Kubernetes worker node where the underlying compute (vm or bare-metal) is provisioned and managed by the service provider. This usually includes the joining of the newly provisioned node into a Managed Kubernetes cluster. The lifecycle is normally controlled via a higher level construct such as a Managed Node Group.
    82  - **Managed Node Group** - is a service that a service provider offers that automates the provisioning of managed worker nodes. Depending on the service provider this group of nodes could contain a fixed number of replicas or it might contain a dynamic pool of replicas that auto-scales up and down. Examples are Node Pools in GCP and EKS managed node groups.
    83  - **Cluster Infrastructure Provider (Infrastructure)** - an Infrastructure provider supplies whatever prerequisites are necessary for creating & running clusters such as networking, load balancers, firewall rules, and so on. ([docs](../book/src/developer/providers/cluster-infrastructure.md))
    84  - **ControlPlane Provider (ControlPlane)** - a control plane provider instantiates a Kubernetes control plane consisting of k8s control plane components such as kube-apiserver, etcd, kube-scheduler and kube-controller-manager. ([docs](../book/src/developer/architecture/controllers/control-plane.md#control-plane-provider))
    85  - **MachineDeployment** - a MachineDeployment orchestrates deployments over a fleet of MachineSets, which is an immutable abstraction over Machines. ([docs](../book/src/developer/architecture/controllers/machine-deployment.md))
    86  - **MachinePool (experimental)** - a MachinePool is similar to a MachineDeployment in that they both define configuration and policy for how a set of machines are managed. While the MachineDeployment uses MachineSets to orchestrate updates to the Machines, MachinePool delegates the responsibility to a cloud provider specific resource such as AWS Auto Scale Groups, GCP Managed Instance Groups, and Azure Virtual Machine Scale Sets. ([docs](./20190919-machinepool-api.md))
    87  
    88  ## Summary
    89  
This proposal discusses various options for how a managed Kubernetes service could be represented in Cluster API by providers. Recommendations will be made on which approach(es) to adopt for new provider implementations, with a view to eventually having consistency across provider implementations.
    91  
    92  ## Motivation
    93  
    94  Cluster API was originally designed with unmanaged Kubernetes clusters in mind as the cloud providers did not offer managed Kubernetes services (except GCP with GKE). However, all 3 main cloud providers (and many other cloud/service providers) now have managed Kubernetes services.
    95  
    96  Some Cluster API Providers (i.e. Azure with AKS first and then AWS with EKS) have implemented support for their managed Kubernetes services. These implementations have followed the existing documentation & contracts (that were designed for unmanaged Kubernetes) and have ended up with 2 different implementations.
    97  
    98  While working on supporting ClusterClass for EKS in Cluster API Provider AWS (CAPA), it was discovered that the current implementation of EKS within CAPA, where a single resource kind (AWSManagedControlPlane) is used for both ControlPlane and Infrastructure, is incompatible with other parts of CAPI assuming the two objects are different (Reference [issue here](https://github.com/kubernetes-sigs/cluster-api/issues/6126)).
    99  
Separation of ControlPlane and Infrastructure is expected for the ClusterClass implementation to work correctly. However, once the changes documented in the [Contract Changes to Support Managed Kubernetes CAEP](./20230407-flexible-managed-k8s-endpoints.md) have been implemented, there is the option to supply only the control plane, but you still cannot supply the same resource for both.
   101  
The responsibilities between the CAPI control plane and infrastructure are blurred with a managed Kubernetes service like AKS or EKS. For example, when you create an EKS control plane in AWS, it also creates infrastructure that CAPI would traditionally view as the responsibility of the cluster “infrastructure provider”.
   103  
   104  A good example here is the API server load balancer:
   105  
   106  - Currently CAPI expects the control plane load balancer to be created by the cluster infrastructure provider and for its endpoint to be returned via the `ControlPlaneEndpoint` on the InfraCluster.
   107  - In AWS when creating an EKS cluster (which is the Kubernetes control plane), a load balancer is automatically created for you. When representing EKS in CAPI this naturally maps to a “control plane provider” but this causes complications as we need to report the endpoint back via the cluster infrastructure provider and not the control plane provider.
   108  
   109  ### Goals
   110  
   111  - Provide a recommendation of a consistent approach for representing Managed Kubernetes services in CAPI for new implementations.
   112    - It would be ideal for there to be consistency between providers when it comes to representing Managed Kubernetes services. However, it's unrealistic to ask providers to refactor their existing implementations.
   113  - Ensure the recommendation provides a working model for Managed Kubernetes integration with ClusterClass.
   114  - As a result of the recommendations of this proposal we should update the [Provider Implementers](../book/src/developer/providers/implementers.md) documentation to aid with future provider implementations.
   115  
   116  ### Non-Goals/Future Work
   117  
   118  - Enforce the Managed Kubernetes recommendations as a requirement for Cluster API providers when they implement Managed Kubernetes.
  - If providers that have already implemented Managed Kubernetes would like guidance on if/how they could move to align with the recommendations of this proposal, then discussions should be facilitated.
   120  - Provide advice in this proposal on how to refactor the existing implementations of managed Kubernetes in CAPA & CAPZ.
   121  - Propose a new architecture or API changes to CAPI for managed Kubernetes. This has been covered by the [Contract Changes to Support Managed Kubernetes CAEP](./20230407-flexible-managed-k8s-endpoints.md).
   122  - Be a concrete design for the GKE implementation in Cluster API Provider GCP (CAPG).
- Recommend how Managed Kubernetes services would leverage CAPI internally to run their offering.
   124  
   125  ## Proposal
   126  
   127  ### Personas
   128  
   129  #### Cluster Service Provider
   130  
The user hosting cluster control planes, responsible for up-time, the UI for fleet-wide alerts, and configuring a cloud account to host control planes in; views user-provisioned infra (available compute). Has cluster admin management.
   132  
   133  #### Cluster Service Consumer
   134  
   135  A user empowered to request control planes, request workers to a service provider, and drive upgrades or modify externalized configuration.
   136  
   137  #### Cluster Admin
   138  
A user with the cluster-admin role in the provisioned cluster, who may or may not have power over when/how the cluster is upgraded or configured.
   140  
   141  #### Cluster User
   142  
   143  A user who uses a provisioned cluster, which usually maps to a developer.
   144  
   145  ### User Stories
   146  
   147  #### Story 1
   148  
   149  As a cluster service consumer,
I want to use Cluster API to provision and manage Kubernetes clusters that utilize my service provider's Managed Kubernetes Service (i.e. EKS, AKS, GKE),
So that I don’t have to worry about the management/provisioning of control plane nodes, and so I can take advantage of any value-add services offered by the service provider.
   152  
   153  #### Story 2
   154  
   155  As a cluster service consumer,
I want to use Cluster API to provision and manage the lifecycle of worker nodes that utilize my cloud provider's managed instances (if they support them),
   157  So that I don't have to worry about the management of these instances.
   158  
   159  #### Story 3
   160  
   161  As a cluster admin,
   162  I want to be able to provision both “unmanaged” and “managed” Kubernetes clusters from the same management cluster,
   163  So that I can support different requirements & use cases as needed whilst using a single operating model.
   164  
   165  #### Story 4
   166  
   167  As a Cluster API provider developer,
   168  I want guidance on how to incorporate a managed Kubernetes service into my provider,
   169  So that its usage is compatible with Cluster API architecture/features
   170  And its usage is consistent with other providers.
   171  
   172  #### Story 5
   173  
   174  As a Cluster API provider developer,
   175  I want to enable the ClusterClass feature for a managed Kubernetes service,
   176  So that cluster users can take advantage of an improved UX with ClusterClass-based clusters.
   177  
   178  #### Story 6
   179  
   180  As a cluster service provider,
   181  I want to be able to offer “Managed Kubernetes” powered by CAPI,
   182  So that I can eliminate the responsibility of owning and SREing the Control Plane from the Cluster service consumer and cluster admin.
   183  
   184  ### Current State of Managed Kubernetes in CAPI
   185  
   186  #### EKS in CAPA
   187  
   188  - [Docs](https://cluster-api-aws.sigs.k8s.io/topics/eks/index.html)
   189  - Feature Status: GA
   190  - CRDs
   191    - AWSManagedControlPlane - provision EKS cluster
   192    - AWSManagedMachinePool - corresponds to EKS managed node pool
   193  - Supported Flavors
   194    - AWSManagedControlPlane with MachineDeployment / AWSMachine
   195    - AWSManagedControlPlane with MachinePool / AWSMachinePool
   196    - AWSManagedControlPlane with MachinePool / AWSManagedMachinePool
   197  - Bootstrap Provider
   198    - Cluster API bootstrap provider EKS (CABPE)
   199  - Features
   200    - Provisioning/managing an Amazon EKS Cluster
   201    - Upgrading the Kubernetes version of the EKS Cluster
   202    - Attaching self-managed machines as nodes to the EKS cluster
   203    - Creating a machine pool and attaching it to the EKS cluster (experimental)
   204    - Creating a managed machine pool and attaching it to the EKS cluster
   205    - Managing “EKS Addons”
   206    - Creating an EKS Fargate profile (experimental)
   207    - Managing aws-iam-authenticator configuration
   208  
   209  #### AKS in CAPZ
   210  
   211  - [Docs](https://capz.sigs.k8s.io/topics/managedcluster.html)
   212  - Feature Status: GA
   213  - CRDs
   214    - AzureManagedControlPlane, AzureManagedCluster - provision AKS cluster
   215    - AzureManagedMachinePool - corresponds to AKS node pool
   216  - Supported Flavor
   217    - AzureManagedControlPlane + AzureManagedCluster with AzureManagedMachinePool
   218  
   219  #### OKE in CAPOCI
   220  
   221  - [Docs](https://oracle.github.io/cluster-api-provider-oci/managed/managedcluster.html)
   222  - Feature Status: Experimental
   223  - CRDs
   224    - OCIManagedControlPlane, OCIManagedCluster - provision OKE cluster
   225    - OCIManagedMachinePool, OCIVirtualMachinePool - machine pool implementations
   226  - Supported Flavors:
   227    - OCIManagedControlPlane + OCIManagedCluster with OCIManagedMachinePool
   228    - OCIManagedControlPlane + OCIManagedCluster with OCIVirtualMachinePool
   229  
   230  #### GKE in CAPG
   231  
   232  - [Docs](https://github.com/kubernetes-sigs/cluster-api-provider-gcp/blob/main/docs/book/src/topics/gke/index.md)
   233  - Feature Status: Experimental
   234  - CRDs
   235    - GCPManagedControlPlane, GCPManagedCluster - provision GKE cluster
   236    - GCPManagedMachinePool - corresponds to managed node pool
- Supported Flavor
   238    - GCPManagedControlPlane + GCPManagedCluster with GCPManagedMachinePool
   239  
   240  ### Managed Kubernetes API Design Approaches
   241  
   242  When discussing the different approaches to represent a managed Kubernetes service in CAPI, we will be using the implementation of GKE support in CAPG as an example.
   243  
   244  > NOTE: “naming things is hard” so the names of the kinds/structs/fields used in the CAPG examples below are illustrative only and are not the focus of this proposal. There is debate, for example, as to whether `GCPManagedCluster` or `GKECluster` should be used.
   245  
   246  The following section discusses different API implementation options along with pros and cons of each.
   247  
   248  #### Option 1: Two kinds with a ControlPlane and a pass-through InfraCluster
   249  
**This option will no longer be needed when the changes documented in the [Contract Changes to Support Managed Kubernetes CAEP](./20230407-flexible-managed-k8s-endpoints.md) have been implemented, as option 2 can then be used for a simpler solution.**
   251  
   252  This option introduces 2 new resource kinds:
   253  
   254  - **GCPManagedControlPlane**: this represents both a control-plane (i.e. GKE) and infrastructure required for the cluster. It contains properties for both the general cloud infrastructure (that would traditionally be represented by an infrastructure cluster) and the managed Kubernetes control plane (that would traditionally be represented by a control plane provider).
   255  - **GCPManagedCluster**: contains the minimum properties in its spec and status to satisfy the [CAPI contract for an infrastructure cluster](../book/src/developer/providers/cluster-infrastructure.md) (i.e. ControlPlaneEndpoint, Ready condition). Its controller watches GCPManagedControlPlane and copies the ControlPlaneEndpoint field to GCPManagedCluster to report back to CAPI. This is used as a pass-through layer only.
   256  
   257  ```go
   258  type GCPManagedControlPlaneSpec struct {
   259      // Project is the name of the project to deploy the cluster to.
   260      Project string `json:"project"`
   261  
   262      // NetworkSpec encapsulates all things related to the GCP network.
   263      // +optional
   264      Network NetworkSpec `json:"network"`
   265  
   266      // AddonsConfig defines the addons to enable with the GKE cluster.
   267      // +optional
   268      AddonsConfig *AddonsConfig `json:"addonsConfig,omitempty"`
   269  
   270      // Logging contains the logging configuration for the GKE cluster.
   271      // +optional
   272      Logging *ControlPlaneLoggingSpec `json:"logging,omitempty"`
   273  
   274      // EnableKubernetesAlpha will indicate the kubernetes alpha features are enabled
   275      // +optional
   276      EnableKubernetesAlpha bool
   277  
   278      // ControlPlaneEndpoint represents the endpoint used to communicate with the control plane.
   279      // +optional
   280      ControlPlaneEndpoint clusterv1.APIEndpoint `json:"controlPlaneEndpoint"`
   281      ....
   282  }
   283  ```
   284  
   285  ```go
   286  type GCPManagedClusterSpec struct {
   287      // ControlPlaneEndpoint represents the endpoint used to communicate with the control plane.
   288      // +optional
   289      ControlPlaneEndpoint clusterv1.APIEndpoint `json:"controlPlaneEndpoint"`
   290  }
   291  ```
   292  
   293  **This is the design pattern currently used by CAPZ and CAPA**. [An example of how ManagedCluster watches ControlPlane in CAPZ.](https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/5c69b44ed847365525504b242da83b5e5da75e4f/controllers/azuremanagedcluster_controller.go#L71)
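
For illustration, a minimal sketch of the pass-through reconciliation is shown below. The names and structure are hypothetical and simplified; the real CAPZ/CAPA controllers also handle ownership, patching, paused clusters and error cases.

```go
// Hypothetical, simplified pass-through reconcile for GCPManagedCluster (option 1).
// It looks up the GCPManagedControlPlane referenced by the owning CAPI Cluster and
// copies the control plane endpoint so the infra cluster contract is satisfied.
func (r *GCPManagedClusterReconciler) reconcileNormal(ctx context.Context, cluster *clusterv1.Cluster, managedCluster *infrav1.GCPManagedCluster) error {
	controlPlane := &infrav1.GCPManagedControlPlane{}
	key := client.ObjectKey{
		Namespace: cluster.Spec.ControlPlaneRef.Namespace,
		Name:      cluster.Spec.ControlPlaneRef.Name,
	}
	if err := r.Client.Get(ctx, key, controlPlane); err != nil {
		return err
	}

	// Pass-through only: the endpoint is produced by the managed control plane,
	// but reported back to CAPI via the infra cluster.
	managedCluster.Spec.ControlPlaneEndpoint = controlPlane.Spec.ControlPlaneEndpoint
	managedCluster.Status.Ready = true
	return nil
}
```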
   294  
   295  **Pros**
   296  
   297  - Better aligned with CAPI’s traditional infra provider model
   298  - Works with ClusterClass
   299  
   300  **Cons**
   301  
- Need to maintain an infra cluster kind, which is a pass-through layer and has no other function. In addition to the CRD, controllers, webhooks and conversion webhooks need to be maintained.
- The infra provider doesn’t actually provision infrastructure: whilst it may meet the CAPI contract, the infrastructure is created via the control plane.
   304  
   305  #### Option 2: Just a ControlPlane kind and no InfraCluster
   306  
   307  **This option is enabled when the changes documented in the [Contract Changes to Support Managed Kubernetes CAEP](./20230407-flexible-managed-k8s-endpoints.md) have been implemented.**
   308  
   309  This option introduces 1 new resource kind:
   310  
   311  - **GCPManagedControlPlane**: this represents a control-plane (i.e. GKE) required for the cluster. It contains properties for the managed Kubernetes control plane (that would traditionally be represented by a control plane provider).
   312  
   313  ```go
   314  type GCPManagedControlPlaneSpec struct {
   315      // Project is the name of the project to deploy the cluster to.
   316      Project string `json:"project"`
   317  
   318      // AddonsConfig defines the addons to enable with the GKE cluster.
   319      // +optional
   320      AddonsConfig *AddonsConfig `json:"addonsConfig,omitempty"`
   321  
   322      // Logging contains the logging configuration for the GKE cluster.
   323      // +optional
   324      Logging *ControlPlaneLoggingSpec `json:"logging,omitempty"`
   325  
   326      // EnableKubernetesAlpha will indicate the kubernetes alpha features are enabled
   327      // +optional
   328      EnableKubernetesAlpha bool
   329  
   330      // ControlPlaneEndpoint represents the endpoint used to communicate with the control plane.
   331      // +optional
   332      ControlPlaneEndpoint clusterv1.APIEndpoint `json:"controlPlaneEndpoint"`
   333      ....
   334  }
   335  ```
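
With the updated contract the control plane object itself reports readiness and, optionally, the endpoint back to CAPI. A hypothetical status type for the spec above might look like the following; the type is illustrative, but the `ready`, `initialized` and `externalManagedControlPlane` fields follow the existing CAPI control plane provider contract.

```go
type GCPManagedControlPlaneStatus struct {
	// Ready denotes that the GKE control plane is reachable and accepting requests.
	Ready bool `json:"ready"`

	// Initialized denotes that the API server is provisioned, so that CAPI can start
	// bootstrapping worker nodes.
	Initialized bool `json:"initialized"`

	// ExternalManagedControlPlane signals to CAPI that there are no control plane
	// Machine objects to manage for this cluster.
	ExternalManagedControlPlane bool `json:"externalManagedControlPlane"`
}
```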
   336  
   337  **Pros**
   338  
   339  - Simpler implementation
  - No need for a pass-through infra cluster, as the control plane endpoint can be reported back via the control plane
   341  - Works with ClusterClass
   342  
   343  **Cons**
   344  
- If configuration/functionality related to the base infrastructure is included, then we have mixed concerns within the API type.
   346  
   347  #### Option 3: Two kinds with a Managed Control Plane and Managed Infra Cluster with Better Separation of Responsibilities
   348  
   349  This option more closely follows the original separation of concerns with the different CAPI provider types. With this option, 2 new resource kinds will be introduced:
   350  
- **GCPManagedControlPlane**: this represents the actual GKE control plane in GCP. Its spec would only contain properties that are specific to the provisioning & management of a GKE cluster in GCP (excluding worker nodes). It would not contain any properties related to the general GCP operating infrastructure, like the networking or project.
- **GCPManagedCluster**: this represents the properties needed to provision and manage the general GCP operating infrastructure for the cluster (i.e. project, networking, IAM). It would contain similar properties to **GCPCluster** and its reconciliation would be very similar.
   353  
   354  ```go
   355  type GCPManagedControlPlaneSpec struct {
   356      // AddonsConfig defines the addons to enable with the GKE cluster.
   357      // +optional
   358      AddonsConfig *AddonsConfig `json:"addonsConfig,omitempty"`
   359  
   360      // Logging contains the logging configuration for the GKE cluster.
   361      // +optional
   362      Logging *ControlPlaneLoggingSpec `json:"logging,omitempty"`
   363  
   364      // EnableKubernetesAlpha will indicate the kubernetes alpha features are enabled
   365      // +optional
   366      EnableKubernetesAlpha bool
   367  
   368      ...
   369  }
   370  ```
   371  
   372  ```go
   373  type GCPManagedClusterSpec struct {
   374      // Project is the name of the project to deploy the cluster to.
   375      Project string `json:"project"`
   376  
   377      // The GCP Region the cluster lives in.
   378      Region string `json:"region"`
   379  
   380      // ControlPlaneEndpoint represents the endpoint used to communicate with the control plane.
   381      // +optional
   382      ControlPlaneEndpoint clusterv1.APIEndpoint `json:"controlPlaneEndpoint"`
   383  
   384      // NetworkSpec encapsulates all things related to the GCP network.
   385      // +optional
   386      Network NetworkSpec `json:"network"`
   387  
    // FailureDomains is an optional field used to assign selected availability zones to a cluster.
    // If empty, it defaults to all the zones in the selected region; if specified, it overrides
    // the default zones.
   391      // +optional
   392      FailureDomains []string `json:"failureDomains,omitempty"`
   393  
   394      // AdditionalLabels is an optional set of tags to add to GCP resources managed by the GCP provider, in addition to the
   395      // ones added by default.
   396      // +optional
   397      AdditionalLabels Labels `json:"additionalLabels,omitempty"`
   398  
   399      ...
   400  }
   401  ```
   402  
   403  When the changes documented in the [Contract Changes to Support Managed Kubernetes CAEP](./20230407-flexible-managed-k8s-endpoints.md) have been implemented there is the option to return the control plane endpoint directly from the ControlPlane instead of passing it via the Infracluster.
   404  
   405  **Pros**
   406  
- Clearer separation between the lifecycle management of the general cloud infrastructure required for the cluster and the actual managed control plane (i.e. GKE in this example)
- Follows the original intentions of an “infrastructure” and “control-plane” provider
- Enables removal/addition of properties for a managed Kubernetes cluster that may differ from those of an unmanaged Kubernetes cluster
   410  - Works with ClusterClass
   411  
   412  **Cons**
   413  
- Duplication of API definitions and infrastructure-cluster reconciliation logic between GCPCluster and GCPManagedCluster
   415  
   416  ## Recommendations
   417  
   418  It is proposed that option 3 (two kinds with a managed control plane and managed infra cluster with better separation of responsibilities) is the best way to proceed for **new implementations** of managed Kubernetes in a provider where there is additional infrastructure required (e.g. VPC, resource groups).
   419  
   420  The reasons for this recommendation are as follows:
   421  
   422  - It adheres closely to the original separation of concerns between the infra and control plane providers
   423  - The infra cluster provisions and manages the general infrastructure required for the cluster but not the control plane.
   424  - By having a separate infra cluster API definition, it allows differences in the API between managed and unmanaged clusters.
   425  
   426  > This is the model currently adopted by the managed Kubernetes part of CAPG & CAPOCI and all non-managed K8s implementations.
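
To make the recommendation concrete, a `Cluster` using option 3 would reference the two kinds roughly as follows. This is a sketch only; the API groups/versions and names are illustrative.

```go
// Illustrative wiring of a CAPI Cluster to the two managed kinds from option 3.
// Assumes the usual clusterv1, corev1 and metav1 imports.
cluster := &clusterv1.Cluster{
	ObjectMeta: metav1.ObjectMeta{Name: "my-gke-cluster", Namespace: "default"},
	Spec: clusterv1.ClusterSpec{
		// General GCP operating infrastructure (project, network, ...).
		InfrastructureRef: &corev1.ObjectReference{
			APIVersion: "infrastructure.cluster.x-k8s.io/v1beta1", // illustrative
			Kind:       "GCPManagedCluster",
			Name:       "my-gke-cluster",
		},
		// The managed control plane (GKE itself).
		ControlPlaneRef: &corev1.ObjectReference{
			APIVersion: "infrastructure.cluster.x-k8s.io/v1beta1", // illustrative
			Kind:       "GCPManagedControlPlane",
			Name:       "my-gke-cluster-control-plane",
		},
	},
}
```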
   427  
   428  ### Vanilla Managed Kubernetes (i.e. without any additional infrastructure)
   429  
If the managed Kubernetes service does not require any base infrastructure to be set up before creating an instance of the service, then option 2 (just a ControlPlane kind and no InfraCluster) is the recommendation.
   431  
   432  This recommendation assumes that the changes documented in the [Contract Changes to Support Managed Kubernetes CAEP](./20230407-flexible-managed-k8s-endpoints.md) have been implemented. Until that point option 1 (Two kinds with a ControlPlane and a pass-through InfraCluster) will have to be used.
   433  
   434  ### Existing Managed Kubernetes Implementations
   435  
Providers like CAPZ and CAPA have already implemented managed Kubernetes support and there should be no requirement on them to move to option 3 (if there is additional infrastructure) or option 2 (if there isn't any additional infrastructure).
   437  
   438  There is a desire to have consistency across all managed Kubernetes implementations and across all cluster types (i.e. managed and unmanaged) but the choice remains with the providers of existing implementations.
   439  
   440  ### Additional notes on option 3
   441  
There are a number of cons listed for option 3. With 2 API kinds for the infra cluster (and associated controllers), there is a risk of code duplication. To reduce this, the 2 controllers can share common reconciliation code, as sketched below.
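
As an illustration of what sharing could look like (hypothetical names; this is a sketch, not the CAPG implementation), the network/project reconciliation could live in a service that both controllers call into:

```go
// Hypothetical shared service used by both the GCPCluster and GCPManagedCluster
// reconcilers so that the common infrastructure logic is written only once.
type NetworkService struct {
	// ... GCP API clients, scope/logging helpers, etc.
}

// Reconcile creates or updates the VPC, subnets and firewall rules described by
// the given NetworkSpec, regardless of whether the cluster is managed or unmanaged.
func (s *NetworkService) Reconcile(ctx context.Context, spec NetworkSpec) error {
	// ... shared implementation invoked from both controllers
	return nil
}
```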
   443  
   444  The user will need to be aware of when to use which specific infra cluster kind. In our example this means that a user will need to know when to use `GCPCluster` vs `GCPManagedCluster`. To give clear guidance to users, we will provide templates (including ClusterClasses) and documentation for both the unmanaged and managed varieties of clusters. If we used the same infra cluster kind across both unmanaged & managed (i.e. alternative 2) then we run the risk of complicating the API for the infra cluster & controller if the required properties diverge.
   445  
   446  ### Managed Node Groups for Worker Nodes
   447  
   448  Some cloud providers also offer Managed Node Groups as part of their Managed Kubernetes service as a way to provision worker nodes for a cluster. For example, in GCP there are Node Pools and in AWS there are EKS Managed Node Groups.
   449  
   450  There are 2 different ways to represent a group of machines in CAPI:
   451  
   452  - **MachineDeployments** - you specify the number of replicas of a machine template and CAPI will manage the creation of immutable Machine-Infrastructure Machine pairs via MachineSets. The user is responsible for explicitly declaring how many machines (a.k.a replicas) they want and these are provisioned and joined to the cluster.
- **MachinePools** - are similar to MachineDeployments in that they specify a number of machine replicas to be created and joined to the cluster. However, instead of using MachineSets to manage the lifecycle of individual machines, a provider implementer utilizes a cloud-provided solution to manage the lifecycle of the individual machines. Generally with a pool you don’t have to define an exact number of replicas; instead you have the option to supply a minimum and maximum number of nodes and let the cloud service manage scaling the number of replicas/nodes up and down. Examples of cloud-provided solutions are Auto Scale Groups (ASG) in AWS and Virtual Machine Scale Sets (VMSS) in Azure.
   454  
With the implementation of a managed node group, the cloud provider is responsible for managing the lifecycle of the individual machines that are used as nodes. This implies that a machine pool representation is needed which utilizes a cloud-provided solution to manage the lifecycle of machines.
   456  
   457  For our example, GCP offers Node Pools that will manage the lifecycle of a pool of machines that can scale up and down. We can use this service to implement machine pools:
   458  
   459  ```go
   460  type GCPManagedMachinePoolSpec struct {
   461      // Location specifies where the nodes should be created.
   462      Location []string `json:"location"`
   463  
   464      // The Kubernetes version for the node group.
   465      Version string `json:"version"`
   466  
   467      // MinNodeCount is the minimum number of nodes for one location.
   468      MinNodeCount int `json:"minNodeCount"`
   469  
   470      // MaxNodeCount is the maximum number of nodes for one location.
    MaxNodeCount int `json:"maxNodeCount"`
   472  
   473      ...
   474  }
   475  ```
   476  
   477  ### Provider Implementers Documentation
   478  
It's recommended that changes are made to the [Provider Implementers documentation](../book/src/developer/providers/cluster-infrastructure.md) based on the recommended approach for representing managed Kubernetes in Cluster API.
   480  
   481  Some of the areas of change (this is not an exhaustive list):
   482  
- A new "implementing managed Kubernetes" guide that contains details about how to represent a managed Kubernetes service in CAPI. The content will be based on the recommendations from this proposal along with other considerations such as managed nodes and add-on management.
   484  - Update the [Provider contracts documentation](../book/src/developer/providers/contracts.md) to state that the same kind should not be used to satisfy 2 different provider contracts.
- Update the [Cluster Infrastructure documentation](../book/src/developer/providers/cluster-infrastructure.md) to provide guidance on how to populate the `controlPlaneEndpoint` in the scenario where the control plane creates the API server load balancer. We should include sample code (see the sketch after this list).
- Update the [Control Plane Controller](../book/src/developer/architecture/controllers/control-plane.md) diagram for the managed Kubernetes services case. The Control Plane reconcile needs to start when `InfrastructureReady` is true.
   487  - Updates based on the changes documented in the [Contract Changes to Support Managed Kubernetes CAEP](./20230407-flexible-managed-k8s-endpoints.md).
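
As an indication of the kind of sample code referred to above, the control plane reconciler could surface the endpoint reported by the managed service. This is a hypothetical, simplified excerpt; `gkeClient` and `scope` are illustrative helpers.

```go
// Hypothetical, simplified excerpt from a managed control plane reconciler:
// once the cloud API reports the managed cluster's endpoint, surface it to CAPI.
gkeCluster, err := gkeClient.GetCluster(ctx, scope.ClusterName())
if err != nil {
	return ctrl.Result{}, err
}

scope.ControlPlane.Spec.ControlPlaneEndpoint = clusterv1.APIEndpoint{
	Host: gkeCluster.Endpoint,
	Port: 443,
}
scope.ControlPlane.Status.Ready = true
```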
   488  
   489  ## Other Considerations for CAPI
   490  
   491  ### ClusterClass support for MachinePool
   492  
- MachinePool is an important feature for managed Kubernetes as it is preferred over MachineDeployment to fully utilize the native capabilities, such as autoscaling, health-checking and zone balancing, provided by cloud providers’ node groups.
- AKS supports MachinePool-based worker nodes only, so ClusterClass support for MachinePool is required. See [this issue](https://github.com/kubernetes-sigs/cluster-api/issues/5991).
   495  
   496  ### clusterctl integration
   497  
- `clusterctl` assumes a minimal set of providers (core, bootstrap, control plane, infra) is required to form a valid management cluster. Currently, it does not expect a single provider to be many things at the same time.
   499  - EKS in CAPA has its own control plane provider and a bootstrap provider packaged in a single manager. Moving forward, it would be great to separate them out.
   500  
   501  ### Add-ons management
   502  
- EKS and AKS provide the ability to install add-ons (e.g., CNI, CSI, DNS) managed by cloud providers.
   504    - [EKS add-ons](https://docs.aws.amazon.com/eks/latest/userguide/eks-add-ons.html)
   505    - [AKS add-ons](https://docs.microsoft.com/en-us/azure/aks/integrations)
- CAPA and CAPZ have enabled support for cloud-provider-managed add-ons via their APIs
   507    - [CAPA](https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/main/controlplane/eks/api/v1beta1/awsmanagedcontrolplane_types.go#L155)
   508    - [CAPZ](https://github.com/kubernetes-sigs/cluster-api-provider-azure/pull/2095)
   509  - Managed Kubernetes implementations should be able to opt-in/opt-out of what will be provided by [CAPI’s add-ons orchestration solution](https://github.com/kubernetes-sigs/cluster-api/issues/5491)
   510  
   511  ## Alternatives
   512  
A number of different representations were also considered but discounted.
   514  
   515  ### Alternative 1: Single kind for Control Plane and Infrastructure
   516  
   517  This option introduces a new single resource kind:
   518  
   519  - **GCPManagedControlPlane**: this represents both a control-plane (i.e. GKE) and infrastructure required for the cluster. It contains properties for both the general cloud infrastructure (that would traditionally be represented by an infrastructure cluster) and the managed Kubernetes control plane (that would traditionally be represented by a control plane provider).
   520  
   521  ```go
   522  type GCPManagedControlPlaneSpec struct {
   523      // Project is the name of the project to deploy the cluster to.
   524      Project string `json:"project"`
   525  
   526      // NetworkSpec encapsulates all things related to the GCP network.
   527      // +optional
   528      Network NetworkSpec `json:"network"`
   529  
   530      // AddonsConfig defines the addons to enable with the GKE cluster.
   531      // +optional
   532      AddonsConfig *AddonsConfig `json:"addonsConfig,omitempty"`
   533  
   534      // Logging contains the logging configuration for the GKE cluster.
   535      // +optional
   536      Logging *ControlPlaneLoggingSpec `json:"logging,omitempty"`
   537  
   538      // EnableKubernetesAlpha will indicate the kubernetes alpha features are enabled
   539      // +optional
   540      EnableKubernetesAlpha bool
   541  
   542      // ControlPlaneEndpoint represents the endpoint used to communicate with the control plane.
   543      // +optional
   544      ControlPlaneEndpoint clusterv1.APIEndpoint `json:"controlPlaneEndpoint"`
   545      ....
   546  }
   547  ```
   548  
   549  **This was the design pattern originally used for the EKS implementation in CAPA.**
   550  
   551  #### Background: Why did EKS in CAPA choose this option?
   552  
CAPA decided to represent an EKS cluster as a CAPI control plane. This meant that the control plane is responsible for creating the API server load balancer.
   554  
Initially CAPA had an infrastructure cluster kind that reported back the control plane endpoint. This required less than ideal code in its controller to watch the control plane and copy its control plane endpoint value.
   556  
   557  As the infrastructure cluster kind only acted as a passthrough (to satisfy the contract with CAPI) it was decided that it would be removed and the control-plane kind (AWSManagedControlPlane) could be used to satisfy both the “infrastructure” and “control-plane” contracts. _This worked well until ClusterClass arrived with its expectation that the “infrastructure” and “control-plane” are 2 different resource kinds._
   558  
(Note: the above italicized text is no longer relevant once CAEP https://github.com/kubernetes-sigs/cluster-api/pull/8500 is implemented.)
   560  
Note that CAPZ had a similar discussion and an [issue](https://github.com/kubernetes-sigs/cluster-api-provider-azure/issues/1396) to remove AzureManagedCluster: "AzureManagedCluster is useless; let's remove it (and keep AzureManagedControlPlane)".
   562  
   563  **Pros**
   564  
   565  - A simple design with a single resource kind and controller.
   566  
   567  **Cons**
   568  
   569  - Doesn’t work with the current implementation of ClusterClass, which expects a separation of ControlPlane and Infrastructure.
   570  - Doesn’t provide separation of responsibilities between creating the general cloud infrastructure for the cluster and the actual cluster control plane.
- Managed Kubernetes looks different from unmanaged Kubernetes, where two separate kinds are used for the control plane and infrastructure. This would impact products building on top of CAPI.
   572  
   573  ### Alternative 2: Two kinds with a Managed Control Plane and Shared Infra Cluster with Better Separation of Responsibilities
   574  
   575  This option is a variation of option 3 and as such it more closely follows the original separation of concerns with the different CAPI provider types. The difference with this option compared to option 3 is that only 1 new resource kind is introduced:
   576  
- **GCPManagedControlPlane**: this represents the actual GKE control plane in GCP. Its spec would only contain properties that are specific to the provisioning & management of GKE. It would not contain any properties related to the general GCP operating infrastructure, like the networking or project.
   578  
   579  The general cluster infrastructure will be declared via the existing **GCPCluster** kind and reconciled via the existing controller.
   580  
However, this approach will require changes to the controller for **GCPCluster**. The steps to create the required infrastructure may be different between an unmanaged cluster and a GKE-based cluster. For example, for an unmanaged cluster a load balancer will need to be created, but with a GKE-based cluster this won’t be needed and instead we’d need to use the endpoint created as part of **GCPManagedControlPlane** reconciliation.
   582  
So the **GCPCluster** controller will need to know whether it's creating infrastructure for an unmanaged or managed cluster (probably by looking at the parent `Cluster`'s **controlPlaneRef**) and perform different steps, as sketched below.
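
A minimal sketch of that branching (hypothetical and simplified):

```go
// Hypothetical, simplified branching inside the GCPCluster reconciler (alternative 2).
// The controller inspects the owning Cluster's controlPlaneRef to decide which
// infrastructure steps apply.
managed := cluster.Spec.ControlPlaneRef != nil &&
	cluster.Spec.ControlPlaneRef.Kind == "GCPManagedControlPlane"

if managed {
	// GKE creates and manages the API server endpoint: skip load balancer creation
	// and wait for GCPManagedControlPlane reconciliation to surface the endpoint.
} else {
	// Unmanaged cluster: create the API server load balancer and report its address
	// via the usual controlPlaneEndpoint flow.
}
```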
   584  
   585  **Pros**
   586  
- Single infra cluster kind irrespective of whether you are creating an unmanaged or GKE-based cluster. It doesn’t require the user to pick the right one.
- Clear separation between cluster infrastructure and the actual managed (i.e. GKE) control plane
- Works with ClusterClass
   590  
   591  **Cons**
   592  
   593  - Additional complexity and logic in the infra cluster controller
   594  - API definition could be messy if only certain fields are required for one type of cluster
   595  
   596  ## Upgrade Strategy
   597  
   598  As mentioned in the goals section, it is up to providers with existing implementations, CAPA and CAPZ, to decide how they want to proceed.
   599  
- EKS and AKS are in different lifecycle stages (EKS is GA while AKS is experimental), which requires different considerations for making breaking changes to APIs.
- Their current designs differ (EKS uses option 1 while AKS uses option 2), which makes it difficult to propose a single upgrade strategy.
   602  
   603  ## Implementation History
   604  
   605  - [x] 03/01/2022: Had a community meeting to discuss an issue regarding ClusterClass support for EKS and managed k8s in CAPI
   606  - [x] 03/17/2022: Compile a Google Doc following the CAEP template
   607  - [x] 04/20/2022: Present proposal at a community meeting
   608  - [x] 07/27/2022: Move the proposal to a PR in CAPI repo
   609  - [x] 06/15/2023: Updates as a result of the [Contract Changes to Support Managed Kubernetes CAEP](./20230407-flexible-managed-k8s-endpoints.md) and also updates as a result of the current state of managed k8s in CAPI.