---
title: Managed Kubernetes in CAPI
authors:
  - "@pydctw"
  - "@richardcase"
reviewers:
  - "@alexeldeib"
  - "@CecileRobertMichon"
  - "@enxebre"
  - "@fabriziopandini"
  - "@jackfrancis"
  - "@joekr"
  - "@sbueringer"
  - "@shyamradhakrishnan"
  - "@yastij"
creation-date: 2022-07-25
last-updated: 2023-06-15
status: implementable
see-also: ./20230407-flexible-managed-k8s-endpoints.md
replaces:
superseded-by:
---

# Managed Kubernetes in CAPI

## Table of Contents

<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->

- [Glossary](#glossary)
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals/Future Work](#non-goalsfuture-work)
- [Proposal](#proposal)
  - [Personas](#personas)
    - [Cluster Service Provider](#cluster-service-provider)
    - [Cluster Service Consumer](#cluster-service-consumer)
    - [Cluster Admin](#cluster-admin)
    - [Cluster User](#cluster-user)
  - [User Stories](#user-stories)
    - [Story 1](#story-1)
    - [Story 2](#story-2)
    - [Story 3](#story-3)
    - [Story 4](#story-4)
    - [Story 5](#story-5)
    - [Story 6](#story-6)
  - [Current State of Managed Kubernetes in CAPI](#current-state-of-managed-kubernetes-in-capi)
    - [EKS in CAPA](#eks-in-capa)
    - [AKS in CAPZ](#aks-in-capz)
    - [OKE in CAPOCI](#oke-in-capoci)
    - [GKE in CAPG](#gke-in-capg)
  - [Managed Kubernetes API Design Approaches](#managed-kubernetes-api-design-approaches)
    - [Option 1: Two kinds with a ControlPlane and a pass-through InfraCluster](#option-1-two-kinds-with-a-controlplane-and-a-pass-through-infracluster)
    - [Option 2: Just a ControlPlane kind and no InfraCluster](#option-2-just-a-controlplane-kind-and-no-infracluster)
    - [Option 3: Two kinds with a Managed Control Plane and Managed Infra Cluster with Better Separation of Responsibilities](#option-3-two-kinds-with-a-managed-control-plane-and-managed-infra-cluster-with-better-separation-of-responsibilities)
- [Recommendations](#recommendations)
  - [Vanilla Managed Kubernetes (i.e. without any additional infrastructure)](#vanilla-managed-kubernetes-ie-without-any-additional-infrastructure)
  - [Existing Managed Kubernetes Implementations](#existing-managed-kubernetes-implementations)
  - [Additional notes on option 3](#additional-notes-on-option-3)
  - [Managed Node Groups for Worker Nodes](#managed-node-groups-for-worker-nodes)
  - [Provider Implementers Documentation](#provider-implementers-documentation)
- [Other Considerations for CAPI](#other-considerations-for-capi)
  - [ClusterClass support for MachinePool](#clusterclass-support-for-machinepool)
  - [clusterctl integration](#clusterctl-integration)
  - [Add-ons management](#add-ons-management)
- [Alternatives](#alternatives)
  - [Alternative 1: Single kind for Control Plane and Infrastructure](#alternative-1-single-kind-for-control-plane-and-infrastructure)
    - [Background: Why did EKS in CAPA choose this option?](#background-why-did-eks-in-capa-choose-this-option)
  - [Alternative 2: Two kinds with a Managed Control Plane and Shared Infra Cluster with Better Separation of Responsibilities](#alternative-2-two-kinds-with-a-managed-control-plane-and-shared-infra-cluster-with-better-separation-of-responsibilities)
- [Upgrade Strategy](#upgrade-strategy)
- [Implementation History](#implementation-history)

<!-- END doctoc generated TOC please keep comment here to allow auto update -->

## Glossary

- **Managed Kubernetes** - a Kubernetes service offered/hosted by a service provider where the control plane is run & managed by the service provider. As a cluster service consumer, you don't have to worry about managing/operating the control plane machines. Additionally, the managed Kubernetes service may extend to cover running managed worker nodes. Examples are EKS in AWS and AKS in Azure. This is different from a traditional implementation in Cluster API, where the control plane and worker nodes are deployed and managed by the cluster admin.
- **Unmanaged Kubernetes** - a Kubernetes cluster where a cluster admin is responsible for provisioning and operating the control plane and worker nodes. In Cluster API this traditionally means a Kubeadm bootstrapped cluster on infrastructure machines (virtual or physical).
- **Managed Worker Node** - an individual Kubernetes worker node where the underlying compute (VM or bare-metal) is provisioned and managed by the service provider. This usually includes the joining of the newly provisioned node into a Managed Kubernetes cluster. The lifecycle is normally controlled via a higher-level construct such as a Managed Node Group.
- **Managed Node Group** - a service that a service provider offers that automates the provisioning of managed worker nodes. Depending on the service provider, this group of nodes could contain a fixed number of replicas or it might contain a dynamic pool of replicas that auto-scales up and down. Examples are Node Pools in GCP and EKS managed node groups.
- **Cluster Infrastructure Provider (Infrastructure)** - an infrastructure provider supplies whatever prerequisites are necessary for creating & running clusters such as networking, load balancers, firewall rules, and so on. ([docs](../book/src/developer/providers/cluster-infrastructure.md))
- **ControlPlane Provider (ControlPlane)** - a control plane provider instantiates a Kubernetes control plane consisting of k8s control plane components such as kube-apiserver, etcd, kube-scheduler and kube-controller-manager. ([docs](../book/src/developer/architecture/controllers/control-plane.md#control-plane-provider))
- **MachineDeployment** - a MachineDeployment orchestrates deployments over a fleet of MachineSets, which is an immutable abstraction over Machines. ([docs](../book/src/developer/architecture/controllers/machine-deployment.md))
- **MachinePool (experimental)** - a MachinePool is similar to a MachineDeployment in that they both define configuration and policy for how a set of machines are managed. While the MachineDeployment uses MachineSets to orchestrate updates to the Machines, MachinePool delegates the responsibility to a cloud provider specific resource such as AWS Auto Scale Groups, GCP Managed Instance Groups, and Azure Virtual Machine Scale Sets. ([docs](./20190919-machinepool-api.md))

## Summary

This proposal discusses various options for how managed Kubernetes services could be represented in Cluster API by providers. Recommendations will be made on which approach(es) to adopt for new implementations by providers, with a view of eventually having consistency across provider implementations.

## Motivation

Cluster API was originally designed with unmanaged Kubernetes clusters in mind as the cloud providers did not offer managed Kubernetes services (except GCP with GKE). However, all 3 main cloud providers (and many other cloud/service providers) now have managed Kubernetes services.

Some Cluster API providers (i.e. Azure with AKS first and then AWS with EKS) have implemented support for their managed Kubernetes services. These implementations have followed the existing documentation & contracts (that were designed for unmanaged Kubernetes) and have ended up with 2 different implementations.

While working on supporting ClusterClass for EKS in Cluster API Provider AWS (CAPA), it was discovered that the current implementation of EKS within CAPA, where a single resource kind (AWSManagedControlPlane) is used for both ControlPlane and Infrastructure, is incompatible with other parts of CAPI that assume the two objects are different (reference [issue here](https://github.com/kubernetes-sigs/cluster-api/issues/6126)).

Separation of ControlPlane and Infrastructure is expected for the ClusterClass implementation to work correctly. However, after the changes documented in the [Contract Changes to Support Managed Kubernetes CAEP](./20230407-flexible-managed-k8s-endpoints.md) have been implemented there is the option to supply only the control plane, but you still cannot supply the same resource for both.

The responsibilities between the CAPI control plane and infrastructure are blurred with a managed Kubernetes service like AKS or EKS. For example, when you create an EKS control plane in AWS it also creates infrastructure that CAPI would traditionally view as the responsibility of the cluster "infrastructure provider".

A good example here is the API server load balancer:

- Currently CAPI expects the control plane load balancer to be created by the cluster infrastructure provider and for its endpoint to be returned via the `ControlPlaneEndpoint` on the InfraCluster (see the contract sketch below).
- In AWS when creating an EKS cluster (which is the Kubernetes control plane), a load balancer is automatically created for you. When representing EKS in CAPI this naturally maps to a "control plane provider", but this causes complications as we need to report the endpoint back via the cluster infrastructure provider and not the control plane provider.
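
For reference, the contract surface involved here is small. Below is a minimal sketch of the fields in play; `clusterv1.APIEndpoint` is the upstream CAPI type, while the surrounding InfraCluster struct is illustrative only:

```go
package v1beta1

import clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"

// ExampleInfraClusterSpec shows the endpoint field the cluster infrastructure
// contract expects the infra provider to populate. CAPI copies this value to
// Cluster.Spec.ControlPlaneEndpoint.
type ExampleInfraClusterSpec struct {
	// ControlPlaneEndpoint is the host and port of the API server, using the
	// upstream type:
	//
	//   type APIEndpoint struct {
	//       Host string `json:"host"`
	//       Port int32  `json:"port"`
	//   }
	ControlPlaneEndpoint clusterv1.APIEndpoint `json:"controlPlaneEndpoint"`
}

// ExampleInfraClusterStatus shows the readiness signal the contract expects.
type ExampleInfraClusterStatus struct {
	// Ready is set once the cluster infrastructure is fully provisioned.
	Ready bool `json:"ready"`
}
```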

### Goals

- Provide a recommendation of a consistent approach for representing Managed Kubernetes services in CAPI for new implementations.
  - It would be ideal for there to be consistency between providers when it comes to representing Managed Kubernetes services. However, it's unrealistic to ask providers to refactor their existing implementations.
- Ensure the recommendation provides a working model for Managed Kubernetes integration with ClusterClass.
- As a result of the recommendations of this proposal, we should update the [Provider Implementers](../book/src/developer/providers/implementers.md) documentation to aid with future provider implementations.

### Non-Goals/Future Work

- Enforce the Managed Kubernetes recommendations as a requirement for Cluster API providers when they implement Managed Kubernetes.
  - If providers have already implemented Managed Kubernetes and would like guidance on if/how they could move to be aligned with the recommendations of this proposal, then discussions should be facilitated.
- Provide advice in this proposal on how to refactor the existing implementations of managed Kubernetes in CAPA & CAPZ.
- Propose a new architecture or API changes to CAPI for managed Kubernetes. This has been covered by the [Contract Changes to Support Managed Kubernetes CAEP](./20230407-flexible-managed-k8s-endpoints.md).
- Be a concrete design for the GKE implementation in Cluster API Provider GCP (CAPG).
- Recommend how Managed Kubernetes services would leverage CAPI internally to run their offering.

## Proposal

### Personas

#### Cluster Service Provider

The user hosting cluster control planes, responsible for up-time, UI for fleet-wide alerts, configuring a cloud account to host control planes in, views user provisioned infra (available compute). Has cluster admin management.

#### Cluster Service Consumer

A user empowered to request control planes, request workers from a service provider, and drive upgrades or modify externalized configuration.

#### Cluster Admin

A user with the cluster-admin role in the provisioned cluster, who may or may not have power over when/how the cluster is upgraded or configured.

#### Cluster User

A user who uses a provisioned cluster, which usually maps to a developer.

### User Stories

#### Story 1

As a cluster service consumer,
I want to use Cluster API to provision and manage Kubernetes clusters that utilize my service provider's Managed Kubernetes service (i.e. EKS, AKS, GKE),
So that I don't have to worry about the management/provisioning of control plane nodes, and so I can take advantage of any value-add services offered by the service provider.

#### Story 2

As a cluster service consumer,
I want to use Cluster API to provision and manage the lifecycle of worker nodes that utilize my cloud provider's managed instances (if they support them),
So that I don't have to worry about the management of these instances.

#### Story 3

As a cluster admin,
I want to be able to provision both "unmanaged" and "managed" Kubernetes clusters from the same management cluster,
So that I can support different requirements & use cases as needed whilst using a single operating model.

#### Story 4

As a Cluster API provider developer,
I want guidance on how to incorporate a managed Kubernetes service into my provider,
So that its usage is compatible with Cluster API architecture/features
And its usage is consistent with other providers.

#### Story 5

As a Cluster API provider developer,
I want to enable the ClusterClass feature for a managed Kubernetes service,
So that cluster users can take advantage of an improved UX with ClusterClass-based clusters.

#### Story 6

As a cluster service provider,
I want to be able to offer "Managed Kubernetes" powered by CAPI,
So that I can remove the responsibility of owning and SREing the control plane from the cluster service consumer and cluster admin.

### Current State of Managed Kubernetes in CAPI

#### EKS in CAPA

- [Docs](https://cluster-api-aws.sigs.k8s.io/topics/eks/index.html)
- Feature Status: GA
- CRDs
  - AWSManagedControlPlane - provision EKS cluster
  - AWSManagedMachinePool - corresponds to EKS managed node pool
- Supported Flavors
  - AWSManagedControlPlane with MachineDeployment / AWSMachine
  - AWSManagedControlPlane with MachinePool / AWSMachinePool
  - AWSManagedControlPlane with MachinePool / AWSManagedMachinePool
- Bootstrap Provider
  - Cluster API bootstrap provider EKS (CABPE)
- Features
  - Provisioning/managing an Amazon EKS cluster
  - Upgrading the Kubernetes version of the EKS cluster
  - Attaching self-managed machines as nodes to the EKS cluster
  - Creating a machine pool and attaching it to the EKS cluster (experimental)
  - Creating a managed machine pool and attaching it to the EKS cluster
  - Managing "EKS Addons"
  - Creating an EKS Fargate profile (experimental)
  - Managing aws-iam-authenticator configuration

#### AKS in CAPZ

- [Docs](https://capz.sigs.k8s.io/topics/managedcluster.html)
- Feature Status: GA
- CRDs
  - AzureManagedControlPlane, AzureManagedCluster - provision AKS cluster
  - AzureManagedMachinePool - corresponds to AKS node pool
- Supported Flavor
  - AzureManagedControlPlane + AzureManagedCluster with AzureManagedMachinePool

#### OKE in CAPOCI

- [Docs](https://oracle.github.io/cluster-api-provider-oci/managed/managedcluster.html)
- Feature Status: Experimental
- CRDs
  - OCIManagedControlPlane, OCIManagedCluster - provision OKE cluster
  - OCIManagedMachinePool, OCIVirtualMachinePool - machine pool implementations
- Supported Flavors
  - OCIManagedControlPlane + OCIManagedCluster with OCIManagedMachinePool
  - OCIManagedControlPlane + OCIManagedCluster with OCIVirtualMachinePool

#### GKE in CAPG

- [Docs](https://github.com/kubernetes-sigs/cluster-api-provider-gcp/blob/main/docs/book/src/topics/gke/index.md)
- Feature Status: Experimental
- CRDs
  - GCPManagedControlPlane, GCPManagedCluster - provision GKE cluster
  - GCPManagedMachinePool - corresponds to managed node pool
- Supported Flavor
  - GCPManagedControlPlane + GCPManagedCluster with GCPManagedMachinePool

### Managed Kubernetes API Design Approaches

When discussing the different approaches to represent a managed Kubernetes service in CAPI, we will be using the implementation of GKE support in CAPG as an example.

> NOTE: "naming things is hard" so the names of the kinds/structs/fields used in the CAPG examples below are illustrative only and are not the focus of this proposal. There is debate, for example, as to whether `GCPManagedCluster` or `GKECluster` should be used.

The following section discusses different API implementation options along with the pros and cons of each.

#### Option 1: Two kinds with a ControlPlane and a pass-through InfraCluster

**This option will no longer be needed when the changes documented in the [Contract Changes to Support Managed Kubernetes CAEP](./20230407-flexible-managed-k8s-endpoints.md) have been implemented, as option 2 can be used for a simpler solution.**

This option introduces 2 new resource kinds:

- **GCPManagedControlPlane**: this represents both a control plane (i.e. GKE) and the infrastructure required for the cluster. It contains properties for both the general cloud infrastructure (that would traditionally be represented by an infrastructure cluster) and the managed Kubernetes control plane (that would traditionally be represented by a control plane provider).
- **GCPManagedCluster**: contains the minimum properties in its spec and status to satisfy the [CAPI contract for an infrastructure cluster](../book/src/developer/providers/cluster-infrastructure.md) (i.e. ControlPlaneEndpoint, Ready condition). Its controller watches GCPManagedControlPlane and copies the ControlPlaneEndpoint field to GCPManagedCluster to report back to CAPI. This is used as a pass-through layer only (a reconciliation sketch is shown at the end of this option).

```go
type GCPManagedControlPlaneSpec struct {
	// Project is the name of the project to deploy the cluster to.
	Project string `json:"project"`

	// NetworkSpec encapsulates all things related to the GCP network.
	// +optional
	Network NetworkSpec `json:"network"`

	// AddonsConfig defines the addons to enable with the GKE cluster.
	// +optional
	AddonsConfig *AddonsConfig `json:"addonsConfig,omitempty"`

	// Logging contains the logging configuration for the GKE cluster.
	// +optional
	Logging *ControlPlaneLoggingSpec `json:"logging,omitempty"`

	// EnableKubernetesAlpha indicates that Kubernetes alpha features are enabled.
	// +optional
	EnableKubernetesAlpha bool `json:"enableKubernetesAlpha,omitempty"`

	// ControlPlaneEndpoint represents the endpoint used to communicate with the control plane.
	// +optional
	ControlPlaneEndpoint clusterv1.APIEndpoint `json:"controlPlaneEndpoint"`
	...
}
```

```go
type GCPManagedClusterSpec struct {
	// ControlPlaneEndpoint represents the endpoint used to communicate with the control plane.
	// +optional
	ControlPlaneEndpoint clusterv1.APIEndpoint `json:"controlPlaneEndpoint"`
}
```

**This is the design pattern currently used by CAPZ and CAPA.** [An example of how ManagedCluster watches ControlPlane in CAPZ.](https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/5c69b44ed847365525504b242da83b5e5da75e4f/controllers/azuremanagedcluster_controller.go#L71)

**Pros**

- Better aligned with CAPI's traditional infra provider model
- Works with ClusterClass

**Cons**

- Need to maintain an infra cluster kind which is a pass-through layer and has no other function. In addition to the CRD, controllers, webhooks and conversion webhooks need to be maintained.
- The infra provider doesn't actually provision infrastructure: whilst it may meet the CAPI contract, the infrastructure is created via the control plane.
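
To make the pass-through behaviour concrete, below is a minimal reconciliation sketch. It assumes controller-runtime and the illustrative `infrav1` types above (the import path is hypothetical); patching, error handling and the watch configuration are omitted:

```go
package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	"sigs.k8s.io/cluster-api/util"

	infrav1 "example.dev/provider/api/v1beta1" // hypothetical path for the illustrative types
)

type GCPManagedClusterReconciler struct {
	client.Client
}

func (r *GCPManagedClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	gcpManagedCluster := &infrav1.GCPManagedCluster{}
	if err := r.Get(ctx, req.NamespacedName, gcpManagedCluster); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Locate the owning CAPI Cluster and, via its controlPlaneRef, the
	// GCPManagedControlPlane that knows the real API server endpoint.
	cluster, err := util.GetOwnerCluster(ctx, r.Client, gcpManagedCluster.ObjectMeta)
	if err != nil {
		return ctrl.Result{}, err
	}
	if cluster == nil || cluster.Spec.ControlPlaneRef == nil {
		// The owning Cluster or its controlPlaneRef is not set yet.
		return ctrl.Result{}, nil
	}

	controlPlane := &infrav1.GCPManagedControlPlane{}
	cpKey := client.ObjectKey{
		Namespace: cluster.Spec.ControlPlaneRef.Namespace,
		Name:      cluster.Spec.ControlPlaneRef.Name,
	}
	if err := r.Get(ctx, cpKey, controlPlane); err != nil {
		return ctrl.Result{}, err
	}

	// Pass-through: no infrastructure is created here; the endpoint is copied
	// and readiness reported purely to satisfy the infra cluster contract.
	gcpManagedCluster.Spec.ControlPlaneEndpoint = controlPlane.Spec.ControlPlaneEndpoint
	gcpManagedCluster.Status.Ready = true

	return ctrl.Result{}, nil
}
```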

#### Option 2: Just a ControlPlane kind and no InfraCluster

**This option is enabled when the changes documented in the [Contract Changes to Support Managed Kubernetes CAEP](./20230407-flexible-managed-k8s-endpoints.md) have been implemented.**

This option introduces 1 new resource kind:

- **GCPManagedControlPlane**: this represents a control plane (i.e. GKE) required for the cluster. It contains properties for the managed Kubernetes control plane (that would traditionally be represented by a control plane provider).

```go
type GCPManagedControlPlaneSpec struct {
	// Project is the name of the project to deploy the cluster to.
	Project string `json:"project"`

	// AddonsConfig defines the addons to enable with the GKE cluster.
	// +optional
	AddonsConfig *AddonsConfig `json:"addonsConfig,omitempty"`

	// Logging contains the logging configuration for the GKE cluster.
	// +optional
	Logging *ControlPlaneLoggingSpec `json:"logging,omitempty"`

	// EnableKubernetesAlpha indicates that Kubernetes alpha features are enabled.
	// +optional
	EnableKubernetesAlpha bool `json:"enableKubernetesAlpha,omitempty"`

	// ControlPlaneEndpoint represents the endpoint used to communicate with the control plane.
	// +optional
	ControlPlaneEndpoint clusterv1.APIEndpoint `json:"controlPlaneEndpoint"`
	...
}
```

**Pros**

- Simpler implementation
- No need for a pass-through infra cluster, as the control plane endpoint can be reported back via the control plane itself (see the sketch below)
- Works with ClusterClass

**Cons**

- If configuration/functionality related to the base infrastructure is included, then we have mixed concerns within the API type.
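
A sketch of how endpoint reporting could look under the updated contract. The `gkeService` interface and its `ReconcileCluster` helper are hypothetical stand-ins for a real GKE API wrapper; the point is only that the control plane object carries the endpoint itself, so no InfraCluster is needed:

```go
package controllers

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"

	infrav1 "example.dev/provider/api/v1beta1" // hypothetical, as above
)

// gkeService is a hypothetical thin wrapper around the GKE admin API.
type gkeService interface {
	// ReconcileCluster creates/updates the GKE cluster and returns its endpoint.
	ReconcileCluster(ctx context.Context, spec *infrav1.GCPManagedControlPlaneSpec) (endpoint string, err error)
}

type GCPManagedControlPlaneReconciler struct {
	gke gkeService
}

func (r *GCPManagedControlPlaneReconciler) reconcileNormal(ctx context.Context, controlPlane *infrav1.GCPManagedControlPlane) error {
	endpoint, err := r.gke.ReconcileCluster(ctx, &controlPlane.Spec)
	if err != nil {
		return err
	}

	// With the updated contract the control plane object reports the endpoint
	// directly and CAPI copies it to Cluster.Spec.ControlPlaneEndpoint, so no
	// pass-through InfraCluster is required.
	controlPlane.Spec.ControlPlaneEndpoint = clusterv1.APIEndpoint{
		Host: endpoint,
		Port: 443, // GKE serves the API server on 443
	}
	controlPlane.Status.Ready = true
	return nil
}
```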

#### Option 3: Two kinds with a Managed Control Plane and Managed Infra Cluster with Better Separation of Responsibilities

This option more closely follows the original separation of concerns with the different CAPI provider types. With this option, 2 new resource kinds will be introduced:

- **GCPManagedControlPlane**: this represents the actual GKE control plane in GCP. Its spec would only contain properties that are specific to the provisioning & management of a GKE cluster in GCP (excluding worker nodes). It would not contain any properties related to the general GCP operating infrastructure, like the networking or project.
- **GCPManagedCluster**: this represents the properties needed to provision and manage the general GCP operating infrastructure for the cluster (i.e. project, networking, IAM). It would contain similar properties to **GCPCluster** and its reconciliation would be very similar.

```go
type GCPManagedControlPlaneSpec struct {
	// AddonsConfig defines the addons to enable with the GKE cluster.
	// +optional
	AddonsConfig *AddonsConfig `json:"addonsConfig,omitempty"`

	// Logging contains the logging configuration for the GKE cluster.
	// +optional
	Logging *ControlPlaneLoggingSpec `json:"logging,omitempty"`

	// EnableKubernetesAlpha indicates that Kubernetes alpha features are enabled.
	// +optional
	EnableKubernetesAlpha bool `json:"enableKubernetesAlpha,omitempty"`

	...
}
```

```go
type GCPManagedClusterSpec struct {
	// Project is the name of the project to deploy the cluster to.
	Project string `json:"project"`

	// The GCP Region the cluster lives in.
	Region string `json:"region"`

	// ControlPlaneEndpoint represents the endpoint used to communicate with the control plane.
	// +optional
	ControlPlaneEndpoint clusterv1.APIEndpoint `json:"controlPlaneEndpoint"`

	// NetworkSpec encapsulates all things related to the GCP network.
	// +optional
	Network NetworkSpec `json:"network"`

	// FailureDomains is an optional field which is used to assign selected availability zones to a cluster.
	// If empty, it defaults to all the zones in the selected region; if specified, it overrides the default zones.
	// +optional
	FailureDomains []string `json:"failureDomains,omitempty"`

	// AdditionalLabels is an optional set of tags to add to GCP resources managed by the GCP provider, in addition to the
	// ones added by default.
	// +optional
	AdditionalLabels Labels `json:"additionalLabels,omitempty"`

	...
}
```

When the changes documented in the [Contract Changes to Support Managed Kubernetes CAEP](./20230407-flexible-managed-k8s-endpoints.md) have been implemented, there is the option to return the control plane endpoint directly from the ControlPlane instead of passing it via the InfraCluster.

**Pros**

- Clearer separation between the lifecycle management of the general cloud infrastructure required for the cluster and the actual managed control plane (i.e. GKE in this example)
- Follows the original intentions of an "infrastructure" and "control-plane" provider
- Enables removal/addition of properties for a managed Kubernetes cluster that may be different compared to an unmanaged cluster
- Works with ClusterClass

**Cons**

- Duplication of API definitions between GCPCluster and GCPManagedCluster, and duplicated reconciliation logic for the infrastructure cluster

## Recommendations

It is proposed that option 3 (two kinds with a managed control plane and managed infra cluster with better separation of responsibilities) is the best way to proceed for **new implementations** of managed Kubernetes in a provider where there is additional infrastructure required (e.g. VPC, resource groups).

The reasons for this recommendation are as follows:

- It adheres closely to the original separation of concerns between the infra and control plane providers.
- The infra cluster provisions and manages the general infrastructure required for the cluster but not the control plane.
- By having a separate infra cluster API definition, it allows differences in the API between managed and unmanaged clusters.

> This is the model currently adopted by the managed Kubernetes part of CAPG & CAPOCI and all non-managed K8s implementations.

### Vanilla Managed Kubernetes (i.e. without any additional infrastructure)

If the managed Kubernetes service does not require any base infrastructure to be set up before creating the instance of the service, then option 2 (just a ControlPlane kind and no InfraCluster) is the recommendation.

This recommendation assumes that the changes documented in the [Contract Changes to Support Managed Kubernetes CAEP](./20230407-flexible-managed-k8s-endpoints.md) have been implemented. Until that point, option 1 (two kinds with a ControlPlane and a pass-through InfraCluster) will have to be used.

### Existing Managed Kubernetes Implementations

Providers like CAPZ and CAPA have already implemented managed Kubernetes support, and there should be no requirement on them to move to option 3 (if there is additional infrastructure) or option 2 (if there isn't any additional infrastructure).

There is a desire to have consistency across all managed Kubernetes implementations and across all cluster types (i.e. managed and unmanaged), but the choice remains with the providers of existing implementations.

### Additional notes on option 3

There are a number of cons listed for option 3. Having 2 API kinds for the infra cluster (and associated controllers) carries a risk of code duplication. To reduce this, the 2 controllers can utilize shared reconciliation code (see the sketch at the end of this section).

The user will need to be aware of when to use which specific infra cluster kind. In our example this means that a user will need to know when to use `GCPCluster` vs `GCPManagedCluster`. To give clear guidance to users, we will provide templates (including ClusterClasses) and documentation for both the unmanaged and managed varieties of clusters. If we used the same infra cluster kind across both unmanaged & managed (i.e. alternative 2), then we would run the risk of complicating the API for the infra cluster & controller if the required properties diverge.
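
A minimal sketch of how the duplication can be kept in check. The scope interface and helper names are illustrative; the idea is simply that both the `GCPCluster` and `GCPManagedCluster` controllers call the same shared function:

```go
package controllers

import "context"

// BaseClusterScope abstracts over GCPCluster and GCPManagedCluster so the
// shared reconciliation below can be driven by either infra cluster kind.
// The interface and its methods are illustrative only.
type BaseClusterScope interface {
	Project() string
	Region() string
}

// reconcileBaseInfrastructure holds the logic common to both infra cluster
// kinds: project checks, network/subnet creation, firewall rules, and so on.
// Each controller implements BaseClusterScope for its own kind and delegates
// here, so the common logic is written once.
func reconcileBaseInfrastructure(ctx context.Context, scope BaseClusterScope) error {
	// e.g. ensure the VPC and subnets exist for scope.Project() in
	// scope.Region(); the actual cloud API calls are omitted.
	return nil
}
```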

### Managed Node Groups for Worker Nodes

Some cloud providers also offer Managed Node Groups as part of their Managed Kubernetes service, as a way to provision worker nodes for a cluster. For example, in GCP there are Node Pools and in AWS there are EKS Managed Node Groups.

There are 2 different ways to represent a group of machines in CAPI:

- **MachineDeployments** - you specify the number of replicas of a machine template and CAPI will manage the creation of immutable Machine-Infrastructure Machine pairs via MachineSets. The user is responsible for explicitly declaring how many machines (a.k.a. replicas) they want, and these are provisioned and joined to the cluster.
- **MachinePools** - are similar to MachineDeployments in that they specify a number of machine replicas to be created and joined to the cluster. However, instead of using MachineSets to manage the lifecycle of individual machines, a provider implementer utilizes a cloud provided solution to manage the lifecycle of the individual machines instead. Generally with a pool you don't have to define an exact number of replicas; instead you have the option to supply a minimum and maximum number of nodes and let the cloud service manage scaling the number of replicas/nodes up and down. Examples of cloud provided solutions are Auto Scale Groups (ASG) in AWS and Virtual Machine Scale Sets (VMSS) in Azure.

With the implementation of a managed node group, the cloud provider is responsible for managing the lifecycle of the individual machines that are used as nodes. This implies that a machine pool representation is needed which utilises a cloud provided solution to manage the lifecycle of machines.

For our example, GCP offers Node Pools that will manage the lifecycle of a pool of machines that can scale up and down. We can use this service to implement machine pools:

```go
type GCPManagedMachinePoolSpec struct {
	// Location specifies where the nodes should be created.
	Location []string `json:"location"`

	// The Kubernetes version for the node group.
	Version string `json:"version"`

	// MinNodeCount is the minimum number of nodes for one location.
	MinNodeCount int `json:"minNodeCount"`

	// MaxNodeCount is the maximum number of nodes for one location.
	MaxNodeCount int `json:"maxNodeCount"`

	...
}
```

### Provider Implementers Documentation

It's recommended that changes are made to the [Provider Implementers documentation](../book/src/developer/providers/cluster-infrastructure.md) based on the recommended approach for representing managed Kubernetes in Cluster API.

Some of the areas of change (this is not an exhaustive list):

- A new "implementing managed Kubernetes" guide that contains details about how to represent a managed Kubernetes service in CAPI. The content will be based on the recommendations from this proposal along with other considerations such as managed nodes and add-on management.
- Update the [Provider contracts documentation](../book/src/developer/providers/contracts.md) to state that the same kind should not be used to satisfy 2 different provider contracts.
- Update the [Cluster Infrastructure documentation](../book/src/developer/providers/cluster-infrastructure.md) to provide guidance on how to populate the `controlPlaneEndpoint` in the scenario where the control plane creates the API server load balancer. We should include sample code.
- Update the [Control Plane Controller](../book/src/developer/architecture/controllers/control-plane.md) diagram for the managed k8s services case. The control plane reconcile needs to start when `InfrastructureReady` is true.
- Updates based on the changes documented in the [Contract Changes to Support Managed Kubernetes CAEP](./20230407-flexible-managed-k8s-endpoints.md).

## Other Considerations for CAPI

### ClusterClass support for MachinePool

- MachinePool is an important feature for managed Kubernetes as it is preferred over MachineDeployment to fully utilize the native capabilities such as autoscaling, health-checking and zone balancing provided by cloud providers' node groups.
- AKS supports MachinePool-based worker nodes only, so ClusterClass support for MachinePool is required. See [issue](https://github.com/kubernetes-sigs/cluster-api/issues/5991).

### clusterctl integration

- `clusterctl` assumes a minimal set of providers (core, bootstrap, control plane, infra) is required to form a valid management cluster. Currently, it does not expect a single provider being many things at the same time.
- EKS in CAPA has its own control plane provider and a bootstrap provider packaged in a single manager. Moving forward, it would be great to separate them out.

### Add-ons management

- EKS and AKS provide the ability to install add-ons (e.g. CNI, CSI, DNS) managed by the cloud providers.
  - [EKS add-ons](https://docs.aws.amazon.com/eks/latest/userguide/eks-add-ons.html)
  - [AKS add-ons](https://docs.microsoft.com/en-us/azure/aks/integrations)
- CAPA and CAPZ enabled support for cloud provider managed add-ons via their APIs:
  - [CAPA](https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/main/controlplane/eks/api/v1beta1/awsmanagedcontrolplane_types.go#L155)
  - [CAPZ](https://github.com/kubernetes-sigs/cluster-api-provider-azure/pull/2095)
- Managed Kubernetes implementations should be able to opt in/out of what will be provided by [CAPI's add-ons orchestration solution](https://github.com/kubernetes-sigs/cluster-api/issues/5491).

## Alternatives

A number of different representations were also considered but discounted.

### Alternative 1: Single kind for Control Plane and Infrastructure

This option introduces a single new resource kind:

- **GCPManagedControlPlane**: this represents both a control plane (i.e. GKE) and the infrastructure required for the cluster. It contains properties for both the general cloud infrastructure (that would traditionally be represented by an infrastructure cluster) and the managed Kubernetes control plane (that would traditionally be represented by a control plane provider).

```go
type GCPManagedControlPlaneSpec struct {
	// Project is the name of the project to deploy the cluster to.
	Project string `json:"project"`

	// NetworkSpec encapsulates all things related to the GCP network.
	// +optional
	Network NetworkSpec `json:"network"`

	// AddonsConfig defines the addons to enable with the GKE cluster.
	// +optional
	AddonsConfig *AddonsConfig `json:"addonsConfig,omitempty"`

	// Logging contains the logging configuration for the GKE cluster.
	// +optional
	Logging *ControlPlaneLoggingSpec `json:"logging,omitempty"`

	// EnableKubernetesAlpha indicates that Kubernetes alpha features are enabled.
	// +optional
	EnableKubernetesAlpha bool `json:"enableKubernetesAlpha,omitempty"`

	// ControlPlaneEndpoint represents the endpoint used to communicate with the control plane.
	// +optional
	ControlPlaneEndpoint clusterv1.APIEndpoint `json:"controlPlaneEndpoint"`
	...
}
```

**This was the design pattern originally used for the EKS implementation in CAPA.**

#### Background: Why did EKS in CAPA choose this option?

CAPA decided to represent an EKS cluster as a CAPI control plane. This meant that the control plane is responsible for creating the API server load balancer.

Initially CAPA had an infrastructure cluster kind that reported back the control plane endpoint. This required less than ideal code in its controller to watch the control plane and use its value of the control plane endpoint.

As the infrastructure cluster kind only acted as a passthrough (to satisfy the contract with CAPI), it was decided that it would be removed and the control plane kind (AWSManagedControlPlane) could be used to satisfy both the "infrastructure" and "control-plane" contracts. _This worked well until ClusterClass arrived with its expectation that the "infrastructure" and "control-plane" are 2 different resource kinds._

(Note: the above italicized text is no longer relevant once the CAEP in https://github.com/kubernetes-sigs/cluster-api/pull/8500 is implemented.)

Note that CAPZ had a similar discussion and an [issue](https://github.com/kubernetes-sigs/cluster-api-provider-azure/issues/1396) to remove AzureManagedCluster: "AzureManagedCluster is useless; let's remove it (and keep AzureManagedControlPlane)".

**Pros**

- A simple design with a single resource kind and controller.

**Cons**

- Doesn't work with the current implementation of ClusterClass, which expects a separation of ControlPlane and Infrastructure.
- Doesn't provide separation of responsibilities between creating the general cloud infrastructure for the cluster and the actual cluster control plane.
- Managed Kubernetes looks different from unmanaged Kubernetes, where two separate kinds are used for a control plane and infrastructure. This would impact products building on top of CAPI.

### Alternative 2: Two kinds with a Managed Control Plane and Shared Infra Cluster with Better Separation of Responsibilities

This option is a variation of option 3, and as such it more closely follows the original separation of concerns with the different CAPI provider types. The difference with this option compared to option 3 is that only 1 new resource kind is introduced:

- **GCPManagedControlPlane**: this represents the actual GKE control plane in GCP. Its spec would only contain properties that are specific to the provisioning & management of GKE. It would not contain any properties related to the general GCP operating infrastructure, like the networking or project.

The general cluster infrastructure will be declared via the existing **GCPCluster** kind and reconciled via the existing controller.

However, this approach will require changes to the controller for **GCPCluster**. The steps to create the required infrastructure may be different between an unmanaged cluster and a GKE-based cluster. For example, for an unmanaged cluster a load balancer will need to be created, but with a GKE-based cluster this won't be needed and instead we'd need to use the endpoint created as part of **GCPManagedControlPlane** reconciliation.

So the **GCPCluster** controller will need to know if it's creating infrastructure for an unmanaged or managed cluster (probably by looking at the parent's (i.e. `Cluster`) **controlPlaneRef**) and perform different steps (see the sketch after the pros and cons below).

**Pros**

- Single infra cluster kind irrespective of whether you are creating an unmanaged or GKE-based cluster. It doesn't require the user to pick the right one.
- Clear separation between cluster infrastructure and the actual managed (i.e. GKE) control plane
- Works with ClusterClass

**Cons**

- Additional complexity and logic in the infra cluster controller
- API definition could be messy if only certain fields are required for one type of cluster
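
A minimal sketch of the branching the **GCPCluster** controller would need, assuming the illustrative kind name used throughout this proposal:

```go
package controllers

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// isManagedControlPlane reports whether the owning Cluster delegates its
// control plane to the managed (GKE) kind. The kind name is illustrative.
func isManagedControlPlane(cluster *clusterv1.Cluster) bool {
	cpRef := cluster.Spec.ControlPlaneRef
	return cpRef != nil && cpRef.Kind == "GCPManagedControlPlane"
}
```

When this returns true, the GCPCluster reconciler would, for example, skip creating the API server load balancer and instead wait for the endpoint produced by the **GCPManagedControlPlane** reconciliation.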

## Upgrade Strategy

As mentioned in the goals section, it is up to the providers with existing implementations, CAPA and CAPZ, to decide how they want to proceed.

- EKS and AKS are at different lifecycle stages: EKS is in GA vs. AKS in the experimental stage. This requires different considerations for making breaking changes to APIs.
- Their current designs differ, with EKS using option 1 and AKS using option 2. This makes it difficult to propose a single upgrade strategy.

## Implementation History

- [x] 03/01/2022: Had a community meeting to discuss an issue regarding ClusterClass support for EKS and managed k8s in CAPI
- [x] 03/17/2022: Compiled a Google Doc following the CAEP template
- [x] 04/20/2022: Presented the proposal at a community meeting
- [x] 07/27/2022: Moved the proposal to a PR in the CAPI repo
- [x] 06/15/2023: Updates as a result of the [Contract Changes to Support Managed Kubernetes CAEP](./20230407-flexible-managed-k8s-endpoints.md) and also updates as a result of the current state of managed k8s in CAPI