---
title: CAPI Provider Operator
authors:
  - "@fabriziopandini"
  - "@wfernandes"
reviewers:
  - "@vincepri"
  - "@ncdc"
  - "@justinsb"
  - "@detiber"
  - "@CecileRobertMichon"
creation-date: 2020-09-14
last-updated: 2021-01-20
status: implementable
see-also:
  https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20191016-clusterctl-redesign.md
---

# CAPI Provider operator

## Table of Contents

<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->

- [Glossary](#glossary)
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals/Future Work](#non-goalsfuture-work)
- [Proposal](#proposal)
  - [User Stories](#user-stories)
  - [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints)
  - [Clusterctl](#clusterctl)
    - [Existing API Types Changes](#existing-api-types-changes)
    - [New API Types](#new-api-types)
    - [Example API Usage](#example-api-usage)
    - [Operator Behaviors](#operator-behaviors)
      - [Installing a provider](#installing-a-provider)
      - [Upgrading a provider](#upgrading-a-provider)
      - [Upgrading providers without changing contract](#upgrading-providers-without-changing-contract)
      - [Upgrading providers and changing contract](#upgrading-providers-and-changing-contract)
      - [Changing a provider](#changing-a-provider)
      - [Deleting a provider](#deleting-a-provider)
    - [Upgrade from v1alpha3 management cluster to v1alpha4 cluster](#upgrade-from-v1alpha3-management-cluster-to-v1alpha4-cluster)
    - [Operator Lifecycle Management](#operator-lifecycle-management)
      - [Operator Installation](#operator-installation)
      - [Operator Upgrade](#operator-upgrade)
      - [Operator Delete](#operator-delete)
    - [Air gapped environment](#air-gapped-environment)
  - [Risks and Mitigation](#risks-and-mitigation)
    - [Error Handling & Logging](#error-handling--logging)
    - [Extensibility Options](#extensibility-options)
    - [Upgrade from v1alpha3 management cluster to v1alpha4/operator cluster](#upgrade-from-v1alpha3-management-cluster-to-v1alpha4operator-cluster)
- [Additional Details](#additional-details)
  - [Test Plan](#test-plan)
  - [Version Skew Strategy](#version-skew-strategy)
- [Implementation History](#implementation-history)
- [Controller Runtime Types](#controller-runtime-types)

<!-- END doctoc generated TOC please keep comment here to allow auto update -->

## Glossary

The lexicon used in this document is described in more detail
[here](https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/book/src/reference/glossary.md).
Any discrepancies should be rectified in the main Cluster API glossary.

## Summary

The clusterctl CLI currently handles the lifecycle of Cluster API
providers installed in a management cluster. It provides a great Day 0 and Day
1 experience in getting CAPI up and running. However, clusterctl's imperative
design makes it difficult for cluster admins to stand up and manage CAPI
management clusters in their own preferred way.

This proposal provides a solution that leverages a declarative API and an
operator to empower admins to handle the lifecycle of providers within the
management cluster.

The operator is developed in a separate repository [TBD] and will have its own
release cycle.

## Motivation

In its current form, clusterctl is designed to provide a simple user experience
for Day 1 operations of a Cluster API management cluster.

However, such a design is not optimized for supporting declarative approaches
when operating Cluster API management clusters.

These declarative approaches are important to enable GitOps workflows for
users who don't want to rely solely on the `clusterctl` CLI.

Providing a declarative API also enables us to leverage controller-runtime's
new component config, allowing us to configure the controller manager and even
the resource limits of the provider's deployment.

Another example is improving cluster upgrades. In order to upgrade a cluster
we currently need to supply all the information that was provided initially
during `clusterctl init`, which is inconvenient in many cases, such as
distributed teams and CI pipelines, where the configuration needs to be stored
and synced externally.

With the management cluster operator, we aim to address these use cases by
introducing an operator that handles the lifecycle of providers within the
management cluster based on a declarative API.

### Goals

- Define an API that enables declarative management of the lifecycle of
  Cluster API and all of its providers.
- Support air-gapped environments, initially through sufficient documentation.
- Identify and document differences, if any, between the clusterctl CLI and
  the operator in managing the lifecycle of providers.
- Define how the clusterctl CLI should be changed in order to interact with
  the management cluster operator in a transparent and effective way.
- Support the ability to upgrade from a v1alpha3 based version (v0.3.[TBD])
  of Cluster API to one managed by the operator.

### Non-Goals/Future Work

- `clusterctl` related changes will be implemented after core operator
  functionality is complete. For example, deprecating the `Provider` type and
  migrating to the new ones.
- `clusterctl` will not be deprecated or replaced with another CLI.
- Implement an operator driven version of `clusterctl move`.
- Manage cert-manager using the operator.
- Support multiple installations of the same provider within a management
  cluster in light of [issue 3042] and [issue 3354].
- Support any template processing engines.
- Support the installation of v1alpha3 providers using the operator.

## Proposal

### User Stories

1. As an admin, I want to use a declarative style API to operate the Cluster
   API providers in a management cluster.
1. As an admin, I would like to have an easy and declarative way to change
   controller settings (e.g. enabling pprof for debugging).
1. As an admin, I would like to have an easy and declarative way to change the
   resource requirements (e.g. limits and requests for a provider deployment).
1. As an admin, I would like to have the option to use the clusterctl CLI as
   of today, without being concerned about the operator.
1. As an admin, I would like to be able to install the operator using
   `kubectl apply`, without being forced to use clusterctl.

### Implementation Details/Notes/Constraints

### Clusterctl

The `clusterctl` CLI will provide a similar UX to the users whilst leveraging
the operator for the functions it can. As stated in the Goals/Non-Goals, the
move operation will not be driven by the operator but rather remain within the
CLI for now. However, this is an implementation detail and will not affect
users. The move operation and all other `clusterctl` refactoring will be
done after core operator functionality is implemented.

#### Existing API Types Changes

The existing `Provider` type used by the clusterctl CLI will be deprecated and
its instances will be migrated to instances of the new API types as defined in
the next section.

The management cluster operator will be responsible for migrating the existing
provider types to support GitOps workflows excluding `clusterctl`.

#### New API Types

These are the new API types being defined.

There are separate types for each provider type - Core, Bootstrap,
ControlPlane, and Infrastructure. However, since each type is similar, their
Spec and Status use the shared types `ProviderSpec` and `ProviderStatus`
respectively.

We will scope the CRDs to be namespaced. This will allow us to enforce
RBAC restrictions if needed. It also allows us to install multiple
versions of the controllers (grouped within namespaces) in the same
management cluster, although this scenario will not be supported natively in
the v1alpha4 iteration.

If you prefer to see how the API can be used instead of reading the type
definitions, feel free to jump to the
[Example API Usage section](#example-api-usage).

```golang
// CoreProvider is the Schema for the CoreProviders API
type CoreProvider struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    Spec   ProviderSpec   `json:"spec,omitempty"`
    Status ProviderStatus `json:"status,omitempty"`
}

// BootstrapProvider is the Schema for the BootstrapProviders API
type BootstrapProvider struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    Spec   ProviderSpec   `json:"spec,omitempty"`
    Status ProviderStatus `json:"status,omitempty"`
}

// ControlPlaneProvider is the Schema for the ControlPlaneProviders API
type ControlPlaneProvider struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    Spec   ProviderSpec   `json:"spec,omitempty"`
    Status ProviderStatus `json:"status,omitempty"`
}

// InfrastructureProvider is the Schema for the InfrastructureProviders API
type InfrastructureProvider struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    Spec   ProviderSpec   `json:"spec,omitempty"`
    Status ProviderStatus `json:"status,omitempty"`
}
```

Below you can find details about `ProviderSpec` and `ProviderStatus`, which are
shared among all the provider types - Core, Bootstrap, ControlPlane, and
Infrastructure.

```golang
// ProviderSpec defines the desired state of the Provider.
type ProviderSpec struct {
    // Version indicates the provider version.
    // +optional
    Version *string `json:"version,omitempty"`

    // Manager defines the properties that can be enabled on the controller manager for the provider.
    // +optional
    Manager ManagerSpec `json:"manager,omitempty"`

    // Deployment defines the properties that can be enabled on the deployment for the provider.
    // +optional
    Deployment *DeploymentSpec `json:"deployment,omitempty"`

    // SecretName is the name of the Secret providing the configuration
    // variables for the current provider instance, like e.g. credentials.
    // Such configurations will be used when creating or upgrading provider components.
    // The contents of the secret will be treated as immutable. If changes need
    // to be made, a new object can be created and the name should be updated.
    // The contents should be in the form of key:value. This secret must be in
    // the same namespace as the provider.
    // +optional
    SecretName *string `json:"secretName,omitempty"`

    // FetchConfig determines how the operator will fetch the components and metadata for the provider.
    // If nil, the operator will try to fetch components according to the default
    // embedded fetch configuration for the given kind and `ObjectMeta.Name`.
    // For example, the infrastructure name `aws` will fetch artifacts from
    // https://github.com/kubernetes-sigs/cluster-api-provider-aws/releases.
    // +optional
    FetchConfig *FetchConfiguration `json:"fetchConfig,omitempty"`

    // Paused prevents the operator from reconciling the provider. This can be
    // used when doing an upgrade or move action manually.
    // +optional
    Paused bool `json:"paused,omitempty"`
}

// ManagerSpec defines the properties that can be enabled on the controller manager for the provider.
type ManagerSpec struct {
    // ControllerManagerConfigurationSpec defines the desired state of GenericControllerManagerConfiguration.
    ctrlruntime.ControllerManagerConfigurationSpec `json:",inline"`

    // ProfilerAddress defines the bind address to expose the pprof profiler (e.g. localhost:6060).
    // Default empty, meaning the profiler is disabled.
    // Controller Manager flag is --profiler-address.
    // +optional
    ProfilerAddress *string `json:"profilerAddress,omitempty"`

    // MaxConcurrentReconciles is the maximum number of concurrent Reconciles
    // which can be run. Defaults to 10.
    // +optional
    MaxConcurrentReconciles *int `json:"maxConcurrentReconciles,omitempty"`

    // Verbosity sets the logs verbosity. Defaults to 1.
    // Controller Manager flag is --verbosity.
    // +optional
    Verbosity int `json:"verbosity,omitempty"`

    // Debug, if set, will override a set of fields with opinionated values for
    // a debugging session. (Verbosity=5, ProfilerAddress=localhost:6060)
    // +optional
    Debug bool `json:"debug,omitempty"`

    // FeatureGates define provider specific feature flags that will be passed
    // in as container args to the provider's controller manager.
    // Controller Manager flag is --feature-gates.
    FeatureGates map[string]bool `json:"featureGates,omitempty"`
}

// DeploymentSpec defines the properties that can be enabled on the Deployment for the provider.
type DeploymentSpec struct {
    // Number of desired pods. This is a pointer to distinguish between explicit zero and not specified. Defaults to 1.
    // +optional
    Replicas *int `json:"replicas,omitempty"`

    // NodeSelector is a selector which must be true for the pod to fit on a node.
    // Selector which must match a node's labels for the pod to be scheduled on that node.
    // More info: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
    // +optional
    NodeSelector map[string]string `json:"nodeSelector,omitempty"`

    // If specified, the pod's tolerations.
    // +optional
    Tolerations []corev1.Toleration `json:"tolerations,omitempty"`

    // If specified, the pod's scheduling constraints.
    // +optional
    Affinity *corev1.Affinity `json:"affinity,omitempty"`

    // List of containers specified in the Deployment.
    // +optional
    Containers []ContainerSpec `json:"containers"`
}

// ContainerSpec defines the properties available to override for each
// container in a provider deployment such as Image and Args to the container's
// entrypoint.
type ContainerSpec struct {
    // Name of the container. Cannot be updated.
    Name string `json:"name"`

    // Container Image Name
    // +optional
    Image *ImageMeta `json:"image,omitempty"`

    // Args represents extra provider specific flags that are not encoded as fields in this API.
    // Explicit controller manager properties defined in the `Provider.ManagerSpec`
    // will have higher precedence than those defined in `ContainerSpec.Args`.
    // For example, `ManagerSpec.SyncPeriod` will be used instead of the
    // container arg `--sync-period` if both are defined.
    // The same holds for `ManagerSpec.FeatureGates` and `--feature-gates`.
    // +optional
    Args map[string]string `json:"args,omitempty"`

    // List of environment variables to set in the container.
    // +optional
    Env []corev1.EnvVar `json:"env,omitempty"`

    // Compute resources required by this container.
    // +optional
    Resources *corev1.ResourceRequirements `json:"resources,omitempty"`
}

// ImageMeta allows to customize the image used
type ImageMeta struct {
    // Repository sets the container registry to pull images from.
    // +optional
    Repository *string `json:"repository,omitempty"`

    // Name allows to specify a name for the image.
    // +optional
    Name *string `json:"name,omitempty"`

    // Tag allows to specify a tag for the image.
    // +optional
    Tag *string `json:"tag,omitempty"`
}

// FetchConfiguration determines the way to fetch the components and metadata for the provider.
type FetchConfiguration struct {
    // URL to be used for fetching the provider's components and metadata from a remote Github repository.
    // For example, https://github.com/{owner}/{repository}/releases
    // The version of the release will be `ProviderSpec.Version` if defined,
    // otherwise the `latest` version will be computed and used.
    // +optional
    URL *string `json:"url,omitempty"`

    // Selector to be used for fetching the provider's components and metadata from
    // ConfigMaps stored inside the cluster. Each ConfigMap is expected to contain
    // components and metadata for a specific version only.
    // +optional
    Selector *metav1.LabelSelector `json:"selector,omitempty"`
}

// ProviderStatus defines the observed state of the Provider.
type ProviderStatus struct {
    // Contract will contain the core provider contract that the provider is
    // abiding by, like e.g. v1alpha3.
    // +optional
    Contract *string `json:"contract,omitempty"`

    // Conditions define the current service state of the cluster.
    // +optional
    Conditions Conditions `json:"conditions,omitempty"`

    // ObservedGeneration is the latest generation observed by the controller.
    // +optional
    ObservedGeneration int64 `json:"observedGeneration,omitempty"`
}
```

**Validation and defaulting rules for Provider and ProviderSpec**
- The `Name` field within `metav1.ObjectMeta` could be any valid Kubernetes
  name; however, it is recommended to use Cluster API provider names, for
  example aws, vsphere, kubeadm. These names will be used to fetch the
  default configurations in case there is no specific FetchConfiguration
  defined.
- `ProviderSpec.Version` should be a valid default version with the "v" prefix
  as commonly used in the Kubernetes ecosystem; if this value is nil when a
  new provider is created, the operator will determine the version to use by
  applying the same rules implemented in clusterctl (latest).
  Once the latest version is calculated, it will be set in
  `ProviderSpec.Version`.
- Note: As per discussion in the CAEP PR, we will keep the `SecretName` field
  to allow the provider authors ample time to implement their own credential
  management to support multiple workload clusters. [See this thread for more
  info][secret-name-discussion].

**Validation rules for ProviderSpec.FetchConfiguration**
- If the FetchConfiguration is empty or not defined, the operator will
  apply the embedded fetch configuration for the given kind and
  `ObjectMeta.Name`. For example, the infrastructure name `aws` will fetch
  artifacts from
  https://github.com/kubernetes-sigs/cluster-api-provider-aws/releases.
- If FetchConfiguration is not nil, exactly one of `URL` or `Selector` must be
  specified.
- `FetchConfiguration.Selector` is used to fetch the provider's components and
  metadata from ConfigMaps stored inside the cluster. Each ConfigMap is
  expected to contain components and metadata for a specific version only. So
  if multiple versions of the providers need to be specified, they can be
  added as separate ConfigMaps and labeled with the same selector. This
  provides the same behavior as the "local" provider repositories, but now from
  within the management cluster.
- `FetchConfiguration` is used only during init and upgrade operations.
  Changes made to the contents of `FetchConfiguration` will not trigger a
  reconciliation. This is similar behavior to `ProviderSpec.SecretName`.

**Validation Rules for ProviderSpec.ManagerSpec**
- The ControllerManagerConfigurationSpec is a type from
  `controller-runtime/pkg/config` and is embedded into the `ManagerSpec`.
  This type exposes the LeaderElection, SyncPeriod, Webhook, Health and
  Metrics configurations.
- If `ManagerSpec.Debug` is set to true, the operator will not allow changes
  to other properties since it is in Debug mode.
- If you need to set specific concurrency values for each reconcile loop (e.g.
  `awscluster-concurrency`), you can leave
  `ManagerSpec.MaxConcurrentReconciles` nil and use `Container.Args`.
- If `ManagerSpec.MaxConcurrentReconciles` is set and a specific concurrency
  flag such as `awscluster-concurrency` is set on the `Container.Args`, then
  the more specific concurrency flag will have higher precedence (see the
  sketch below).
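
As an illustration of the two concurrency rules above, consider the following
sketch. It is not part of the proposal's examples: it reuses the aws provider
from the [Example API Usage](#example-api-usage) section, and the concurrency
values are arbitrary.

```yaml
apiVersion: management.cluster.x-k8s.io/v1alpha1
kind: InfrastructureProvider
metadata:
  name: aws
  namespace: capa-system
spec:
  version: v0.6.0
  manager:
    # Default applied to every reconcile loop of this controller manager.
    maxConcurrentReconciles: 5
  deployment:
    containers:
    - name: manager
      args:
        # More specific than maxConcurrentReconciles, so it wins for the
        # AWSCluster reconcile loop; the other loops keep the value 5.
        awscluster-concurrency: 12
```

Leaving `maxConcurrentReconciles` unset and specifying only the per-controller
args is equally valid, as noted above.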

**Validation Rules for ContainerSpec**
- The `ContainerSpec.Args` will ignore the key `namespace` since the operator
  enforces a deployment model where all the providers should be configured to
  watch all the namespaces.
- Explicit controller manager properties defined in the `Provider.ManagerSpec`
  will have higher precedence than those defined in `ContainerSpec.Args`. That
  is, if `ManagerSpec.SyncPeriod` is defined, it will be used instead of the
  container arg `sync-period`. The same holds for `ManagerSpec.FeatureGates`,
  which will take precedence over the container arg `feature-gates`.
- If no `ContainerSpec.Resources` are defined, the defaults on the Deployment
  object within the provider's components yaml will be used.

#### Example API Usage

1. As an admin, I want to install the aws infrastructure provider with
   specific controller flags.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: aws-variables
  namespace: capa-system
type: Opaque
data:
  AWS_REGION: ...
  AWS_ACCESS_KEY_ID: ...
  AWS_SECRET_ACCESS_KEY: ...
---
apiVersion: management.cluster.x-k8s.io/v1alpha1
kind: InfrastructureProvider
metadata:
  name: aws
  namespace: capa-system
spec:
  version: v0.6.0
  secretName: aws-variables
  manager:
    # These are top-level controller manager flags, supported by all the providers.
    # These flags come with sensible defaults, thus requiring no or minimal
    # changes for the most common scenarios.
    metricsAddress: ":8181"
    syncPeriod: 660
  fetchConfig:
    url: https://github.com/kubernetes-sigs/cluster-api-provider-aws/releases
  deployment:
    containers:
    - name: manager
      args:
        # These are controller flags that are specific to a provider; usage
        # is reserved for advanced scenarios only.
        awscluster-concurrency: 12
        awsmachine-concurrency: 11
```

2. As an admin, I want to install the aws infrastructure provider but override
   the container image of the CAPA deployment.

```yaml
---
apiVersion: management.cluster.x-k8s.io/v1alpha1
kind: InfrastructureProvider
metadata:
  name: aws
  namespace: capa-system
spec:
  version: v0.6.0
  secretName: aws-variables
  deployment:
    containers:
    - name: manager
      image: gcr.io/myregistry/capa-controller:v0.6.0-foo
```

3. As an admin, I want to change the resource limits for the manager pod in
   my control plane provider deployment.

```yaml
---
apiVersion: management.cluster.x-k8s.io/v1alpha1
kind: ControlPlaneProvider
metadata:
  name: kubeadm
  namespace: capi-kubeadm-control-plane-system
spec:
  version: v0.3.10
  secretName: capi-variables
  deployment:
    containers:
    - name: manager
      resources:
        limits:
          cpu: 100m
          memory: 30Mi
        requests:
          cpu: 100m
          memory: 20Mi
```

4. As an admin, I would like to fetch my azure provider components from a
   specific repository which is not the default.

```yaml
---
apiVersion: management.cluster.x-k8s.io/v1alpha1
kind: InfrastructureProvider
metadata:
  name: myazure
  namespace: capz-system
spec:
  version: v0.4.9
  secretName: azure-variables
  fetchConfig:
    url: https://github.com/myorg/awesome-azure-provider/releases
```

5. As an admin, I would like to use the default fetch configurations by
   simply specifying the expected Cluster API provider names such as 'aws',
   'vsphere', 'azure', 'kubeadm', 'talos', or 'cluster-api' instead of having
   to explicitly specify the fetch configuration.
   In the example below, since we are using 'vsphere' as the name of the
   InfrastructureProvider, the operator will fetch its configuration from
   `url: https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/releases`
   by default.

   See more examples in the [air-gapped environment section](#air-gapped-environment).

```yaml
---
apiVersion: management.cluster.x-k8s.io/v1alpha1
kind: InfrastructureProvider
metadata:
  name: vsphere
  namespace: capv-system
spec:
  version: v0.4.9
  secretName: vsphere-variables
```

#### Operator Behaviors

##### Installing a provider

In order to install a new Cluster API provider with the management cluster
operator you have to create a provider object, as shown above. See the first
example in [Example API Usage](#example-api-usage) for how to create the
secret with variables and the provider itself.

When processing a Provider object, the operator will apply the following rules.

- Providers with `spec.Type == CoreProvider` will be installed first; the
  other providers will be requeued until the core provider exists.
- Before installing any provider, the following preflight checks will be
  executed:
  - There should not be another instance of the same provider (same Kind, same
    name) in any namespace.
  - The Cluster API contract the provider is abiding by, e.g. v1alpha4, must
    match the contract of the core provider.
- The operator will set conditions on the Provider object to surface any
  installation issues such as pre-flight checks and/or order of installation
  to accurately inform the user.
- If the FetchConfiguration is empty or not defined, the operator will
  apply the embedded fetch configuration for the given kind and
  `ObjectMeta.Name`. In this case, the operator will fetch artifacts from
  https://github.com/kubernetes-sigs/cluster-api-provider-aws/releases.

The installation process managed by the operator is consistent with the
implementation underlying the `clusterctl init` command and includes the
following steps:
- Fetching the provider artifacts (the components yaml and the metadata.yaml
  file).
- Applying image overrides, if any.
- Replacing variables in the infrastructure-components from EnvVar and
  Secret.
- Applying the resulting yaml to the cluster.

As a final consideration, please note that:
- The operator executes the installation of one provider at a time, while
  `clusterctl init` manages the installation of a group of providers with a
  single operation.
- `clusterctl init` uses environment variables and a local configuration file,
  while the operator uses a Secret; given that we want users to preserve
  current behaviour in clusterctl, the init operation should be modified to
  transfer the local configuration to the cluster.
  As part of `clusterctl init`, it will obtain the list of variables required
  by the provider components and read the corresponding values from the config
  or environment variables to build the secret.
  Any image overrides defined in the clusterctl config will also be applied to
  the provider's components.
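
Because of the ordering rule above, a declarative setup typically declares a
`CoreProvider` object next to the other provider objects. The following is a
minimal, illustrative sketch only: the namespace and version are assumptions,
and the API group is the one used in the examples above.

```yaml
---
apiVersion: management.cluster.x-k8s.io/v1alpha1
kind: CoreProvider
metadata:
  name: cluster-api
  namespace: capi-system
spec:
  # Illustrative version only; if omitted, the operator picks the latest
  # version, following the same rules implemented in clusterctl.
  version: v0.3.10
```

Any Bootstrap, ControlPlane, or Infrastructure provider applied before this
object exists is simply requeued until the core provider is installed.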

In the following figure, the controllers for the providers are installed in
the namespaces that are defined by default.

<div align="center">Installing providers in defined namespaces</div>
<br/>

In the following figure, the controllers for the providers are all installed in
the same namespace as configured by the user.

<div align="center">Installing all providers in the same namespace</div>
<br/>

##### Upgrading a provider

In order to trigger an upgrade to a new Cluster API provider version, you have
to change the `spec.Version` field.

Upgrading a provider in the management cluster must abide by the golden rule
that all the providers should respect the same Cluster API contract supported
by the core provider.

##### Upgrading providers without changing contract

If the new version of the provider abides by the same version of the
Cluster API contract, the operator will execute the upgrade by performing:
- Deletion of the current instance of the provider components, while preserving
  CRDs, namespaces and user objects.
- Installation of the new version of the provider components.

Please note that:
- The operator executes upgrades one provider at a time, while `clusterctl
  upgrade apply` manages upgrading a group of providers with a single
  operation.
- `clusterctl upgrade apply --contract` automatically determines the latest
  versions available for each provider, while with the declarative approach
  the user is responsible for manually editing the Provider objects yaml.
- `clusterctl upgrade apply` currently uses environment variables and a local
  configuration file; this should be changed in order to use in-cluster
  provider configurations.

<div align="center">Upgrading providers without changing contract</div>
<br/>
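
For example, upgrading the aws provider from the earlier examples within the
same contract only requires bumping `spec.Version`; in this hedged sketch the
target version is purely illustrative:

```yaml
---
apiVersion: management.cluster.x-k8s.io/v1alpha1
kind: InfrastructureProvider
metadata:
  name: aws
  namespace: capa-system
spec:
  # Changing only the version triggers the upgrade flow described above: the
  # current provider components are deleted (CRDs, namespaces and user objects
  # are preserved) and the new components are installed.
  version: v0.6.1
  secretName: aws-variables
```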

##### Upgrading providers and changing contract

If the new version of the provider abides by a new version of the Cluster
API contract, it is required to ensure that all the other providers in the
management cluster get the new version too.

<div align="center">Upgrading providers and changing contract</div>
<br/>

As a first step, it is required to pause all the providers by setting the
`spec.Paused` field to true for each provider; the operator will block any
contract upgrade until all the providers are paused.

After all the providers are in the paused state, you can proceed with the
upgrade as described in the previous paragraph (change the `spec.Version`
field).

When a provider is paused, the number of replicas will be scaled to 0; the
operator will add a new
`management.cluster.x-k8s.io/original-controller-replicas` annotation to store
the original replica count.

Once all the providers are upgraded to a version that abides by the new
contract, it is possible for the operator to unpause providers; the operator
does not allow unpausing providers if there are still providers abiding by
the old contract.

Please note that we are planning to embed this sequence (pause - upgrade -
unpause) as a part of the `clusterctl upgrade apply` command when there is a
contract change.

##### Changing a provider

On top of changing a provider version (upgrades), the operator also supports
changing other provider fields, most notably controller flags and variables.
This can be achieved by either `kubectl edit` or `kubectl apply` to the
provider object.

The operation internally works like an upgrade: the current instance of the
provider is deleted, while preserving CRDs, namespaces and user objects, and a
new instance of the provider is installed with the new set of flags/variables.

Please note that clusterctl currently does not support this operation.

See Example 1 in [Example API Usage](#example-api-usage).

##### Deleting a provider

In order to delete a provider you have to delete the corresponding provider
object.

Deletion of the provider will be blocked if any workload cluster using the
provider still exists.

Additionally, deletion of a core provider should be blocked if there are still
other providers in the management cluster.

#### Upgrade from v1alpha3 management cluster to v1alpha4 cluster

Cluster API will provide instructions on how to upgrade from a v1alpha3
management cluster, created by clusterctl, to the new v1alpha4 management
cluster. These operations could require manual actions.

Some of the actions are described below:
- Run webhooks as part of the main manager. See [issue 3822].

More details will be added as we better understand what a v1alpha4 cluster
will look like.

#### Operator Lifecycle Management

##### Operator Installation

- During the first phase of implementation, `clusterctl` won't provide support
  for managing the operator, so the admin will have to install it manually
  using `kubectl apply` (or similar solutions) with the operator yaml that
  will be published in the operator subproject release artifacts.
- In the future, `clusterctl init` will install the operator and its
  corresponding CRDs as a prerequisite if the operator doesn't already exist.
  Please note that this command will consider image overrides defined in the
  local clusterctl config file.

##### Operator Upgrade

- During the first phase of implementation, `clusterctl` operations will not
  be supported and the admin will have to upgrade the operator manually; if
  the admin doesn't want to use clusterctl, they can use `kubectl apply` (or
  similar solutions) with the latest version of the operator yaml that will be
  published in the operator subproject release artifacts.
- The transition between a manually managed operator and a clusterctl managed
  operator will be documented later as we progress with the implementation.
- In the future, the admin will be able to use `clusterctl upgrade operator`
  to upgrade the operator components. Please note that this command will
  consider image overrides defined in the local clusterctl config file. Other
  commands such as `clusterctl upgrade apply` will also allow upgrading the
  operator.
- `clusterctl upgrade plan` will identify when the operator can be upgraded by
  checking the cluster-api release artifacts.
- clusterctl will require a matching operator version. In the future, when
  clusterctl moves to beta/GA, we will reconsider supporting version skew
  between clusterctl and the operator.

##### Operator Delete

- During the first phase of implementation, `clusterctl` operations will not
  be supported and the admin will have to delete the operator manually using
  `kubectl delete` (or similar solutions).
  However, it's the admin's responsibility to verify that there
  are no providers running in the management cluster.
- In the future, clusterctl will delete the operator as part of
  the `clusterctl delete --all` command.

#### Air gapped environment

In order to install Cluster API providers in an air-gapped environment using
the operator, it is required to address the following issues.

1. Make the operator work in an air-gapped environment
   - To provide image overrides for the operator itself in order to pull the
     images from an accessible image repository. Please note that the image
     overrides defined in the local clusterctl config file will be taken into
     account.
   - TBD if the operator yaml will be embedded in clusterctl or if it should
     be a special artifact within the core provider repository.
1. Make the providers work in an air-gapped environment
   - To provide a fetch configuration for each provider reading from an
     accessible location (e.g. an internal github repository) or from
     ConfigMaps pre-created inside the cluster.
   - To provide image overrides for each provider in order to pull the images
     from an accessible image repository.

**Example Usage**

As an admin, I would like to fetch my azure provider components from within
the cluster because I'm working within an air-gapped environment.

In this example, we have two ConfigMaps that define the components and
metadata of the provider. They each share the label `provider-components:
azure` and are within the `capz-system` namespace.

The azure InfrastructureProvider has a `fetchConfig` which specifies the label
selector. This way the operator knows which versions of the azure provider are
available. Since the provider's version is marked as `v0.4.9`, it uses the
components information from the corresponding ConfigMap to install the azure
provider.

```yaml
---
apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    provider-components: azure
  name: v0.4.9
  namespace: capz-system
data:
  components: |
    # components for v0.4.9 yaml goes here
  metadata: |
    # metadata information goes here
---
apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    provider-components: azure
  name: v0.4.8
  namespace: capz-system
data:
  components: |
    # components for v0.4.8 yaml goes here
  metadata: |
    # metadata information goes here
---
apiVersion: management.cluster.x-k8s.io/v1alpha1
kind: InfrastructureProvider
metadata:
  name: azure
  namespace: capz-system
spec:
  version: v0.4.9
  secretName: azure-variables
  fetchConfig:
    selector:
      matchLabels:
        provider-components: azure
```

### Risks and Mitigation

#### Error Handling & Logging

Currently, clusterctl provides quick feedback regarding required variables,
etc. With the operator in place we'll need to ensure that the error messages
and logs are easily available to the user to verify progress.

#### Extensibility Options

Currently, clusterctl has a few extensibility options. For example,
clusterctl is built on top of a library that can be leveraged to build other
tools.

It also exposes an interface for template processing if we choose to go a
different route from `envsubst`.
This may prove to be challenging in the
context of the operator, as this would mean a change to the operator
binary/image. We could introduce a new behavior, communication protocol, or
hooks for the operator to interact with a custom template processor. This
could be configured similarly to the fetch config, with multiple options built
in.

We have decided that supporting multiple template processors is a non-goal for
this implementation of the proposal and we will rely on the default
`envsubst` template processor.

#### Upgrade from v1alpha3 management cluster to v1alpha4/operator cluster

As of today, this is hard to define as we have yet to understand what a
v1alpha4 cluster will be. Once we better understand what a v1alpha4
cluster will look like, we will be able to determine the upgrade sequence
from v1alpha3.

Cluster API will provide instructions on how to upgrade from a v1alpha3
management cluster, created by clusterctl, to the new v1alpha4 management
cluster. These operations could require manual actions.

Some of the actions are described below:
- Run webhooks as part of the main manager. See
  [issue 3822](https://github.com/kubernetes-sigs/cluster-api/issues/3822).

## Additional Details

### Test Plan

The operator will be written with unit and integration tests using envtest and
existing patterns as defined under the [Developer
Guide/Testing](https://cluster-api.sigs.k8s.io/developer/testing.html) section
in the Cluster API book.

Existing E2E tests will verify that existing clusterctl commands such as `init`
and `upgrade` work as expected. Any necessary changes will be made in
order to make them configurable.

New E2E tests verifying the operator lifecycle itself will be added.

New E2E tests verifying the upgrade from a v1alpha3 to a v1alpha4 cluster will
be added.

### Version Skew Strategy

- clusterctl will require a matching operator version. In the future, when
  clusterctl moves to beta/GA, we will reconsider supporting version skew
  between clusterctl and the operator.

## Implementation History

- [x] 09/09/2020: Proposed idea in an issue or [community meeting]
- [x] 09/14/2020: Compile a [Google Doc following the CAEP template][management cluster operator caep]
- [x] 09/14/2020: First round of feedback from community
- [x] 10/07/2020: Present proposal at a [community meeting]
- [ ] 10/20/2020: Open proposal PR

## Controller Runtime Types

These types are pulled from [controller-runtime][controller-runtime-code-ref]
and [component-base][components-base-code-ref]. They are used as part of the
`ManagerSpec` and are duplicated here for convenience.

```golang
// ControllerManagerConfigurationSpec defines the desired state of GenericControllerManagerConfiguration
type ControllerManagerConfigurationSpec struct {
    // SyncPeriod determines the minimum frequency at which watched resources are
    // reconciled. A lower period will correct entropy more quickly, but reduce
    // responsiveness to change if there are many watched resources. Change this
    // value only if you know what you are doing. Defaults to 10 hours if unset.
    // There will be a 10 percent jitter between the SyncPeriod of all controllers
    // so that all controllers will not send list requests simultaneously.
    // +optional
    SyncPeriod *metav1.Duration `json:"syncPeriod,omitempty"`

    // LeaderElection is the LeaderElection config to be used when configuring
    // the manager.Manager leader election.
    // +optional
    LeaderElection *configv1alpha1.LeaderElectionConfiguration `json:"leaderElection,omitempty"`

    // CacheNamespace, if specified, restricts the manager's cache to watch objects in
    // the desired namespace. Defaults to all namespaces.
    //
    // Note: If a namespace is specified, controllers can still Watch for a
    // cluster-scoped resource (e.g Node). For namespaced resources the cache
    // will only hold objects from the desired namespace.
    // +optional
    CacheNamespace string `json:"cacheNamespace,omitempty"`

    // GracefulShutdownTimeout is the duration given to runnables to stop before the manager actually returns on stop.
    // To disable graceful shutdown, set to time.Duration(0).
    // To use graceful shutdown without timeout, set to a negative duration, e.g. time.Duration(-1).
    // The graceful shutdown is skipped for safety reasons in case the leader election lease is lost.
    GracefulShutdownTimeout *metav1.Duration `json:"gracefulShutDown,omitempty"`

    // Metrics contains the controller metrics configuration.
    // +optional
    Metrics ControllerMetrics `json:"metrics,omitempty"`

    // Health contains the controller health configuration.
    // +optional
    Health ControllerHealth `json:"health,omitempty"`

    // Webhook contains the controllers webhook configuration.
    // +optional
    Webhook ControllerWebhook `json:"webhook,omitempty"`
}

// ControllerMetrics defines the metrics configs.
type ControllerMetrics struct {
    // BindAddress is the TCP address that the controller should bind to
    // for serving prometheus metrics.
    // It can be set to "0" to disable the metrics serving.
    // +optional
    BindAddress string `json:"bindAddress,omitempty"`
}

// ControllerHealth defines the health configs.
type ControllerHealth struct {
    // HealthProbeBindAddress is the TCP address that the controller should bind to
    // for serving health probes.
    // +optional
    HealthProbeBindAddress string `json:"healthProbeBindAddress,omitempty"`

    // ReadinessEndpointName, defaults to "readyz".
    // +optional
    ReadinessEndpointName string `json:"readinessEndpointName,omitempty"`

    // LivenessEndpointName, defaults to "healthz".
    // +optional
    LivenessEndpointName string `json:"livenessEndpointName,omitempty"`
}

// ControllerWebhook defines the webhook server for the controller.
type ControllerWebhook struct {
    // Port is the port that the webhook server serves at.
    // It is used to set webhook.Server.Port.
    // +optional
    Port *int `json:"port,omitempty"`

    // Host is the hostname that the webhook server binds to.
    // It is used to set webhook.Server.Host.
    // +optional
    Host string `json:"host,omitempty"`

    // CertDir is the directory that contains the server key and certificate.
    // If not set, the webhook server would look up the server key and certificate in
    // {TempDir}/k8s-webhook-server/serving-certs. The server key and certificate
    // must be named tls.key and tls.crt, respectively.
    // +optional
    CertDir string `json:"certDir,omitempty"`
}

// LeaderElectionConfiguration defines the configuration of leader election
// clients for components that can run with leader election enabled.
type LeaderElectionConfiguration struct {
    // leaderElect enables a leader election client to gain leadership
    // before executing the main loop. Enable this when running replicated
    // components for high availability.
    LeaderElect *bool `json:"leaderElect"`
    // leaseDuration is the duration that non-leader candidates will wait
    // after observing a leadership renewal until attempting to acquire
    // leadership of a led but unrenewed leader slot. This is effectively the
    // maximum duration that a leader can be stopped before it is replaced
    // by another candidate. This is only applicable if leader election is
    // enabled.
    LeaseDuration metav1.Duration `json:"leaseDuration"`
    // renewDeadline is the interval between attempts by the acting master to
    // renew a leadership slot before it stops leading. This must be less
    // than or equal to the lease duration. This is only applicable if leader
    // election is enabled.
    RenewDeadline metav1.Duration `json:"renewDeadline"`
    // retryPeriod is the duration the clients should wait between attempting
    // acquisition and renewal of a leadership. This is only applicable if
    // leader election is enabled.
    RetryPeriod metav1.Duration `json:"retryPeriod"`
    // resourceLock indicates the resource object type that will be used to lock
    // during leader election cycles.
    ResourceLock string `json:"resourceLock"`
    // resourceName indicates the name of the resource object that will be used to lock
    // during leader election cycles.
    ResourceName string `json:"resourceName"`
    // resourceNamespace indicates the namespace of the resource object that will be used to lock
    // during leader election cycles.
    ResourceNamespace string `json:"resourceNamespace"`
}
```

<!-- Links -->
[community meeting]: https://docs.google.com/document/d/1Ys-DOR5UsgbMEeciuG0HOgDQc8kZsaWIWJeKJ1-UfbY
[management cluster operator caep]: https://docs.google.com/document/d/1fQNlqsDkvEggWFi51GVxOglL2P1Bvo2JhZlMhm2d-Co/edit#
[controller-runtime-code-ref]: https://github.com/kubernetes-sigs/controller-runtime/blob/5c2b42d0dfe264fe1a187dcb11f384c0d193c042/pkg/config/v1alpha1/types.go
[components-base-code-ref]: https://github.com/kubernetes/component-base/blob/3b346c3e81285da5524c9379262ad4ca327b3c75/config/v1alpha1/types.go
[issue 3042]: https://github.com/kubernetes-sigs/cluster-api/issues/3042
[issue 3354]: https://github.com/kubernetes-sigs/cluster-api/issues/3354
[issue 3822]: https://github.com/kubernetes-sigs/cluster-api/issues/3822
[secret-name-discussion]: https://github.com/kubernetes-sigs/cluster-api/pull/3833#discussion_r540576353