---
title: CAPI Provider Operator
authors:
  - "@fabriziopandini"
  - "@wfernandes"
reviewers:
  - "@vincepri"
  - "@ncdc"
  - "@justinsb"
  - "@detiber"
  - "@CecileRobertMichon"
creation-date: 2020-09-14
last-updated: 2021-01-20
status: implementable
see-also:
https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20191016-clusterctl-redesign.md
---

# CAPI Provider operator

## Table of Contents

<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->

- [Glossary](#glossary)
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals/Future Work](#non-goalsfuture-work)
- [Proposal](#proposal)
  - [User Stories](#user-stories)
  - [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints)
  - [Clusterctl](#clusterctl)
    - [Existing API Types Changes](#existing-api-types-changes)
    - [New API Types](#new-api-types)
    - [Example API Usage](#example-api-usage)
    - [Operator Behaviors](#operator-behaviors)
      - [Installing a provider](#installing-a-provider)
      - [Upgrading a provider](#upgrading-a-provider)
      - [Upgrades providers without changing contract](#upgrades-providers-without-changing-contract)
      - [Upgrades providers and changing contract](#upgrades-providers-and-changing-contract)
      - [Changing a provider](#changing-a-provider)
      - [Deleting a provider](#deleting-a-provider)
    - [Upgrade from v1alpha3 management cluster to v1alpha4 cluster](#upgrade-from-v1alpha3-management-cluster-to-v1alpha4-cluster)
    - [Operator Lifecycle Management](#operator-lifecycle-management)
      - [Operator Installation](#operator-installation)
      - [Operator Upgrade](#operator-upgrade)
      - [Operator Delete](#operator-delete)
    - [Air gapped environment](#air-gapped-environment)
  - [Risks and Mitigation](#risks-and-mitigation)
    - [Error Handling & Logging](#error-handling--logging)
    - [Extensibility Options](#extensibility-options)
    - [Upgrade from v1alpha3 management cluster to v1alpha4/operator cluster](#upgrade-from-v1alpha3-management-cluster-to-v1alpha4operator-cluster)
- [Additional Details](#additional-details)
  - [Test Plan](#test-plan)
  - [Version Skew Strategy](#version-skew-strategy)
- [Implementation History](#implementation-history)
- [Controller Runtime Types](#controller-runtime-types)

<!-- END doctoc generated TOC please keep comment here to allow auto update -->

## Glossary

The lexicon used in this document is described in more detail
[here](https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/book/src/reference/glossary.md).
Any discrepancies should be rectified in the main Cluster API glossary.

## Summary

The clusterctl CLI currently handles the lifecycle of Cluster API
providers installed in a management cluster. It provides a great Day 0 and Day
1 experience in getting CAPI up and running. However, clusterctl’s imperative
design makes it difficult for cluster admins to stand up and manage CAPI
management clusters in their own preferred way.

This proposal provides a solution that leverages a declarative API and an
operator to empower admins to handle the lifecycle of providers within the
management cluster.

The operator is developed in a separate repository [TBD] and will have its own release cycle.

## Motivation

In its current form, clusterctl is designed to provide a simple user experience
for day 1 operations of a Cluster API management cluster.

However, such a design is not optimized for supporting declarative approaches
when operating Cluster API management clusters.

These declarative approaches are important for enabling GitOps workflows when
users don't want to rely solely on the `clusterctl` CLI.

Providing a declarative API also enables us to leverage controller-runtime's
new component config, allowing us to configure the controller manager and even
the resource limits of the provider's deployment.

Another example is improving cluster upgrades. In order to upgrade a cluster,
we currently need to supply all the information that was provided initially
during `clusterctl init`, which is inconvenient in many cases, such as
distributed teams and CI pipelines where the configuration needs to be stored
and synced externally.

With the management cluster operator, we aim to address these use cases by
introducing an operator that handles the lifecycle of providers within the
management cluster based on a declarative API.

### Goals

- Define an API that enables declarative management of the lifecycle of
  Cluster API and all of its providers.
- Support air-gapped environments, initially through sufficient documentation.
- Identify and document any differences between the clusterctl CLI and the
  operator in managing the lifecycle of providers.
- Define how the clusterctl CLI should be changed in order to interact with
  the management cluster operator in a transparent and effective way.
- Support the ability to upgrade from a v1alpha3 based version (v0.3.[TBD])
  of Cluster API to one managed by the operator.

### Non-Goals/Future Work

- `clusterctl` related changes will be implemented after core operator functionality
  is complete. For example, deprecating the `Provider` type and migrating to the new ones.
- `clusterctl` will not be deprecated or replaced with another CLI.
- Implement an operator driven version of `clusterctl move`.
- Manage cert-manager using the operator.
- Support multiple installations of the same provider within a management
  cluster in light of [issue 3042] and [issue 3354].
- Support any template processing engines.
- Support the installation of v1alpha3 providers using the operator.

## Proposal

### User Stories

1. As an admin, I want to use a declarative style API to operate the Cluster
   API providers in a management cluster.
1. As an admin, I would like to have an easy and declarative way to change
   controller settings (e.g. enabling pprof for debugging).
1. As an admin, I would like to have an easy and declarative way to change the
   resource requirements (e.g. limits and requests for a provider
   deployment).
1. As an admin, I would like to have the option to use the clusterctl CLI as I
   do today, without being concerned about the operator.
1. As an admin, I would like to be able to install the operator using kubectl
   apply, without being forced to use clusterctl.

### Implementation Details/Notes/Constraints

### Clusterctl

The `clusterctl` CLI will provide a similar UX to users whilst leveraging
the operator for the functions it can. As stated in the Goals/Non-Goals, the
move operation will not be driven by the operator but will rather remain within
the CLI for now. However, this is an implementation detail and will not affect
users. The move operation and all other `clusterctl` refactoring will be
done after core operator functionality is implemented.

#### Existing API Types Changes

The existing `Provider` type used by the clusterctl CLI will be deprecated, and
its instances will be migrated to instances of the new API types defined in
the next section.

The management cluster operator will be responsible for migrating the existing
provider objects, so that GitOps workflows can be supported without requiring `clusterctl`.

#### New API Types

These are the new API types being defined.

There are separate types for each provider type: Core, Bootstrap,
ControlPlane, and Infrastructure. However, since the types are similar, their
Spec and Status use the shared types `ProviderSpec` and `ProviderStatus`,
respectively.

We will scope the CRDs to be namespaced. This will allow us to enforce
RBAC restrictions if needed. It also allows us to install multiple
versions of the controllers (grouped within namespaces) in the same
management cluster, although this scenario will not be supported natively in
the v1alpha4 iteration.

If you prefer to see how the API can be used instead of reading the type
definitions, feel free to jump to the [Example API Usage
section](#example-api-usage).

```golang
// CoreProvider is the Schema for the CoreProviders API
type CoreProvider struct {
   metav1.TypeMeta   `json:",inline"`
   metav1.ObjectMeta `json:"metadata,omitempty"`

   Spec   ProviderSpec   `json:"spec,omitempty"`
   Status ProviderStatus `json:"status,omitempty"`
}

// BootstrapProvider is the Schema for the BootstrapProviders API
type BootstrapProvider struct {
   metav1.TypeMeta   `json:",inline"`
   metav1.ObjectMeta `json:"metadata,omitempty"`

   Spec   ProviderSpec   `json:"spec,omitempty"`
   Status ProviderStatus `json:"status,omitempty"`
}

// ControlPlaneProvider is the Schema for the ControlPlaneProviders API
type ControlPlaneProvider struct {
   metav1.TypeMeta   `json:",inline"`
   metav1.ObjectMeta `json:"metadata,omitempty"`

   Spec   ProviderSpec   `json:"spec,omitempty"`
   Status ProviderStatus `json:"status,omitempty"`
}

// InfrastructureProvider is the Schema for the InfrastructureProviders API
type InfrastructureProvider struct {
   metav1.TypeMeta   `json:",inline"`
   metav1.ObjectMeta `json:"metadata,omitempty"`

   Spec   ProviderSpec   `json:"spec,omitempty"`
   Status ProviderStatus `json:"status,omitempty"`
}
```

Below you can find details about `ProviderSpec` and `ProviderStatus`, which are
shared among all the provider types: Core, Bootstrap, ControlPlane, and
Infrastructure.

```golang
// ProviderSpec defines the desired state of the Provider.
type ProviderSpec struct {
   // Version indicates the provider version.
   // +optional
   Version *string `json:"version,omitempty"`

   // Manager defines the properties that can be enabled on the controller manager for the provider.
   // +optional
   Manager ManagerSpec `json:"manager,omitempty"`

   // Deployment defines the properties that can be enabled on the deployment for the provider.
   // +optional
   Deployment *DeploymentSpec `json:"deployment,omitempty"`

   // SecretName is the name of the Secret providing the configuration
   // variables for the current provider instance, like e.g. credentials.
   // Such configurations will be used when creating or upgrading provider components.
   // The contents of the secret will be treated as immutable. If changes need
   // to be made, a new object can be created and the name should be updated.
   // The contents should be in the form of key:value. This secret must be in
   // the same namespace as the provider.
   // +optional
   SecretName *string `json:"secretName,omitempty"`

   // FetchConfig determines how the operator will fetch the components and metadata for the provider.
   // If nil, the operator will try to fetch components according to default
   // embedded fetch configuration for the given kind and `ObjectMeta.Name`.
   // For example, the infrastructure name `aws` will fetch artifacts from
   // https://github.com/kubernetes-sigs/cluster-api-provider-aws/releases.
   // +optional
   FetchConfig *FetchConfiguration `json:"fetchConfig,omitempty"`

   // Paused prevents the operator from reconciling the provider. This can be
   // used when doing an upgrade or move action manually.
   // +optional
   Paused bool `json:"paused,omitempty"`
}

// ManagerSpec defines the properties that can be enabled on the controller manager for the provider.
type ManagerSpec struct {
   // ControllerManagerConfigurationSpec defines the desired state of GenericControllerManagerConfiguration.
   ctrlruntime.ControllerManagerConfigurationSpec `json:",inline"`

   // ProfilerAddress defines the bind address to expose the pprof profiler (e.g. localhost:6060).
   // Default empty, meaning the profiler is disabled.
   // Controller Manager flag is --profiler-address.
   // +optional
   ProfilerAddress *string `json:"profilerAddress,omitempty"`

   // MaxConcurrentReconciles is the maximum number of concurrent Reconciles
   // which can be run. Defaults to 10.
   // +optional
   MaxConcurrentReconciles *int `json:"maxConcurrentReconciles,omitempty"`

   // Verbosity sets the logs verbosity. Defaults to 1.
   // Controller Manager flag is --verbosity.
   // +optional
   Verbosity int `json:"verbosity,omitempty"`

   // Debug, if set, will override a set of fields with opinionated values for
   // a debugging session. (Verbosity=5, ProfilerAddress=localhost:6060)
   // +optional
   Debug bool `json:"debug,omitempty"`

   // FeatureGates define provider specific feature flags that will be passed
   // in as container args to the provider's controller manager.
   // Controller Manager flag is --feature-gates.
   FeatureGates map[string]bool `json:"featureGates,omitempty"`
}

// DeploymentSpec defines the properties that can be enabled on the Deployment for the provider.
type DeploymentSpec struct {
   // Number of desired pods. This is a pointer to distinguish between explicit zero and not specified. Defaults to 1.
   // +optional
   Replicas *int `json:"replicas,omitempty"`

   // NodeSelector is a selector which must be true for the pod to fit on a node.
   // Selector which must match a node's labels for the pod to be scheduled on that node.
   // More info: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
   // +optional
   NodeSelector map[string]string `json:"nodeSelector,omitempty"`

   // If specified, the pod's tolerations.
   // +optional
   Tolerations []corev1.Toleration `json:"tolerations,omitempty"`

   // If specified, the pod's scheduling constraints
   // +optional
   Affinity *corev1.Affinity `json:"affinity,omitempty"`

   // List of containers specified in the Deployment
   // +optional
   Containers []ContainerSpec `json:"containers"`
}

// ContainerSpec defines the properties available to override for each
// container in a provider deployment such as Image and Args to the container’s
// entrypoint.
type ContainerSpec struct {
   // Name of the container. Cannot be updated.
   Name string `json:"name"`

   // Container Image Name
   // +optional
   Image *ImageMeta `json:"image,omitempty"`

   // Args represents extra provider specific flags that are not encoded as fields in this API.
   // Explicit controller manager properties defined in the `Provider.ManagerSpec`
   // will have higher precedence than those defined in `ContainerSpec.Args`.
   // For example, `ManagerSpec.SyncPeriod` will be used instead of the
   // container arg `--sync-period` if both are defined.
   // The same holds for `ManagerSpec.FeatureGates` and `--feature-gates`.
   // +optional
   Args map[string]string `json:"args,omitempty"`

   // List of environment variables to set in the container.
   // +optional
   Env []corev1.EnvVar `json:"env,omitempty"`

   // Compute resources required by this container.
   // +optional
   Resources *corev1.ResourceRequirements `json:"resources,omitempty"`
}

// ImageMeta allows customizing the image used
type ImageMeta struct {
   // Repository sets the container registry to pull images from.
   // +optional
   Repository *string `json:"repository,omitempty"`

   // Name allows specifying a name for the image.
   // +optional
   Name *string `json:"name,omitempty"`

   // Tag allows specifying a tag for the image.
   // +optional
   Tag *string `json:"tag,omitempty"`
}

// FetchConfiguration determines the way to fetch the components and metadata for the provider.
type FetchConfiguration struct {
   // URL to be used for fetching the provider’s components and metadata from a remote Github repository.
   // For example, https://github.com/{owner}/{repository}/releases
   // The version of the release will be `ProviderSpec.Version` if defined
   // otherwise the `latest` version will be computed and used.
   // +optional
   URL *string `json:"url,omitempty"`

   // Selector to be used for fetching provider’s components and metadata from
   // ConfigMaps stored inside the cluster. Each ConfigMap is expected to contain
   // components and metadata for a specific version only.
   // +optional
   Selector *metav1.LabelSelector `json:"selector,omitempty"`
}

// ProviderStatus defines the observed state of the Provider.
type ProviderStatus struct {
   // Contract will contain the core provider contract that the provider is
   // abiding by, like e.g. v1alpha3.
   // +optional
   Contract *string `json:"contract,omitempty"`

   // Conditions define the current service state of the cluster.
   // +optional
   Conditions Conditions `json:"conditions,omitempty"`

   // ObservedGeneration is the latest generation observed by the controller.
   // +optional
   ObservedGeneration int64 `json:"observedGeneration,omitempty"`
}
```

**Validation and defaulting rules for Provider and ProviderSpec**
- The `Name` field within `metav1.ObjectMeta` could be any valid Kubernetes
  name; however, it is recommended to use Cluster API provider names. For
  example: aws, vsphere, kubeadm. These names will be used to fetch the
  default configurations in case there is no specific FetchConfiguration
  defined.
- `ProviderSpec.Version` should be a valid version with the "v" prefix,
  as commonly used in the Kubernetes ecosystem; if this value is nil when a
  new provider is created, the operator will determine the version to use
  by applying the same rules implemented in clusterctl (latest).
  Once the latest version is calculated, it will be set in
  `ProviderSpec.Version`.
- Note: As per discussion in the CAEP PR, we will keep the `SecretName` field
  to allow the provider authors ample time to implement their own credential
  management to support multiple workload clusters. [See this thread for more
  info][secret-name-discussion].

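The version-defaulting rule above can be sketched as follows. This is an illustrative snippet, not operator code: the `defaultVersion` helper, the naive numeric comparison (no pre-release handling), and the hard-coded release list are assumptions of the example.

```golang
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parse turns a "v"-prefixed version such as "v0.6.2" into numeric components.
func parse(v string) []int {
	parts := strings.Split(strings.TrimPrefix(v, "v"), ".")
	nums := make([]int, len(parts))
	for i, p := range parts {
		nums[i], _ = strconv.Atoi(p)
	}
	return nums
}

// less reports whether version a is older than version b.
func less(a, b string) bool {
	av, bv := parse(a), parse(b)
	for i := 0; i < len(av) && i < len(bv); i++ {
		if av[i] != bv[i] {
			return av[i] < bv[i]
		}
	}
	return len(av) < len(bv)
}

// defaultVersion returns spec.Version when it is set; otherwise it picks the
// latest available release tag, mirroring the clusterctl "latest" rule.
func defaultVersion(specVersion *string, available []string) string {
	if specVersion != nil && *specVersion != "" {
		return *specVersion
	}
	latest := available[0]
	for _, v := range available[1:] {
		if less(latest, v) {
			latest = v
		}
	}
	return latest
}

func main() {
	releases := []string{"v0.5.4", "v0.6.0", "v0.6.1"}
	fmt.Println(defaultVersion(nil, releases)) // v0.6.1
}
```

The computed value would then be written back into `ProviderSpec.Version`, so subsequent reconciles see an explicit version rather than recomputing "latest".
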
**Validation rules for ProviderSpec.FetchConfiguration**
- If `FetchConfiguration` is nil or empty, the operator will
  apply the embedded fetch configuration for the given kind and
  `ObjectMeta.Name`. For example, the infrastructure name `aws` will fetch
  artifacts from
  https://github.com/kubernetes-sigs/cluster-api-provider-aws/releases.
- If `FetchConfiguration` is not nil, exactly one of `URL` or `Selector` must be
  specified.
- `FetchConfiguration.Selector` is used to fetch the provider’s components and
  metadata from ConfigMaps stored inside the cluster. Each ConfigMap is
  expected to contain components and metadata for a specific version only. So
  if multiple versions of the providers need to be specified, they can be
  added as separate ConfigMaps and labeled with the same selector. This
  provides the same behavior as the “local” provider repositories, but now from
  within the management cluster.
- `FetchConfiguration` is used only during init and upgrade operations.
  Changes made to the contents of `FetchConfiguration` will not trigger a
  reconciliation. This is similar behavior to `ProviderSpec.SecretName`.

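The exactly-one-of rule can be expressed as a small validation function. This is a sketch under simplifying assumptions: `Selector` is reduced to a string (the real field is a `metav1.LabelSelector`), and the `validate` helper is hypothetical, not part of the proposal.

```golang
package main

import (
	"errors"
	"fmt"
)

// FetchConfiguration mirrors the proposal's type in simplified form.
type FetchConfiguration struct {
	URL      *string
	Selector *string
}

// validate enforces the rule above: when a FetchConfiguration is provided,
// exactly one of URL or Selector must be specified.
func validate(fc *FetchConfiguration) error {
	if fc == nil {
		// Operator falls back to the embedded fetch configuration.
		return nil
	}
	if (fc.URL == nil) == (fc.Selector == nil) {
		return errors.New("exactly one of url or selector must be specified")
	}
	return nil
}

func main() {
	url := "https://github.com/myorg/awesome-azure-provider/releases"
	fmt.Println(validate(&FetchConfiguration{URL: &url})) // no error
	fmt.Println(validate(&FetchConfiguration{}))          // error: neither field set
}
```

In practice this check would live in a validating webhook, so an invalid object is rejected at admission time rather than surfacing as a failed reconcile.
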
**Validation Rules for ProviderSpec.ManagerSpec**
- `ControllerManagerConfigurationSpec` is a type from
  `controller-runtime/pkg/config` and is embedded into the `ManagerSpec`.
  This type exposes the LeaderElection, SyncPeriod, Webhook, Health and
  Metrics configurations.
- If `ManagerSpec.Debug` is set to true, the operator will not allow changes
  to other properties since it is in Debug mode.
- If you need to set specific concurrency values for each reconcile loop (e.g.
  `awscluster-concurrency`), you can leave
  `ManagerSpec.MaxConcurrentReconciles` nil and use `Container.Args`.
- If `ManagerSpec.MaxConcurrentReconciles` is set and a specific concurrency
  flag such as `awscluster-concurrency` is set on the `Container.Args`, then
  the more specific concurrency flag will have higher precedence.

**Validation Rules for ContainerSpec**
- `ContainerSpec.Args` will ignore the key `namespace`, since the operator
  enforces a deployment model where all the providers should be configured to
  watch all the namespaces.
- Explicit controller manager properties defined in `Provider.ManagerSpec`
  will have higher precedence than those defined in `ContainerSpec.Args`. That
  is, if `ManagerSpec.SyncPeriod` is defined, it will be used instead of the
  container arg `sync-period`. The same is true for
  `ManagerSpec.FeatureGates`; it will have higher precedence than the
  container arg `feature-gates`.
- If no `ContainerSpec.Resources` are defined, the defaults on the Deployment
  object within the provider’s components yaml will be used.

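The precedence rules above (explicit `ManagerSpec` fields win, the `namespace` key is dropped, and a provider-specific concurrency flag beats `MaxConcurrentReconciles`) can be sketched as a merge function. The `buildArgs` helper and the hard-coded `awscluster-concurrency` flag name are illustrative assumptions, not operator code.

```golang
package main

import "fmt"

// buildArgs merges provider-specific container args with explicit manager
// settings: explicit ManagerSpec fields override generic args, the
// "namespace" key is ignored, and an existing provider-specific concurrency
// flag takes precedence over MaxConcurrentReconciles.
func buildArgs(containerArgs map[string]string, syncPeriod *string, maxConcurrent *int) map[string]string {
	out := map[string]string{}
	for k, v := range containerArgs {
		if k == "namespace" {
			// The operator enforces watching all namespaces.
			continue
		}
		out[k] = v
	}
	if syncPeriod != nil {
		// Explicit ManagerSpec field wins over the generic container arg.
		out["sync-period"] = *syncPeriod
	}
	if maxConcurrent != nil {
		// The more specific per-loop flag, if present, keeps precedence.
		if _, specific := containerArgs["awscluster-concurrency"]; !specific {
			out["awscluster-concurrency"] = fmt.Sprint(*maxConcurrent)
		}
	}
	return out
}

func main() {
	sync := "10m"
	max := 10
	args := buildArgs(map[string]string{
		"sync-period": "5m",  // overridden by ManagerSpec.SyncPeriod
		"namespace":   "foo", // dropped
	}, &sync, &max)
	fmt.Println(args["sync-period"], args["awscluster-concurrency"]) // 10m 10
}
```
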
#### Example API Usage

1. As an admin, I want to install the aws infrastructure provider with
   specific controller flags.

```yaml
apiVersion: v1
kind: Secret
metadata:
 name: aws-variables
 namespace: capa-system
type: Opaque
data:
 AWS_REGION: ...
 AWS_ACCESS_KEY_ID: ...
 AWS_SECRET_ACCESS_KEY: ...
---
apiVersion: management.cluster.x-k8s.io/v1alpha1
kind: InfrastructureProvider
metadata:
 name: aws
 namespace: capa-system
spec:
 version: v0.6.0
 secretName: aws-variables
 manager:
   # These are top-level controller manager flags supported by all the providers.
   # These flags come with sensible defaults, thus requiring no or minimal
   # changes for the most common scenarios.
   metricsAddress: ":8181"
   syncPeriod: 660
 fetchConfig:
   url: https://github.com/kubernetes-sigs/cluster-api-provider-aws/releases
 deployment:
   containers:
   - name: manager
     args:
         # These are controller flags that are specific to a provider; usage
         # is reserved for advanced scenarios only.
         awscluster-concurrency: 12
         awsmachine-concurrency: 11
```

2. As an admin, I want to install the aws infrastructure provider but override
   the container image of the CAPA deployment.

```yaml
---
apiVersion: management.cluster.x-k8s.io/v1alpha1
kind: InfrastructureProvider
metadata:
 name: aws
 namespace: capa-system
spec:
 version: v0.6.0
 secretName: aws-variables
 deployment:
   containers:
   - name: manager
     image: gcr.io/myregistry/capa-controller:v0.6.0-foo
```

3. As an admin, I want to change the resource limits for the manager pod in
   my control plane provider deployment.

```yaml
---
apiVersion: management.cluster.x-k8s.io/v1alpha1
kind: ControlPlaneProvider
metadata:
 name: kubeadm
 namespace: capi-kubeadm-control-plane-system
spec:
 version: v0.3.10
 secretName: capi-variables
 deployment:
   containers:
   - name: manager
     resources:
       limits:
         cpu: 100m
         memory: 30Mi
       requests:
         cpu: 100m
         memory: 20Mi
```

4. As an admin, I would like to fetch my azure provider components from a
   specific repository which is not the default.

```yaml
---
apiVersion: management.cluster.x-k8s.io/v1alpha1
kind: InfrastructureProvider
metadata:
 name: myazure
 namespace: capz-system
spec:
 version: v0.4.9
 secretName: azure-variables
 fetchConfig:
   url: https://github.com/myorg/awesome-azure-provider/releases
```

5. As an admin, I would like to use the default fetch configurations by
   simply specifying the expected Cluster API provider names such as 'aws',
   'vsphere', 'azure', 'kubeadm', 'talos', or 'cluster-api' instead of having
   to explicitly specify the fetch configuration.
   In the example below, since we are using 'vsphere' as the name of the
   InfrastructureProvider, the operator will fetch its configuration from
   `url: https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/releases`
   by default.

See more examples in the [air-gapped environment section](#air-gapped-environment).

```yaml
---
apiVersion: management.cluster.x-k8s.io/v1alpha1
kind: InfrastructureProvider
metadata:
 name: vsphere
 namespace: capv-system
spec:
 version: v0.4.9
 secretName: vsphere-variables
```

#### Operator Behaviors

##### Installing a provider

In order to install a new Cluster API provider with the management cluster
operator, you have to create a provider object as shown above. See the first
example of API usage for creating the secret with variables and the provider
itself.

When processing a Provider object, the operator will apply the following rules:

- Providers with `spec.Type == CoreProvider` will be installed first; the
  other providers will be requeued until the core provider exists.
- Before installing any provider, the following preflight checks will be executed:
    - There should not be another instance of the same provider (same Kind, same
      name) in any namespace.
    - The Cluster API contract the provider is abiding by, e.g. v1alpha4, must
      match the contract of the core provider.
- The operator will set conditions on the Provider object to surface any
  installation issues, such as failed preflight checks and/or order of
  installation, to accurately inform the user.
- When `FetchConfiguration` is not defined, the operator will apply the
  embedded fetch configuration for the given kind and `ObjectMeta.Name`. For
  the `aws` example above, the operator will fetch artifacts from
  https://github.com/kubernetes-sigs/cluster-api-provider-aws/releases.

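The preflight checks above can be sketched as a single function. This is illustrative only: the `provider` struct and `preflight` helper are simplifications of the real objects, which carry Kind and contract information through the API machinery.

```golang
package main

import (
	"errors"
	"fmt"
)

// provider is a simplified view of an installed provider object.
type provider struct {
	Kind, Name, Namespace, Contract string
}

// preflight mirrors the checks above: no other instance with the same
// (Kind, Name) in any namespace, and the candidate must abide by the same
// Cluster API contract as the core provider.
func preflight(candidate provider, installed []provider, coreContract string) error {
	for _, p := range installed {
		if p.Kind == candidate.Kind && p.Name == candidate.Name {
			return fmt.Errorf("provider %s/%s already installed in namespace %s",
				p.Kind, p.Name, p.Namespace)
		}
	}
	if candidate.Contract != coreContract {
		return errors.New("provider contract does not match the core provider contract")
	}
	return nil
}

func main() {
	installed := []provider{{"InfrastructureProvider", "aws", "capa-system", "v1alpha4"}}
	// Duplicate (Kind, Name) is rejected even in a different namespace.
	err := preflight(provider{"InfrastructureProvider", "aws", "other-ns", "v1alpha4"},
		installed, "v1alpha4")
	fmt.Println(err)
}
```

A failed check would be reported as a condition on the Provider object and the reconcile requeued, rather than aborting permanently.
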
The installation process managed by the operator is consistent with the
implementation underlying the `clusterctl init` command and includes the
following steps:
- Fetching the provider artifacts (the components yaml and the metadata.yaml
  file).
- Applying image overrides, if any.
- Replacing variables in the infrastructure-components from EnvVar and
  Secret.
- Applying the resulting yaml to the cluster.

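The variable-replacement step can be sketched as follows. The `substitute` helper and the `${NAME}` token syntax are assumptions of this example (clusterctl-style substitution); the Secret value and variable name are made up for illustration.

```golang
package main

import (
	"fmt"
	"os"
	"regexp"
)

// substitute replaces ${VAR} tokens in the components yaml with values taken
// from the provider Secret, falling back to environment variables. Tokens
// that cannot be resolved are left intact so the operator can report them.
func substitute(components string, secret map[string]string) string {
	re := regexp.MustCompile(`\$\{([A-Z0-9_]+)\}`)
	return re.ReplaceAllStringFunc(components, func(tok string) string {
		name := re.FindStringSubmatch(tok)[1]
		if v, ok := secret[name]; ok {
			return v
		}
		if v, ok := os.LookupEnv(name); ok {
			return v
		}
		return tok
	})
}

func main() {
	yaml := "credentials: ${PROVIDER_CREDENTIALS}"
	out := substitute(yaml, map[string]string{"PROVIDER_CREDENTIALS": "abc123"})
	fmt.Println(out) // credentials: abc123
}
```
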
As a final consideration, please note that:
- The operator executes the installation for one provider at a time, while
  `clusterctl init` manages the installation of a group of providers with a
  single operation.
- `clusterctl init` uses environment variables and a local configuration file,
  while the operator uses a Secret; given that we want users to preserve
  current behaviour in clusterctl, the init operation should be modified to
  transfer the local configuration to the cluster.
  As part of `clusterctl init`, it will obtain the list of variables required
  by the provider components, read the corresponding values from the config
  file or environment variables, and build the secret.
  Any image overrides defined in the clusterctl config will also be applied to
  the provider's components.

In the following figure, the controllers for the providers are installed in
the namespaces that are defined by default.

![Figure 1](./images/capi-provider-operator/fig3.png "Figure for
installing providers in defined namespaces")
<div align="center">Installing providers in defined namespaces</div>
<br/>

In the following figure, the controllers for the providers are all installed in
the same namespace as configured by the user.

![Figure 2](./images/capi-provider-operator/fig4.png "Figure for
installing all providers in the same namespace")
<div align="center">Installing all providers in the same namespace</div>
<br/>

   660  ##### Upgrading a provider
   661  
To trigger an upgrade to a new version of a Cluster API provider, change
the `spec.Version` field.
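For example (a sketch; the API group and field names follow the API types proposed in this document), bumping the version on an infrastructure provider object:

```yaml
apiVersion: management.cluster.x-k8s.io/v1alpha1
kind: InfrastructureProvider
metadata:
  name: azure
  namespace: capz-system
spec:
  version: v0.4.9   # previously v0.4.8; changing this triggers the upgrade
  secretName: azure-variables
```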
   664  
   665  Upgrading a provider in the management cluster must abide by the golden rule
   666  that all the providers should respect the same Cluster API contract supported
   667  by the core provider.
   668  
   669  ##### Upgrades providers without changing contract
   670  
If the new version of the provider abides by the same version of the
Cluster API contract, the operator will execute the upgrade by performing:
- Deletion of the current instance of the provider components, while
  preserving CRDs, the namespace, and user objects.
- Installation of the new version of the provider components.
   676  
Please note that:
- The operator executes upgrades one provider at a time, while `clusterctl
  upgrade apply` manages upgrading a group of providers with a single
  operation.
- `clusterctl upgrade apply --contract` automatically determines the latest
  versions available for each provider, while with the declarative approach
  the user is responsible for manually editing the Provider objects' YAML.
- `clusterctl upgrade apply` currently uses environment variables and a local
  configuration file; this should be changed in order to use in-cluster
  provider configurations.
   686  
   687  ![Figure 3](./images/capi-provider-operator/fig1.png "Figure for
   688  upgrading provider without changing contract")
   689  <div align="center">Upgrading providers without changing contract</div>
   690  <br/>
   691  
   692  ##### Upgrades providers and changing contract
   693  
If the new version of the provider abides by a new version of the Cluster
API contract, all the other providers in the management cluster must be
upgraded to a version supporting the new contract as well.
   697  
   698  ![Figure 4](./images/capi-provider-operator/fig2.png "Figure for
   699  upgrading provider and changing contract")
   700  <div align="center">Upgrading providers and changing contract</div>
   701  <br/>
   702  
   703  As a first step, it is required to pause all the providers by setting the
   704  `spec.Paused` field to true for each provider; the operator will block any
   705  contract upgrade until all the providers are paused.
   706  
After all the providers are in a paused state, you can proceed with the
upgrade as described in the previous paragraph (change the `spec.Version`
field).
   709  
When a provider is paused, the number of replicas will be scaled to 0; the
operator will add a new
`management.cluster.x-k8s.io/original-controller-replicas` annotation to store
the original replica count.
   714  
Once all the providers are upgraded to a version that abides by the new
contract, it is possible for the operator to unpause providers; the operator
does not allow unpausing providers if there are still providers abiding by
the old contract.
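The pause step can be sketched as a patch on each provider object (field names follow the API types proposed in this document; the replica count shown is illustrative):

```yaml
# Applied by the user to each provider object before the contract upgrade:
spec:
  paused: true
# Added by the operator when it scales the deployment down to 0 replicas:
metadata:
  annotations:
    management.cluster.x-k8s.io/original-controller-replicas: "1"
```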
   719  
Please note that we are planning to embed this sequence (pause - upgrade -
unpause) as part of the `clusterctl upgrade apply` command when there is a
contract change.
   723  
   724  ##### Changing a provider
   725  
On top of changing a provider version (upgrades), the operator also supports
changing other provider fields, most notably controller flags and variables.
This can be achieved with either `kubectl edit` or `kubectl apply` on the
provider object.

Internally, the operation works like an upgrade: the current instance of the
provider is deleted, while preserving CRDs, the namespace, and user objects;
a new instance of the provider is installed with the new set of
flags/variables.
   734  
   735  Please note that clusterctl currently does not support this operation.
   736  
See Example 1 in [Example API Usage](#example-api-usage).
   738  
   739  ##### Deleting a provider
   740  
To delete a provider, delete the corresponding provider object.
   743  
   744  Deletion of the provider will be blocked if any workload cluster using the
   745  provider still exists.
   746  
   747  Additionally, deletion of a core provider should be blocked if there are still
   748  other providers in the management cluster.
   749  
   750  #### Upgrade from v1alpha3 management cluster to v1alpha4 cluster
   751  
Cluster API will provide instructions on how to upgrade from a v1alpha3
management cluster, created by clusterctl, to the new v1alpha4 management
cluster. These operations could require manual actions.
   755  
   756  Some of the actions are described below:
   757  - Run webhooks as part of the main manager. See [issue 3822].
   758  
   759  More details will be added as we better understand what a v1alpha4 cluster
   760  will look like.
   761  
   762  #### Operator Lifecycle Management
   763  
   764  ##### Operator Installation
   765  
- During the first phase of implementation `clusterctl` won't provide support
  for managing the operator, so the admin will have to install it manually by
  applying (e.g. with `kubectl apply` or similar solutions) the operator YAML
  that will be published in the operator subproject release artifacts.
- In the future, `clusterctl init` will install the operator and its
  corresponding CRDs as a prerequisite if the operator doesn't already exist.
  Please note that this command will consider image overrides defined in the
  local clusterctl config file.
   774  
##### Operator Upgrade
- During the first phase of implementation `clusterctl` operations will not be
  supported and the admin will have to upgrade the operator manually, e.g.
  using `kubectl apply` (or similar solutions) with the latest version of the
  operator YAML that will be published in the operator subproject release
  artifacts.
- The transition between a manually managed operator and a clusterctl managed
  operator will be documented later as we progress with the implementation.
- In the future the admin will be able to use `clusterctl upgrade operator` to
  upgrade the operator components. Please note that this command will consider
  image overrides defined in the local clusterctl config file. Other commands
  such as `clusterctl upgrade apply` will also allow upgrading the operator.
- `clusterctl upgrade plan` will identify when the operator can be upgraded by
  checking the cluster-api release artifacts.
- clusterctl will require a matching operator version. In the future, when
  clusterctl moves to beta/GA, we will reconsider supporting version skew
  between clusterctl and the operator.
   792  
##### Operator Delete
- During the first phase of implementation `clusterctl` operations will not be
  supported and the admin will have to delete the operator manually using
  `kubectl delete` (or similar solutions). However, it's the admin's
  responsibility to verify that there are no providers running in the
  management cluster.
- In the future, clusterctl will delete the operator as part of the
  `clusterctl delete --all` command.
   800  
   801  #### Air gapped environment
   802  
In order to install Cluster API providers in an air-gapped environment using
the operator, it is required to address the following issues.

1. Make the operator work in an air-gapped environment
   - Provide image overrides for the operator itself in order to pull the
     images from an accessible image repository. Please note that the
     overrides will be taken from the image overrides defined in the
     local clusterctl config file.
   - TBD if the operator YAML will be embedded in clusterctl or if it should
     be a special artifact within the core provider repository.
1. Make the providers work in an air-gapped environment
   - Provide fetch configuration for each provider reading from an
     accessible location (e.g. an internal github repository) or from
     ConfigMaps pre-created inside the cluster.
   - Provide image overrides for each provider in order to pull the images
     from an accessible image repository.
   819  
   820  **Example Usage**
   821  
   822  As an admin, I would like to fetch my azure provider components from within
   823  the cluster because I’m working within an air-gapped environment.
   824  
In this example, we have two ConfigMaps that define the components and
metadata of the provider. Both carry the label `provider-components: azure`
and live in the `capz-system` namespace.

The azure InfrastructureProvider has a `fetchConfig` which specifies the label
selector. This way the operator knows which versions of the azure provider are
available. Since the provider's version is set to `v0.4.9`, the operator uses
the components information from the matching ConfigMap to install the azure
provider.
   833  
```yaml
---
apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    provider-components: azure
  name: v0.4.9
  namespace: capz-system
data:
  components: |
    # components for v0.4.9 yaml goes here
  metadata: |
    # metadata information goes here
---
apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    provider-components: azure
  name: v0.4.8
  namespace: capz-system
data:
  components: |
    # components for v0.4.8 yaml goes here
  metadata: |
    # metadata information goes here
---
apiVersion: management.cluster.x-k8s.io/v1alpha1
kind: InfrastructureProvider
metadata:
  name: azure
  namespace: capz-system
spec:
  version: v0.4.9
  secretName: azure-variables
  fetchConfig:
    selector:
      matchLabels:
        provider-components: azure
```
   875  
   876  ### Risks and Mitigation
   877  
   878  #### Error Handling & Logging
   879  
   880  Currently, clusterctl provides quick feedback regarding required variables
   881  etc. With the operator in place we’ll need to ensure that the error messages
   882  and logs are easily available to the user to verify progress.
   883  
   884  #### Extensibility Options
   885  
Currently, clusterctl has a few extensibility options. For example,
clusterctl is built on top of a library that can be leveraged to build other
tools.
   889  
   890  It also exposes an interface for template processing if we choose to go a
   891  different route from `envsubst`. This may prove to be challenging in the
   892  context of the operator as this would mean a change to the operator
   893  binary/image. We could introduce a new behavior or communication protocol or
   894  hooks for the operator to interact with the custom template processor. This
   895  could be configured similarly to the fetch config, with multiple options built
   896  in.
   897  
   898  We have decided that supporting multiple template processors is a non-goal for
   899  this implementation of the proposal and we will rely on using the default
   900  `envsubst` template processor.
   901  
   902  #### Upgrade from v1alpha3 management cluster to v1alpha4/operator cluster
   903  
As of today, this is hard to define as we have yet to understand the
definition of what a v1alpha4 cluster will be. Once we better understand what
a v1alpha4 cluster will look like, we will then be able to determine the
upgrade sequence from v1alpha3.
   908  
Cluster API will provide instructions on how to upgrade from a v1alpha3
management cluster, created by clusterctl, to the new v1alpha4 management
cluster. These operations could require manual actions.
   912  
   913  Some of the actions are described below:
   914  - Run webhooks as part of the main manager. See [issue
   915    3822](https://github.com/kubernetes-sigs/cluster-api/issues/3822)
   916  
   917  ## Additional Details
   918  
   919  ### Test Plan
   920  
   921  The operator will be written with unit and integration tests using envtest and
   922  existing patterns as defined under the [Developer
   923  Guide/Testing](https://cluster-api.sigs.k8s.io/developer/testing.html) section
   924  in the Cluster API book.
   925  
Existing E2E tests will verify that existing clusterctl commands such as `init`
and `upgrade` work as expected. Any necessary changes will be made in order
to make them configurable.
   929  
   930  New E2E tests verifying the operator lifecycle itself will be added.
   931  
   932  New E2E tests verifying the upgrade from a v1alpha3 to v1alpha4 cluster will
   933  be added.
   934  
   935  ### Version Skew Strategy
   936  
- clusterctl will require a matching operator version. In the future, when
  clusterctl moves to beta/GA, we will reconsider supporting version skew
  between clusterctl and the operator.
   940  
   941  ## Implementation History
   942  
   943  - [x] 09/09/2020: Proposed idea in an issue or [community meeting]
   944  - [x] 09/14/2020: Compile a [Google Doc following the CAEP template][management cluster operator caep]
   945  - [x] 09/14/2020: First round of feedback from community
   946  - [x] 10/07/2020: Present proposal at a [community meeting]
   947  - [ ] 10/20/2020: Open proposal PR
   948  
   949  ## Controller Runtime Types
   950  
These types are pulled from [controller-runtime][controller-runtime-code-ref]
and [component-base][components-base-code-ref]. They are used as part of the
`ManagerSpec` and are duplicated here for convenience.
   954  
   955  ```golang
   956  // ControllerManagerConfigurationSpec defines the desired state of GenericControllerManagerConfiguration
   957  type ControllerManagerConfigurationSpec struct {
	// SyncPeriod determines the minimum frequency at which watched resources are
	// reconciled. A lower period will correct entropy more quickly, but reduce
	// responsiveness to change if there are many watched resources. Change this
	// value only if you know what you are doing. Defaults to 10 hours if unset.
	// There will be a 10 percent jitter between the SyncPeriod of all controllers
	// so that all controllers will not send list requests simultaneously.
   964  	// +optional
   965  	SyncPeriod *metav1.Duration `json:"syncPeriod,omitempty"`
   966  
   967  	// LeaderElection is the LeaderElection config to be used when configuring
   968  	// the manager.Manager leader election
   969  	// +optional
   970  	LeaderElection *configv1alpha1.LeaderElectionConfiguration `json:"leaderElection,omitempty"`
   971  
	// CacheNamespace if specified restricts the manager's cache to watch objects in
	// the desired namespace. Defaults to all namespaces.
   974  	//
   975  	// Note: If a namespace is specified, controllers can still Watch for a
   976  	// cluster-scoped resource (e.g Node).  For namespaced resources the cache
   977  	// will only hold objects from the desired namespace.
   978  	// +optional
   979  	CacheNamespace string `json:"cacheNamespace,omitempty"`
   980  
   981  	// GracefulShutdownTimeout is the duration given to runnable to stop before the manager actually returns on stop.
   982  	// To disable graceful shutdown, set to time.Duration(0)
	// To use graceful shutdown without timeout, set to a negative duration, e.g. time.Duration(-1)
   984  	// The graceful shutdown is skipped for safety reasons in case the leader election lease is lost.
   985  	GracefulShutdownTimeout *metav1.Duration `json:"gracefulShutDown,omitempty"`
   986  
	// Metrics contains the controller metrics configuration
   988  	// +optional
   989  	Metrics ControllerMetrics `json:"metrics,omitempty"`
   990  
   991  	// Health contains the controller health configuration
   992  	// +optional
   993  	Health ControllerHealth `json:"health,omitempty"`
   994  
   995  	// Webhook contains the controllers webhook configuration
   996  	// +optional
   997  	Webhook ControllerWebhook `json:"webhook,omitempty"`
   998  }
   999  
  1000  // ControllerMetrics defines the metrics configs
  1001  type ControllerMetrics struct {
  1002  	// BindAddress is the TCP address that the controller should bind to
  1003  	// for serving prometheus metrics.
  1004  	// It can be set to "0" to disable the metrics serving.
  1005  	// +optional
  1006  	BindAddress string `json:"bindAddress,omitempty"`
  1007  }
  1008  
  1009  // ControllerHealth defines the health configs
  1010  type ControllerHealth struct {
  1011  	// HealthProbeBindAddress is the TCP address that the controller should bind to
  1012  	// for serving health probes
  1013  	// +optional
  1014  	HealthProbeBindAddress string `json:"healthProbeBindAddress,omitempty"`
  1015  
  1016  	// ReadinessEndpointName, defaults to "readyz"
  1017  	// +optional
  1018  	ReadinessEndpointName string `json:"readinessEndpointName,omitempty"`
  1019  
  1020  	// LivenessEndpointName, defaults to "healthz"
  1021  	// +optional
  1022  	LivenessEndpointName string `json:"livenessEndpointName,omitempty"`
  1023  }
  1024  
  1025  // ControllerWebhook defines the webhook server for the controller
  1026  type ControllerWebhook struct {
  1027  	// Port is the port that the webhook server serves at.
  1028  	// It is used to set webhook.Server.Port.
  1029  	// +optional
  1030  	Port *int `json:"port,omitempty"`
  1031  
  1032  	// Host is the hostname that the webhook server binds to.
  1033  	// It is used to set webhook.Server.Host.
  1034  	// +optional
  1035  	Host string `json:"host,omitempty"`
  1036  
  1037  	// CertDir is the directory that contains the server key and certificate.
  1038  	// if not set, webhook server would look up the server key and certificate in
  1039  	// {TempDir}/k8s-webhook-server/serving-certs. The server key and certificate
  1040  	// must be named tls.key and tls.crt, respectively.
  1041  	// +optional
  1042  	CertDir string `json:"certDir,omitempty"`
  1043  }
  1044  
  1045  // LeaderElectionConfiguration defines the configuration of leader election
  1046  // clients for components that can run with leader election enabled.
  1047  type LeaderElectionConfiguration struct {
  1048  	// leaderElect enables a leader election client to gain leadership
  1049  	// before executing the main loop. Enable this when running replicated
  1050  	// components for high availability.
  1051  	LeaderElect *bool `json:"leaderElect"`
  1052  	// leaseDuration is the duration that non-leader candidates will wait
  1053  	// after observing a leadership renewal until attempting to acquire
  1054  	// leadership of a led but unrenewed leader slot. This is effectively the
  1055  	// maximum duration that a leader can be stopped before it is replaced
  1056  	// by another candidate. This is only applicable if leader election is
  1057  	// enabled.
  1058  	LeaseDuration metav1.Duration `json:"leaseDuration"`
  1059  	// renewDeadline is the interval between attempts by the acting master to
  1060  	// renew a leadership slot before it stops leading. This must be less
  1061  	// than or equal to the lease duration. This is only applicable if leader
  1062  	// election is enabled.
  1063  	RenewDeadline metav1.Duration `json:"renewDeadline"`
  1064  	// retryPeriod is the duration the clients should wait between attempting
  1065  	// acquisition and renewal of a leadership. This is only applicable if
  1066  	// leader election is enabled.
  1067  	RetryPeriod metav1.Duration `json:"retryPeriod"`
  1068  	// resourceLock indicates the resource object type that will be used to lock
  1069  	// during leader election cycles.
  1070  	ResourceLock string `json:"resourceLock"`
  1071  	// resourceName indicates the name of resource object that will be used to lock
  1072  	// during leader election cycles.
  1073  	ResourceName string `json:"resourceName"`
	// resourceNamespace indicates the namespace of the resource object that will
	// be used to lock during leader election cycles.
  1076  	ResourceNamespace string `json:"resourceNamespace"`
  1077  }
  1078  ```
  1079  
  1080  <!-- Links -->
  1081  [community meeting]: https://docs.google.com/document/d/1Ys-DOR5UsgbMEeciuG0HOgDQc8kZsaWIWJeKJ1-UfbY
  1082  [management cluster operator caep]: https://docs.google.com/document/d/1fQNlqsDkvEggWFi51GVxOglL2P1Bvo2JhZlMhm2d-Co/edit#
  1083  [controller-runtime-code-ref]: https://github.com/kubernetes-sigs/controller-runtime/blob/5c2b42d0dfe264fe1a187dcb11f384c0d193c042/pkg/config/v1alpha1/types.go
  1084  [components-base-code-ref]: https://github.com/kubernetes/component-base/blob/3b346c3e81285da5524c9379262ad4ca327b3c75/config/v1alpha1/types.go
  1085  [issue 3042]: https://github.com/kubernetes-sigs/cluster-api/issues/3042
  1086  [issue 3354]: https://github.com/kubernetes-sigs/cluster-api/issues/3354
[issue 3822]: https://github.com/kubernetes-sigs/cluster-api/issues/3822
  1088  [secret-name-discussion]: https://github.com/kubernetes-sigs/cluster-api/pull/3833#discussion_r540576353