---
title: MachinePool API
authors:
  - "@juan-lee"
  - "@CecileRobertMichon"
reviewers:
  - "@detiber"
  - "@justaugustus"
  - "@ncdc"
  - "@vincepri"
creation-date: 2019-09-19
last-updated: 2019-11-24
replaces:
  - [cluster-api-provider-azure Proposal](https://docs.google.com/document/d/1nbOqCIC0-ezdMXubZIV6EQrzD0QYPrpcdCBB4oSjWeQ/edit)
status: provisional
---

<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->

- [MachinePool API](#machinepool-api)
  - [Glossary](#glossary)
  - [Summary](#summary)
  - [Motivation](#motivation)
    - [Goals](#goals)
    - [Non-goals/Future Work](#non-goalsfuture-work)
  - [Proposal](#proposal)
    - [User Stories](#user-stories)
    - [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints)
      - [Data Model Changes](#data-model-changes)
      - [States and Transitions](#states-and-transitions)
        - [Pending](#pending)
          - [Transition Conditions](#transition-conditions)
          - [Expectations](#expectations)
        - [Provisioning](#provisioning)
          - [Transition Conditions](#transition-conditions-1)
          - [Expectations](#expectations-1)
        - [Provisioned](#provisioned)
          - [Transition Conditions](#transition-conditions-2)
          - [Expectations](#expectations-2)
        - [Running](#running)
          - [Transition Conditions](#transition-conditions-3)
          - [Expectations](#expectations-3)
        - [Deleting](#deleting)
          - [Transition Conditions](#transition-conditions-4)
          - [Expectations](#expectations-4)
        - [Failed](#failed)
          - [Transition Conditions](#transition-conditions-5)
          - [Expectations](#expectations-5)
      - [Controller Collaboration Diagram](#controller-collaboration-diagram)
      - [CABPK Changes](#cabpk-changes)
        - [Bootstrap Token lifetimes](#bootstrap-token-lifetimes)
    - [Risks and Mitigations](#risks-and-mitigations)
      - [MachinePool type might not cover all potential infrastructure providers](#machinepool-type-might-not-cover-all-potential-infrastructure-providers)
        - [Infrastructure Provider Features Required for MachinePool v1alpha3](#infrastructure-provider-features-required-for-machinepool-v1alpha3)
        - [Infrastructure Provider Features Potentially Required for MachinePool post-v1alpha3](#infrastructure-provider-features-potentially-required-for-machinepool-post-v1alpha3)
  - [Alternatives](#alternatives)
  - [Upgrade Strategy](#upgrade-strategy)
  - [Additional Details](#additional-details)
    - [Test Plan](#test-plan)
    - [Graduation Criteria](#graduation-criteria)
  - [Drawbacks](#drawbacks)
    - [Infrastructure Provider Behavior Differences](#infrastructure-provider-behavior-differences)
  - [Implementation History](#implementation-history)

<!-- END doctoc generated TOC please keep comment here to allow auto update -->


# MachinePool API

## Glossary

The lexicon used in this document is described in more detail
[here](https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/book/src/reference/glossary.md).
Any discrepancies should be rectified in the main Cluster API glossary.

- **ASG** - AWS Auto Scaling Group
- **MIG** - GCP Managed Instance Group
- **VMSS** - Azure Virtual Machine Scale Set

## Summary

In Cluster API (CAPI) v1alpha2, users can create MachineDeployment, MachineSet or Machine custom
resources. When you create a MachineDeployment or MachineSet, Cluster API components react and
eventually Machine resources are created. Cluster API's current architecture mandates that a Machine
maps to a single machine (virtual or bare metal) with the provider being responsible for the
management of the underlying machine's infrastructure.

Nearly all infrastructure providers have a way for their users to manage a group of machines
(virtual or bare metal) as a single entity. Each infrastructure provider offers its own unique
features, but nearly all are concerned with managing availability, health, and configuration
updates.

This proposal outlines adding a MachinePool API (type/controller) for managing many machines as a
single entity. A MachinePool is similar to a MachineDeployment in that they both define
configuration and policy for how a set of machines is managed. Both define a common
configuration, number of desired machine replicas, and policy for update. Both types also combine
information from Kubernetes as well as the underlying provider infrastructure to give a view of the
overall health of the machines in the set.

MachinePool diverges from MachineDeployment in that the MachineDeployment controller uses
MachineSets to achieve the aforementioned desired number of machines and to orchestrate updates to
the Machines in the managed set, while MachinePool delegates the responsibility of these concerns to
an infrastructure provider specific resource such as AWS Auto Scaling Groups, GCP Managed Instance
Groups, and Azure Virtual Machine Scale Sets.

MachinePool is optional and doesn't replace the need for MachineSet/Machine since not every
infrastructure provider will have an abstraction for managing multiple machines (e.g. bare metal).
Users may always opt to choose MachineSet/Machine when they don't see additional value in
MachinePool for their use case.

## Motivation

Infrastructure providers have invested a significant amount of time optimizing the way users manage
sets of machines as a single entity. The interface exposed by each infrastructure provider has a lot
of commonalities with the MachineDeployment type.
Allowing users of CAPI to leverage the
optimizations exposed by each infrastructure provider could prove beneficial.

**Potential benefits include:**

- Faster machine provisioning
- Improved provisioning success rates
- Automatic distribution of machines across availability zones if supported by the infrastructure
  provider
- CAPI-initiated rolling update of machines
- Higher maximum number of machines in a cluster (Azure limitations)
- Auto-scaling

### Goals

- To expose the MachinePool API for infrastructure providers to leverage their optimizations around managing large sets of machines.
- Support for user-initiated scale up/down.
- Support for declarative rolling update.

### Non-goals/Future Work

- To support enabling infrastructure provider specific autoscalers (at least in v1alpha3).
- To support cordon/drain during infrastructure provider specific rolling update.
- To manage control plane nodes with the MachinePool API.
- To integrate MachinePool with the Kubernetes cluster autoscaler.

## Proposal

This proposal introduces the MachinePool API for the purpose of delegating the management of pools
of machines to infrastructure provider supplied controllers.

### User Stories

- As an infrastructure provider author, I would like to build a controller to manage multiple
  machines with a common configuration using my provider specific resource for doing so.
- As a cluster operator, I would like to use MachinePool, similar to how I'm using MachineDeployment
  today, to manage a set of machines with a common configuration.

### Implementation Details/Notes/Constraints

#### Data Model Changes

MachinePool Spec and Status introduce the integration point for delegating the management of a set
of machines to the infrastructure provider.
Many of the fields are shared with MachineDeployment due
to infrastructure providers' desire to enable the management of a set of machines with a single
configuration.

``` go
type MachinePoolSpec struct
```

- **To add**
  - **ClusterName [required]**
    - Type: `string`
    - Description: Name of the Cluster this machine pool belongs to.
  - **FailureDomains [optional]**
    - Type: `[]string`
    - Description: FailureDomains is the list of failure domains this MachinePool should be attached to.
  - **Replicas [optional]**
    - Type: `*int32`
    - Description: Number of desired machine instances. Defaults to 1.
  - **Template [required]**
    - Type: `MachineTemplateSpec`
    - Description: Machine Template that describes the configuration of each machine instance in a
      machine pool.
  - **MinReadySeconds [optional]**
    - Type: `*int32`
    - Description: Minimum number of seconds for which a newly created machine should be ready.
  - **ProviderIDList [optional]**
    - Type: `[]string`
    - Description: ProviderIDList contains a ProviderID for each machine instance in the machine
      pool that is currently managed by the infrastructure provider.

``` go
type MachinePoolStatus struct
```

- **To add**
  - **NodeRefs [optional]**
    - Type: `[]corev1.ObjectReference`
    - Description: NodeRefs contains a NodeRef for each ProviderID in MachinePoolSpec.ProviderIDList.
  - **Replicas [optional]**
    - Type: `*int32`
    - Description: Replicas is the most recent observed number of replicas.
  - **ReadyReplicas [optional]**
    - Type: `*int32`
    - Description: The number of ready replicas for this MachinePool.
  - **AvailableReplicas [optional]**
    - Type: `*int32`
    - Description: The number of available replicas (ready for at least minReadySeconds) for this
      MachinePool.
  - **UnavailableReplicas [optional]**
    - Type: `*int32`
    - Description: Total number of unavailable machines targeted by this machine pool. This is the
      total number of machines that are still required for this machine pool to have 100% available
      capacity. They may either be machines that are running but not yet available or machines that
      still have not been created.
  - **FailureReason [optional]**
    - Type: `*capierrors.MachinePoolStatusError`
    - Description: FailureReason will be set in the event that there is a terminal problem
      reconciling the MachinePool and will contain a succinct value suitable for machine interpretation.
  - **FailureMessage [optional]**
    - Type: `*string`
    - Description: FailureMessage indicates that there is a problem reconciling the state, and will be
      set to a descriptive error message.
  - **Phase [optional]**
    - Type: `string`
    - Description: Phase represents the current phase of machine pool actuation,
      e.g. Pending, Running, Deleting, Failed etc.
  - **BootstrapReady [optional]**
    - Type: `bool`
    - Description: True when the bootstrap provider status is ready.
  - **InfrastructureReady [optional]**
    - Type: `bool`
    - Description: True when the infrastructure provider status is ready.

#### States and Transitions

##### Pending

``` go
// MachinePoolPhasePending is the first state a MachinePool is assigned by
// Cluster API MachinePool controller after being created.
MachinePoolPhasePending = MachinePoolPhase("pending")
```

###### Transition Conditions

- MachinePool.Phase is empty

###### Expectations

- When MachinePool.Spec.Template.Spec.Bootstrap.DataSecretName is:
  - \<nil\>, expect the field to be set by an external controller.
  - “” (empty string), expect the bootstrap step to be ignored.
  - “...” (populated by user or from the bootstrap provider), expect the contents to be used by a
    bootstrap or infra provider.
- When MachinePool.Spec.Template.Spec.InfrastructureRef is:
  - \<nil\> or not found, expect InfrastructureRef to be set/found during a subsequent requeue.
  - “...” (populated by user) and found, expect the infrastructure provider to be waiting for
    bootstrap data to be ready.
  - Found, expect InfrastructureRef to reference an object such as GoogleManagedInstanceGroup,
    AWSAutoScaleGroup, or AzureVMSS.

##### Provisioning

``` go
// MachinePoolPhaseProvisioning is the state when the
// MachinePool infrastructure is being created.
MachinePoolPhaseProvisioning = MachinePoolPhase("provisioning")
```

###### Transition Conditions

- MachinePool.Spec.Template.Spec.Bootstrap.ConfigRef -> Status.Ready is true
- MachinePool.Spec.Template.Spec.Bootstrap.DataSecretName is not \<nil\>

###### Expectations

- MachinePool’s infrastructure to be in the process of being provisioned.

##### Provisioned

``` go
// MachinePoolPhaseProvisioned is the state when its
// infrastructure has been created and configured.
MachinePoolPhaseProvisioned = MachinePoolPhase("provisioned")
```

###### Transition Conditions

- MachinePool.Spec.Template.Spec.InfrastructureRef -> Status.Ready is true
- MachinePool.Status.Replicas is synced from MachinePool.Spec.Template.Spec.InfrastructureRef -> Status.Replicas

###### Expectations

- MachinePool’s infrastructure has been created and the compute resources are configured.

##### Running

``` go
// MachinePoolPhaseRunning is the MachinePool state when it has
// become a set of Kubernetes Nodes in a Ready state.
MachinePoolPhaseRunning = MachinePoolPhase("running")
```

###### Transition Conditions

- The number of Kubernetes Nodes matching MachinePool.Spec.ProviderIDList that are in a Ready state is equal to MachinePool.Spec.Replicas.

###### Expectations

- MachinePool controller should set MachinePool.Status.NodeRefs.

##### Deleting

``` go
// MachinePoolPhaseDeleting is the MachinePool state when a delete
// request has been sent to the API Server,
// but its infrastructure has not yet been fully deleted.
MachinePoolPhaseDeleting = MachinePoolPhase("deleting")
```

###### Transition Conditions

- MachinePool.ObjectMeta.DeletionTimestamp is not \<nil\>

###### Expectations

- MachinePool’s resources (Bootstrap and InfrastructureRef) should be deleted first using cascading deletion.

##### Failed

``` go
// MachinePoolPhaseFailed is the MachinePool state when the system
// might require user intervention.
MachinePoolPhaseFailed = MachinePoolPhase("failed")
```

###### Transition Conditions

- MachinePool.Status.FailureReason and/or MachinePool.Status.FailureMessage is populated.

###### Expectations

- User intervention required.

#### Controller Collaboration Diagram

#### CABPK Changes

**The interaction between MachinePool <-> CABPK will be identical to Machine <-> CABPK except in the
following cases:**

- A KubeadmConfig will be shared by all instances in a MachinePool versus a KubeadmConfig per
  Machine
- MachinePool is only supported for worker nodes; control plane support is not in scope

Additional details for **Support for MachinePool in CABPK** are captured in this
[issue](https://github.com/kubernetes-sigs/cluster-api/issues/1799).
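
As a recap of the States and Transitions section above, the per-phase transition conditions can be condensed into a single function. The following is a minimal sketch under stated assumptions, not the MachinePool controller's implementation: `poolState` and `derivePhase` are hypothetical names, `BootstrapReady` stands in for "the Provisioning transition conditions hold", the precedence between phases (Deleting, then Failed, Running, Provisioned, Provisioning, Pending) is an assumption, and so is requiring infrastructure readiness before reporting Running.

``` go
package main

import "fmt"

// MachinePoolPhase mirrors the phase type named in the proposal.
type MachinePoolPhase string

const (
	MachinePoolPhasePending      = MachinePoolPhase("pending")
	MachinePoolPhaseProvisioning = MachinePoolPhase("provisioning")
	MachinePoolPhaseProvisioned  = MachinePoolPhase("provisioned")
	MachinePoolPhaseRunning      = MachinePoolPhase("running")
	MachinePoolPhaseDeleting     = MachinePoolPhase("deleting")
	MachinePoolPhaseFailed       = MachinePoolPhase("failed")
)

// poolState is a hypothetical, flattened view of the fields that the
// transition conditions read. It is not a type from Cluster API.
type poolState struct {
	Deleting            bool   // ObjectMeta.DeletionTimestamp is not nil
	FailureReason       string // empty when Status.FailureReason is unset
	FailureMessage      string // empty when Status.FailureMessage is unset
	BootstrapReady      bool   // assumed: Provisioning conditions are met
	InfrastructureReady bool   // InfrastructureRef -> Status.Ready
	DesiredReplicas     int32  // Spec.Replicas
	ReadyNodes          int32  // Ready Nodes matching Spec.ProviderIDList
}

// derivePhase returns the phase implied by the state. The case order
// encodes an assumed precedence; the proposal does not define one.
func derivePhase(s poolState) MachinePoolPhase {
	switch {
	case s.Deleting:
		return MachinePoolPhaseDeleting
	case s.FailureReason != "" || s.FailureMessage != "":
		return MachinePoolPhaseFailed
	case s.InfrastructureReady && s.ReadyNodes == s.DesiredReplicas:
		return MachinePoolPhaseRunning
	case s.InfrastructureReady:
		return MachinePoolPhaseProvisioned
	case s.BootstrapReady:
		return MachinePoolPhaseProvisioning
	default:
		return MachinePoolPhasePending
	}
}

func main() {
	s := poolState{BootstrapReady: true, InfrastructureReady: true, DesiredReplicas: 3, ReadyNodes: 2}
	fmt.Println(derivePhase(s)) // provisioned
	s.ReadyNodes = 3
	fmt.Println(derivePhase(s)) // running
}
```

Collapsing the conditions into one ordered `switch` makes the assumed precedence explicit: a pool with a deletion timestamp reports Deleting even if a failure is also recorded.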

##### Bootstrap Token lifetimes

The bootstrap token TTL and renewal behavior will be carried over from CABPK's handling of
Machine. For those not familiar, there's a short 15m bootstrap token TTL to support infrastructure
provisioning that's periodically extended if the infrastructure provisioning doesn't complete within
the TTL. It's worthwhile to call out that extending the bootstrap token TTL will be leveraged by
MachinePool for scale up operations that occur after the initial TTL is exceeded.

In the future, bootstrap token handling might change once we [have a good story for injecting
secrets](https://github.com/kubernetes-sigs/cluster-api/issues/1739).

### Risks and Mitigations

#### MachinePool type might not cover all potential infrastructure providers

MachinePool is initially designed to encompass commonality across AWS, GCP, and Azure. CAPI adopting
a provider-agnostic scaling type early will allow other providers to give feedback towards evolving
the type before beta and GA milestones where API changes become more difficult.

##### Infrastructure Provider Features Required for MachinePool v1alpha3

- Target Capacity for the set/group
- Rolling update parameters

##### Infrastructure Provider Features Potentially Required for MachinePool post-v1alpha3

- Min/Max machine replicas in a set/group (autoscaling)

## Alternatives

Infrastructure Machine Controllers allocate from infrastructure provider specific scale group/set
resources. Some benefits of using provider specific scale group/set could be derived by this
approach, but it would be complex to manage.

## Upgrade Strategy

N/A as this proposal only adds new types.

## Additional Details

### Test Plan

TBD

### Graduation Criteria

TBD

## Drawbacks

### Infrastructure Provider Behavior Differences

Subtle differences in how infrastructure provider scale resources are implemented might lead to an
inconsistent experience across providers.

## Implementation History

- 09/18/2019: Proposed idea during cluster-api f2f
- 10/10/2019: Compile a Google Doc following the CAEP template
- 10/23/2019: First round of feedback from community
- 10/23/2019: Present proposal at a community meeting
- 10/31/2019: Open proposal PR