---
title: MachinePool API
authors:
  - "@juan-lee"
  - "@CecileRobertMichon"
reviewers:
  - "@detiber"
  - "@justaugustus"
  - "@ncdc"
  - "@vincepri"
creation-date: 2019-09-19
last-updated: 2019-11-24
replaces: https://docs.google.com/document/d/1nbOqCIC0-ezdMXubZIV6EQrzD0QYPrpcdCBB4oSjWeQ
status: provisional
---

<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->

- [MachinePool API](#machinepool-api)
  - [Glossary](#glossary)
  - [Summary](#summary)
  - [Motivation](#motivation)
    - [Goals](#goals)
    - [Non-goals/Future Work](#non-goalsfuture-work)
  - [Proposal](#proposal)
    - [User Stories](#user-stories)
    - [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints)
      - [Data Model Changes](#data-model-changes)
      - [States and Transitions](#states-and-transitions)
        - [Pending](#pending)
          - [Transition Conditions](#transition-conditions)
          - [Expectations](#expectations)
        - [Provisioning](#provisioning)
          - [Transition Conditions](#transition-conditions-1)
          - [Expectations](#expectations-1)
        - [Provisioned](#provisioned)
          - [Transition Conditions](#transition-conditions-2)
          - [Expectations](#expectations-2)
        - [Running](#running)
          - [Transition Conditions](#transition-conditions-3)
          - [Expectations](#expectations-3)
        - [Deleting](#deleting)
          - [Transition Conditions](#transition-conditions-4)
          - [Expectations](#expectations-4)
        - [Failed](#failed)
          - [Transition Conditions](#transition-conditions-5)
          - [Expectations](#expectations-5)
      - [Controller Collaboration Diagram](#controller-collaboration-diagram)
      - [CABPK Changes](#cabpk-changes)
        - [Bootstrap Token lifetimes](#bootstrap-token-lifetimes)
    - [Risks and Mitigations](#risks-and-mitigations)
      - [MachinePool type might not cover all potential infrastructure providers](#machinepool-type-might-not-cover-all-potential-infrastructure-providers)
        - [Infrastructure Provider Features Required for MachinePool v1alpha3](#infrastructure-provider-features-required-for-machinepool-v1alpha3)
        - [Infrastructure Provider Features Potentially Required for MachinePool post-v1alpha3](#infrastructure-provider-features-potentially-required-for-machinepool-post-v1alpha3)
  - [Alternatives](#alternatives)
  - [Upgrade Strategy](#upgrade-strategy)
  - [Additional Details](#additional-details)
    - [Test Plan](#test-plan)
    - [Graduation Criteria](#graduation-criteria)
  - [Drawbacks](#drawbacks)
    - [Infrastructure Provider Behavior Differences](#infrastructure-provider-behavior-differences)
  - [Implementation History](#implementation-history)

<!-- END doctoc generated TOC please keep comment here to allow auto update -->


# MachinePool API

## Glossary
The lexicon used in this document is described in more detail
[here](https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/book/src/reference/glossary.md).
Any discrepancies should be rectified in the main Cluster API glossary.

- **ASG** - AWS Auto Scaling Group
- **CABPK** - Cluster API Bootstrap Provider Kubeadm
- **MIG** - GCP Managed Instance Group
- **VMSS** - Azure Virtual Machine Scale Set

## Summary

In Cluster API (CAPI) v1alpha2, users can create MachineDeployment, MachineSet or Machine custom
resources. When you create a MachineDeployment or MachineSet, Cluster API components react and
eventually Machine resources are created. Cluster API's current architecture mandates that a Machine
maps to a single machine (virtual or bare metal), with the infrastructure provider responsible for
managing the underlying machine's infrastructure.

Nearly all infrastructure providers have a way for their users to manage a group of machines
(virtual or bare metal) as a single entity.
Each infrastructure provider offers its own unique
features, but nearly all are concerned with managing availability, health, and configuration
updates.

This proposal outlines adding a MachinePool API (type/controller) for managing many machines as a
single entity. A MachinePool is similar to a MachineDeployment in that they both define
configuration and policy for how a set of machines is managed. They both define a common
configuration, a number of desired machine replicas, and an update policy. Both types also combine
information from Kubernetes and the underlying provider infrastructure to give a view of the
overall health of the machines in the set.

MachinePool diverges from MachineDeployment in that the MachineDeployment controller uses
MachineSets to achieve the desired number of machines and to orchestrate updates to
the Machines in the managed set, while MachinePool delegates these concerns to
an infrastructure provider specific resource such as AWS Auto Scaling Groups, GCP Managed Instance
Groups, and Azure Virtual Machine Scale Sets.

MachinePool is optional and doesn't replace the need for MachineSet/Machine, since not every
infrastructure provider will have an abstraction for managing multiple machines (e.g. bare metal).
Users can always choose MachineSet/Machine when they don't see additional value in
MachinePool for their use case.

## Motivation

Infrastructure providers have invested a significant amount of time optimizing the way users manage
sets of machines as a single entity. The interface exposed by each infrastructure provider has a lot
in common with the MachineDeployment type. Allowing users of CAPI to leverage the
optimizations exposed by each infrastructure provider could prove beneficial.

**Potential benefits include:**
- Faster machine provisioning
- Improved provisioning success rates
- Automatic distribution of machines across availability zones, if supported by the infrastructure
  provider
- CAPI-initiated rolling updates of machines
- A higher maximum number of machines in a cluster (Azure limitations)
- Auto-scaling

### Goals

- To expose the MachinePool API for infrastructure providers to leverage their optimizations around managing large sets of machines.
- To support user-initiated scale up/down.
- To support declarative rolling updates.

### Non-goals/Future Work

- To support enabling infrastructure provider specific autoscalers (at least in v1alpha3).
- To support cordon/drain during infrastructure provider specific rolling updates.
- To manage control plane nodes with the MachinePool API.
- To integrate MachinePool with the Kubernetes cluster autoscaler.

## Proposal

This proposal introduces the MachinePool API for the purpose of delegating the management of pools
of machines to controllers supplied by infrastructure providers.

### User Stories

- As an infrastructure provider author, I would like to build a controller to manage multiple
  machines with a common configuration using my provider specific resource for doing so.
- As a cluster operator, I would like to use MachinePool, similar to how I'm using MachineDeployment
  today, to manage a set of machines with a common configuration.

### Implementation Details/Notes/Constraints

#### Data Model Changes

MachinePool Spec and Status introduce the integration point for delegating the management of a set
of machines to the infrastructure provider. Many of the fields are shared with MachineDeployment due
to infrastructure providers' desire to enable the management of a set of machines with a single
configuration.
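To make the field lists that follow more concrete, here is a minimal Go sketch of how the proposed Spec and Status fields could be declared. It is an illustration under assumptions, not the actual Cluster API types: `ObjectReference` and `MachineTemplateSpec` are hypothetical stand-ins for `corev1.ObjectReference` and the real Machine template type, `FailureReason` is simplified to `*string`, and the `Default` method and `unavailable` helper are hypothetical helpers that encode the documented "Defaults to 1" rule and the UnavailableReplicas description.

```go
package main

import "fmt"

// ObjectReference is a hypothetical, trimmed-down stand-in for
// corev1.ObjectReference so the sketch compiles without k8s.io/api.
type ObjectReference struct {
	Kind      string
	Namespace string
	Name      string
}

// MachineTemplateSpec is a hypothetical placeholder; the real Machine
// template contents are out of scope here.
type MachineTemplateSpec struct{}

// MachinePoolSpec mirrors the Spec fields proposed in this document.
type MachinePoolSpec struct {
	ClusterName     string              // required
	FailureDomains  []string            // optional
	Replicas        *int32              // optional; defaults to 1
	Template        MachineTemplateSpec // required
	MinReadySeconds *int32              // optional
	ProviderIDList  []string            // optional; one ProviderID per instance
}

// MachinePoolStatus mirrors the proposed Status fields. FailureReason is
// simplified to *string instead of *capierrors.MachinePoolStatusError.
type MachinePoolStatus struct {
	NodeRefs            []ObjectReference
	Replicas            *int32
	ReadyReplicas       *int32
	AvailableReplicas   *int32
	UnavailableReplicas *int32
	FailureReason       *string
	FailureMessage      *string
	Phase               string
	BootstrapReady      bool
	InfrastructureReady bool
}

// Default applies the documented default of one replica when Replicas is unset.
func (s *MachinePoolSpec) Default() {
	if s.Replicas == nil {
		one := int32(1)
		s.Replicas = &one
	}
}

// unavailable is a hypothetical helper: the number of machines still needed
// for 100% available capacity, i.e. desired minus available, never negative.
func unavailable(desired, available int32) int32 {
	if available >= desired {
		return 0
	}
	return desired - available
}

func main() {
	spec := MachinePoolSpec{ClusterName: "my-cluster"}
	spec.Default()
	fmt.Println(*spec.Replicas)    // 1
	fmt.Println(unavailable(5, 3)) // 2

	// No machines are available yet, so the single desired replica is unavailable.
	u := unavailable(*spec.Replicas, 0)
	status := MachinePoolStatus{Phase: "pending", UnavailableReplicas: &u}
	fmt.Println(status.Phase, *status.UnavailableReplicas) // pending 1
}
```

Keeping `Replicas` as a pointer, as the proposal does, lets a defaulting step tell "unset" apart from an explicit zero.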

``` go
type MachinePoolSpec struct
```

- **To add**
  - **ClusterName [required]**
    - Type: `string`
    - Description: Name of the Cluster this machine pool belongs to.
  - **FailureDomains [optional]**
    - Type: `[]string`
    - Description: FailureDomains is the list of failure domains this MachinePool should be attached to.
  - **Replicas [optional]**
    - Type: `*int32`
    - Description: Number of desired machine instances. Defaults to 1.
  - **Template [required]**
    - Type: `MachineTemplateSpec`
    - Description: Machine Template that describes the configuration of each machine instance in a
      machine pool.
  - **MinReadySeconds [optional]**
    - Type: `*int32`
    - Description: Minimum number of seconds for which a newly created machine should be ready
      before it is considered available.
  - **ProviderIDList [optional]**
    - Type: `[]string`
    - Description: ProviderIDList contains a ProviderID for each machine instance that belongs to the
      machine pool and is currently managed by the infrastructure provider.

``` go
type MachinePoolStatus struct
```

- **To add**
  - **NodeRefs [optional]**
    - Type: `[]corev1.ObjectReference`
    - Description: NodeRefs contains a NodeRef for each ProviderID in MachinePoolSpec.ProviderIDList.
  - **Replicas [optional]**
    - Type: `*int32`
    - Description: Replicas is the most recently observed number of replicas.
  - **ReadyReplicas [optional]**
    - Type: `*int32`
    - Description: The number of ready replicas for this MachinePool.
  - **AvailableReplicas [optional]**
    - Type: `*int32`
    - Description: The number of available replicas (ready for at least minReadySeconds) for this
      MachinePool.
  - **UnavailableReplicas [optional]**
    - Type: `*int32`
    - Description: Total number of unavailable machines targeted by this machine pool. This is the
      total number of machines that are still required for this machine pool to have 100% available
      capacity.
      They may either be machines that are running but not yet available or machines that
      have not been created yet.
  - **FailureReason [optional]**
    - Type: `*capierrors.MachinePoolStatusError`
    - Description: FailureReason will be set in the event that there is a terminal problem
      reconciling the MachinePool and will contain a succinct value suitable for machine interpretation.
  - **FailureMessage [optional]**
    - Type: `*string`
    - Description: FailureMessage indicates that there is a problem reconciling the state, and will be
      set to a descriptive error message.
  - **Phase [optional]**
    - Type: `string`
    - Description: Phase represents the current phase of machine pool actuation,
      e.g. Pending, Running, Deleting, Failed, etc.
  - **BootstrapReady [optional]**
    - Type: `bool`
    - Description: True when the bootstrap provider status is ready.
  - **InfrastructureReady [optional]**
    - Type: `bool`
    - Description: True when the infrastructure provider status is ready.

#### States and Transitions

##### Pending

``` go
// MachinePoolPhasePending is the first state a MachinePool is assigned by
// Cluster API MachinePool controller after being created.
MachinePoolPhasePending = MachinePoolPhase("pending")
```

###### Transition Conditions

- MachinePool.Phase is empty

###### Expectations

- When MachinePool.Spec.Template.Spec.Bootstrap.DataSecretName is:
  - \<nil\>, expect the field to be set by an external controller.
  - `""` (empty string), expect the bootstrap step to be ignored.
  - `"..."` (populated by the user or from the bootstrap provider), expect the contents to be used by a
    bootstrap or infra provider.
- When MachinePool.Spec.Template.Spec.InfrastructureRef is:
  - \<nil\> or not found, expect InfrastructureRef to be set/found during a subsequent requeue.
  - `"..."` (populated by the user) and found, expect the infrastructure provider to wait for bootstrap
    data to be ready.
  - Found, expect InfrastructureRef to reference an object such as GoogleManagedInstanceGroup,
    AWSAutoScaleGroup, or AzureVMSS.

##### Provisioning

``` go
// MachinePoolPhaseProvisioning is the state when the
// MachinePool infrastructure is being created.
MachinePoolPhaseProvisioning = MachinePoolPhase("provisioning")
```

###### Transition Conditions

- MachinePool.Spec.Template.Spec.Bootstrap.ConfigRef -> Status.Ready is true
- MachinePool.Spec.Template.Spec.Bootstrap.DataSecretName is not \<nil\>

###### Expectations

- MachinePool's infrastructure is in the process of being provisioned.

##### Provisioned

``` go
// MachinePoolPhaseProvisioned is the state when its
// infrastructure has been created and configured.
MachinePoolPhaseProvisioned = MachinePoolPhase("provisioned")
```

###### Transition Conditions

- MachinePool.Spec.Template.Spec.InfrastructureRef -> Status.Ready is true
- MachinePool.Status.Replicas is synced from MachinePool.Spec.Template.Spec.InfrastructureRef -> Status.Replicas

###### Expectations

- MachinePool's infrastructure has been created and the compute resources are configured.

##### Running

``` go
// MachinePoolPhaseRunning is the MachinePool state when it has
// become a set of Kubernetes Nodes in a Ready state.
MachinePoolPhaseRunning = MachinePoolPhase("running")
```

###### Transition Conditions

- The number of Kubernetes Nodes in a Ready state that match MachinePool.Spec.ProviderIDList equals MachinePool.Spec.Replicas.

###### Expectations

- MachinePool controller should set MachinePool.Status.NodeRefs.
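The Pending, Provisioning, Provisioned and Running transition conditions above can be read as a function of a few observed inputs. The Go sketch below is a hypothetical illustration under stated assumptions, not the MachinePool controller's actual code. `poolView`, `Node`, `bootstrapReady`, `readyNodesInPool` and `nextPhase` are invented names. The sketch assumes that the ConfigRef readiness check applies only when a ConfigRef is set, so a user-supplied `DataSecretName` (including `""`) is enough on its own. It also leaves out the Deleting and Failed phases and the Status.Replicas sync.

```go
package main

import "fmt"

// MachinePoolPhase and the constants below use the same values as the snippets in this section.
type MachinePoolPhase string

const (
	MachinePoolPhasePending      = MachinePoolPhase("pending")
	MachinePoolPhaseProvisioning = MachinePoolPhase("provisioning")
	MachinePoolPhaseProvisioned  = MachinePoolPhase("provisioned")
	MachinePoolPhaseRunning      = MachinePoolPhase("running")
)

// Node is a hypothetical, minimal view of a Kubernetes Node.
type Node struct {
	ProviderID string
	Ready      bool
}

// poolView is a hypothetical flattened view of what the controller reads:
// the MachinePool plus the status of its bootstrap and infrastructure objects.
type poolView struct {
	DesiredReplicas int32    // MachinePool.Spec.Replicas
	ProviderIDList  []string // MachinePool.Spec.ProviderIDList
	DataSecretName  *string  // nil: wait; "": skip bootstrap; otherwise: use it
	HasConfigRef    bool     // Bootstrap.ConfigRef is set
	ConfigRefReady  bool     // Bootstrap.ConfigRef -> Status.Ready
	InfraReady      bool     // InfrastructureRef -> Status.Ready
}

// bootstrapReady requires a ConfigRef, if present, to report ready, and
// DataSecretName to be non-nil ("" still counts: bootstrap is skipped).
func bootstrapReady(p poolView) bool {
	if p.HasConfigRef && !p.ConfigRefReady {
		return false
	}
	return p.DataSecretName != nil
}

// readyNodesInPool counts Ready nodes whose ProviderID is in providerIDs.
func readyNodesInPool(providerIDs []string, nodes []Node) int32 {
	inPool := make(map[string]bool, len(providerIDs))
	for _, id := range providerIDs {
		inPool[id] = true
	}
	var n int32
	for _, node := range nodes {
		if node.Ready && inPool[node.ProviderID] {
			n++
		}
	}
	return n
}

// nextPhase checks the transition conditions in order and returns the
// phase reached before the first condition that is not yet met.
func nextPhase(p poolView, nodes []Node) MachinePoolPhase {
	if !bootstrapReady(p) {
		return MachinePoolPhasePending
	}
	if !p.InfraReady {
		return MachinePoolPhaseProvisioning
	}
	if readyNodesInPool(p.ProviderIDList, nodes) != p.DesiredReplicas {
		return MachinePoolPhaseProvisioned
	}
	return MachinePoolPhaseRunning
}

func main() {
	secret := "bootstrap-data" // hypothetical secret name
	p := poolView{
		DesiredReplicas: 2,
		ProviderIDList:  []string{"example:///node-1", "example:///node-2"},
		DataSecretName:  &secret,
		HasConfigRef:    true,
		ConfigRefReady:  true,
		InfraReady:      true,
	}
	nodes := []Node{{ProviderID: "example:///node-1", Ready: true}}
	fmt.Println(nextPhase(p, nodes)) // provisioned (1 of 2 nodes Ready)

	nodes = append(nodes, Node{ProviderID: "example:///node-2", Ready: true})
	fmt.Println(nextPhase(p, nodes)) // running
}
```

This sketch derives the phase from observed state on every call rather than from the previously stored phase. That is one way to keep the result consistent when reconciliation is re-run at arbitrary times. Nodes whose ProviderID is not in ProviderIDList are ignored, matching the Running transition condition.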

##### Deleting

``` go
// MachinePoolPhaseDeleting is the MachinePool state when a delete
// request has been sent to the API Server,
// but its infrastructure has not yet been fully deleted.
MachinePoolPhaseDeleting = MachinePoolPhase("deleting")
```

###### Transition Conditions

- MachinePool.ObjectMeta.DeletionTimestamp is not \<nil\>

###### Expectations

- MachinePool's resources (Bootstrap and InfrastructureRef) should be deleted first using cascading deletion.

##### Failed

``` go
// MachinePoolPhaseFailed is the MachinePool state when the system
// might require user intervention.
MachinePoolPhaseFailed = MachinePoolPhase("failed")
```

###### Transition Conditions

- MachinePool.Status.FailureReason and/or MachinePool.Status.FailureMessage is populated.

###### Expectations

- User intervention required.

#### Controller Collaboration Diagram

#### CABPK Changes

**The interaction between MachinePool <-> CABPK will be identical to Machine <-> CABPK except in the
following cases:**

- A KubeadmConfig will be shared by all instances in a MachinePool, versus one KubeadmConfig per
  Machine
- MachinePool is only supported for worker nodes; control plane support is not in scope

Additional details for **Support for MachinePool in CABPK** are captured in this
[issue](https://github.com/kubernetes-sigs/cluster-api/issues/1799).

##### Bootstrap Token lifetimes

The bootstrap token TTL and renewal behavior will be carried over from CABPK's handling of
Machine. For those not familiar, there's a short 15-minute bootstrap token TTL to support infrastructure
provisioning, which is periodically extended if infrastructure provisioning doesn't complete within
the TTL.
Notably, MachinePool will rely on this TTL extension for scale-up operations that occur after the
initial TTL has elapsed.

In the future, bootstrap token handling might change once we [have a good story for injecting
secrets](https://github.com/kubernetes-sigs/cluster-api/issues/1739).

### Risks and Mitigations

#### MachinePool type might not cover all potential infrastructure providers

MachinePool is initially designed to encompass commonality across AWS, GCP, and Azure. Adopting
a provider-agnostic scaling type early in CAPI will allow other providers to give feedback on evolving
the type before the beta and GA milestones, where API changes become more difficult.

##### Infrastructure Provider Features Required for MachinePool v1alpha3

- Target capacity for the set/group
- Rolling update parameters

##### Infrastructure Provider Features Potentially Required for MachinePool post-v1alpha3

- Min/Max machine replicas in a set/group (autoscaling)

## Alternatives

Infrastructure Machine controllers could allocate machines from infrastructure provider specific
scale group/set resources. This approach would capture some of the benefits of provider specific
scale groups/sets, but it would be complex to manage.

## Upgrade Strategy

N/A, as this proposal only adds new types.

## Additional Details

### Test Plan

TBD

### Graduation Criteria

TBD

## Drawbacks

### Infrastructure Provider Behavior Differences

Subtle differences in how infrastructure provider scale resources are implemented might lead to an
inconsistent experience across providers.

## Implementation History

- 09/18/2019: Proposed idea during cluster-api f2f
- 10/10/2019: Compiled a Google Doc following the CAEP template
- 10/23/2019: First round of feedback from community
- 10/23/2019: Presented proposal at a community meeting
- 10/31/2019: Opened proposal PR