---
title: Automate AKS Features Available in CAPZ
authors:
  - "@nojnhuh"
reviewers:
  - "@CecileRobertMichon"
  - "@matthchr"
  - "@dtzar"
  - "@mtougeron"
creation-date: 2023-11-22
last-updated: 2023-11-28
status: provisional
see-also:
  - "docs/proposals/20230123-azure-service-operator.md"
---

# Automate AKS Features Available in CAPZ

## Table of Contents

<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->

- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals/Future Work](#non-goalsfuture-work)
- [Proposal](#proposal)
  - [User Stories](#user-stories)
    - [Story 1](#story-1)
    - [Story 2](#story-2)
    - [Story 3](#story-3)
  - [API Design Options](#api-design-options)
    - [Option 1: CAPZ resource references an existing ASO resource](#option-1-capz-resource-references-an-existing-aso-resource)
    - [Option 2: CAPZ resource references a non-functional ASO "template" resource](#option-2-capz-resource-references-a-non-functional-aso-template-resource)
    - [Option 3: CAPZ resource defines an entire unstructured ASO resource inline](#option-3-capz-resource-defines-an-entire-unstructured-aso-resource-inline)
    - [Option 4: CAPZ resource defines an entire typed ASO resource inline](#option-4-capz-resource-defines-an-entire-typed-aso-resource-inline)
    - [Option 5: No change: CAPZ resource evolution proceeds the way it currently does](#option-5-no-change-capz-resource-evolution-proceeds-the-way-it-currently-does)
    - [Option 6: Generate CAPZ code equivalent to what's added manually today](#option-6-generate-capz-code-equivalent-to-whats-added-manually-today)
    - [Option 7: CAPZ resource defines patches to ASO resource](#option-7-capz-resource-defines-patches-to-aso-resource)
    - [Option 8: Users bring-their-own ASO ManagedCluster resource](#option-8-users-bring-their-own-aso-managedcluster-resource)
    - [Decision](#decision)
  - [Security Model](#security-model)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Upgrade Strategy](#upgrade-strategy)
- [Additional Details](#additional-details)
  - [Test Plan](#test-plan)
  - [Graduation Criteria](#graduation-criteria)
  - [Version Skew Strategy](#version-skew-strategy)
- [Implementation History](#implementation-history)

<!-- END doctoc generated TOC please keep comment here to allow auto update -->

## Summary

CAPZ's AzureManagedControlPlane and AzureManagedMachinePool resources expose AKS's managed cluster and agent
pool resources for Cluster API. Currently, new features in AKS require manual changes to teach CAPZ about
those features for users to be able to take advantage of them natively in Cluster API. As a result, there are
several AKS features available that cannot be used from Cluster API. This proposal describes how CAPZ will
automatically make all AKS features available on an ongoing basis with minimal maintenance.

## Motivation

Historically, CAPZ has exposed an opinionated subset of AKS features that are tested and known to work within
the Cluster API ecosystem. Since then, it has become increasingly clear that new AKS features are generally
suitable to implement in CAPZ and users are interested in having all AKS features available to them from
Cluster API.

When gaps exist in the set of features available in AKS and what CAPZ offers, users of other infrastructure
management solutions may not be able to adopt CAPZ. If all AKS features could be used from CAPZ, this would
not be an issue.

The AKS feature set changes rapidly alongside CAPZ's users' desire to utilize those features.
Because making new AKS features available in CAPZ requires a considerable amount of mechanical, manual effort
to implement, review, and test, requests for new AKS features account for a large portion of the cost to
maintain CAPZ.

Another long-standing feature request in CAPZ is the ability to adopt existing AKS clusters into management by
Cluster API (https://github.com/kubernetes-sigs/cluster-api-provider-azure/issues/1173). Making the entire AKS
API surface area available from CAPZ is required to enable this so existing clusters' full representations can
be reflected in Cluster API.

### Goals

- Narrow the gap between the sets of features offered by AKS and CAPZ.
- Reduce the maintenance cost of making new AKS features available from CAPZ.
- Preserve the behavior of existing CAPZ AKS definitions while allowing users to utilize the new API pattern
  iteratively to use new features not currently implemented in CAPZ.

### Non-Goals/Future Work

- Automate the features available from other Azure services besides AKS.
- Automatically modify AKS cluster definitions to transparently enable new AKS features.

## Proposal

### User Stories

#### Story 1

As a managed cluster user, I want to be able to use all available AKS features natively from CAPZ so that I
can more consistently manage my CAPZ AKS clusters that use advanced or niche features more quickly than having
to wait for each of them to be implemented in CAPZ.

#### Story 2

As an AKS user looking to adopt Cluster API over an existing infrastructure management solution, I want to be
able to use all AKS features natively from CAPZ so that I can adopt Cluster API with the confidence that all
the AKS features I currently utilize are still supported.

#### Story 3

As a CAPZ developer, I want to be able to make new AKS features available from CAPZ more easily in order to
meet user demand.

### API Design Options

There are a few different ways the entire AKS API surface area could be exposed from the CAPZ API. The
following options all rely on ASO's ManagedCluster and ManagedClustersAgentPool resources to define the full
AKS API. The examples below use AzureManagedControlPlane and ManagedCluster to help illustrate, but all of the
same ideas should also apply to AzureManagedMachinePool and ManagedClustersAgentPool.

#### Option 1: CAPZ resource references an existing ASO resource

Here, the AzureManagedControlPlane spec would include only a field that references an ASO ManagedCluster
resource:

```go
type AzureManagedControlPlaneSpec struct {
	// ManagedClusterRef is a reference to the ASO ManagedCluster backing this AzureManagedControlPlane.
	ManagedClusterRef corev1.ObjectReference `json:"managedClusterRef"`
}
```

CAPZ will _not_ create this ManagedCluster and instead rely on it being created by any other means. CAPZ's
`aks` flavor template will be updated to include a ManagedCluster to fulfill this requirement. CAPZ will also
not modify the ManagedCluster except to fulfill CAPI contracts, such as managing replica count. Users modify
other parameters which are not managed by CAPI on the ManagedCluster directly. Users should be fairly familiar
with this pattern since it is already used extensively throughout CAPI, e.g. by modifying a virtual machine
through its InfraMachine resource for parameters not defined on the Machine resource.

This approach has two key benefits. First, it can leverage ASO's conversion webhooks to allow CAPZ to interact
with the ManagedCluster through one version of the API, and users to use a different API version, including
newer or preview (https://github.com/kubernetes-sigs/cluster-api-provider-azure/issues/2625) API versions.
Second, since ASO can [adopt existing Azure
resources](https://azure.github.io/azure-service-operator/guide/frequently-asked-questions/#how-can-i-import-existing-azure-resources-into-aso),
adopting existing AKS clusters into CAPZ
(https://github.com/kubernetes-sigs/cluster-api-provider-azure/issues/1173) additionally would only require
the extra steps to create the AzureManagedControlPlane referring to the adopted ManagedCluster.

One other consideration is that this approach trades the requirement that CAPZ has the necessary RBAC
permissions to create ManagedClusters for users having that same capability. CAPZ would still require
permissions to read, update, and delete ManagedClusters.

Drawbacks with this approach include:
- A requirement to define ASO resources in templates in addition to the existing suite of CAPI/CAPZ resources
  (one ASO ManagedCluster and one ASO ManagedClustersAgentPool for each MachinePool).
- Inconsistency between how CAPZ manages some resources through references to pre-created ASO objects (the
  managed cluster and agent pools) and some by creating the ASO objects on its own (resource group, virtual
  network, subnets).
- An increased risk of users conflicting with CAPZ if users and CAPZ are both expected to modify mutually
  exclusive sets of fields on the same resources.
- A new inability for CAPZ to automatically make adjustments to AKS-specific parameters that we may determine
  are worthwhile to apply on behalf of users.

Other main roadblocks with this method relate to ClusterClass. If CAPZ's AzureManagedControlPlane controller is
not responsible for creating the ASO ManagedCluster resource, then users would need to manage those
separately, defeating much of the purpose of ClusterClass.
Additionally, since each AzureManagedControlPlane will be referring to a distinct ManagedCluster, the new
`ManagedClusterRef` field should not be defined in an AzureManagedControlPlaneTemplate. Upstream CAPI
components could possibly be changed to better enable this use case by allowing templating of arbitrary
Kubernetes resources alongside other cluster resources.

#### Option 2: CAPZ resource references a non-functional ASO "template" resource

This method is similar to [Option 1]. To better enable ClusterClass without changes to CAPI, instead of
defining a full reference to the ManagedCluster resource, a reference to a "template" resource is defined
instead:

```go
type AzureManagedControlPlaneClassSpec struct {
	// ManagedClusterTemplateRef is a reference to the ASO ManagedCluster to be used as a template from which
	// new ManagedClusters will be created.
	ManagedClusterTemplateRef corev1.ObjectReference `json:"managedClusterTemplateRef"`
}
```

This template resource will be a ManagedCluster used as a base from which the AzureManagedControlPlane
controller will create a new ManagedCluster. The template ManagedCluster will have ASO's [`skip` reconcile
policy](https://azure.github.io/azure-service-operator/guide/annotations/#serviceoperatorazurecomreconcile-policy)
applied so it does not result in any AKS resource being created in Azure. The ManagedClusters created based on
the template will be reconciled normally to create AKS resources in Azure. The non-template ManagedClusters
will be linked to the AzureManagedControlPlane through the standard `cluster.x-k8s.io/cluster-name` label.

To modify parameters on the AKS cluster, either the template or non-template ManagedCluster may be updated.
CAPZ will propagate changes made to a template to instances of that template.
Parameters defined on the template take precedence over the same parameters on the instances.

The main difference with [Option 1] that enables ClusterClass is that the same ManagedCluster template
resource can be referenced by multiple AzureManagedControlPlanes, so this new `ManagedClusterTemplateRef`
field can be defined on the AzureManagedControlPlaneClassSpec so a set of AKS parameters defined once in the
template can be applied to all Clusters built from that ClusterClass.

This method makes all ManagedCluster fields available to define in a template, which could lead to
misconfiguration if certain parameters that must be unique to a cluster are erroneously shared through a
template. Since those fields cannot be automatically identified and may evolve between AKS API versions, CAPZ
will not attempt to categorize ASO ManagedCluster fields that way, as it does with the fields present in and
omitted from the `AzureClusterClassSpec` type, for example. CAPZ could document a best-effort list of known
fields which could or should not be defined in template types and will otherwise rely on AKS to provide
reasonable error messages for misconfigurations.

Like [Option 1], this method keeps particular versions of CAPZ decoupled from particular API versions of ASO
resources (including allowing preview versions) and opens the door for streamlined adoption of existing AKS
clusters.

#### Option 3: CAPZ resource defines an entire unstructured ASO resource inline

This method is functionally equivalent to [Option 2] except that the template resource is defined inline
within the AzureManagedControlPlane:

```go
type AzureManagedControlPlaneClassSpec struct {
	// ManagedClusterTemplate is the ASO ManagedCluster to be used as a template from which new
	// ManagedClusters will be created.
	ManagedClusterTemplate map[string]interface{} `json:"managedClusterTemplate"`
}
```

One variant of this method could be to use `string` instead of `map[string]interface{}` for the template
type, though that would make defining patches unwieldy (like for ClusterClass).

Compared to [Option 2], this method loses schema and webhook validation that would be performed by ASO when
creating a separate ManagedCluster to serve as a template. That validation would still be performed when CAPZ
creates the ManagedCluster resource, but that would be some time after the AzureManagedControlPlane is created
and error messages may not be quite as visible.

#### Option 4: CAPZ resource defines an entire typed ASO resource inline

This method is functionally equivalent to [Option 3] except that the template field's type is exactly the same
as an ASO ManagedCluster's spec:

```go
type AzureManagedControlPlaneClassSpec struct {
	// ManagedClusterSpec defines the spec of the ASO ManagedCluster managed by this AzureManagedControlPlane.
	ManagedClusterSpec v1api20230201.ManagedCluster_Spec `json:"managedClusterSpec"`
}
```

This method allows CAPZ to leverage schema validation defined by ASO's CRDs upon AzureManagedControlPlane
creation, but would still lose any further webhook validation done by ASO unless CAPZ can invoke that itself.

It also has the drawback that one version of CAPZ is tied directly to a single AKS API version. The spec could
potentially contain separate fields for each API version and enforce in webhooks that only one API version is
being used at a time. Alternatively, users may set fields only present in a newer API version directly on the
ManagedCluster after creation (if allowed by AKS) because CAPZ will not override user-provided fields for
which it does not have its own opinion on ASO resources.
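
The idea of per-version fields guarded by a webhook could look roughly like the following minimal sketch. The
field names `V20230201` and `V20231001`, the stand-in spec types, and the `validateSingleAPIVersion` helper are
all hypothetical and are not part of CAPZ or ASO. A real implementation would presumably embed ASO's generated
`ManagedCluster_Spec` types and run an equivalent check from an admission webhook.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for ASO's versioned ManagedCluster_Spec types.
type managedClusterSpec20230201 struct{ Location string }
type managedClusterSpec20231001 struct{ Location string }

// AzureManagedControlPlaneSpec is an illustrative shape with one optional
// field per embedded API version. The field names are assumptions.
type AzureManagedControlPlaneSpec struct {
	V20230201 *managedClusterSpec20230201 `json:"v20230201,omitempty"`
	V20231001 *managedClusterSpec20231001 `json:"v20231001,omitempty"`
}

// validateSingleAPIVersion mimics the webhook check described above:
// exactly one versioned spec may be set at a time.
func validateSingleAPIVersion(spec AzureManagedControlPlaneSpec) error {
	set := 0
	if spec.V20230201 != nil {
		set++
	}
	if spec.V20231001 != nil {
		set++
	}
	if set == 0 {
		return errors.New("one versioned ManagedCluster spec must be set")
	}
	if set > 1 {
		return fmt.Errorf("only one versioned ManagedCluster spec may be set, found %d", set)
	}
	return nil
}

func main() {
	valid := AzureManagedControlPlaneSpec{
		V20230201: &managedClusterSpec20230201{Location: "eastus"},
	}
	both := valid
	both.V20231001 = &managedClusterSpec20231001{Location: "eastus"}

	fmt.Println("single version:", validateSingleAPIVersion(valid))
	fmt.Println("two versions:", validateSingleAPIVersion(both))
}
```

A check like this only guards against mixing versions within one object. It does not address converting
stored data when the embedded ASO API version changes.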

Updating the embedded ASO API version in the CAPZ resources may not be possible to do safely without also
bumping the CAPZ API version, however. Because ASO implements conversion webhook logic between several API
versions for each AKS resource type, simply bumping the ASO API version in the CAPZ type without bumping the
CAPZ API version would not allow that same conversion to be applied. This could lead to issues where a new
version of CAPZ suddenly starts constructing invalid ASO resources and user intervention is required to
perform the conversion manually.

While it couples CAPZ to one ASO API version, this approach allows CAPZ to move at a more calculated pace with
regard to AKS API versions, the way it's done today. This also narrows CAPZ's scope of responsibility, which
reduces CAPZ's exposure to potential incompatibilities with certain ASO API versions.

Regarding ClusterClass, this option functions the same as [Option 2] or [Option 3], where all ASO fields can
be defined in a template. This option opens up an additional safeguard though, where webhooks could flag
fields which should not be defined in a template. Similar webhook checks would be less practical in the other
options where CAPZ would need to be aware of the set of disallowed fields for each ASO API version that a user
could use.

Similarly, CAPZ's webhooks are also better able to validate and default the ASO configuration and ensure
fields like ManagedCluster's `spec.owner` that should not be modified by users are set correctly.

#### Option 5: No change: CAPZ resource evolution proceeds the way it currently does

This method describes not making any changes to how the CAPZ API is generally structured for AKS resources.
CAPZ API types will continue to be curated manually without inheriting anything from the ASO API.

Benefits of continuing on our current path include:
- Familiarity with the existing pattern by users and contributors
- Zero up-front cost to implement or transition to a new pattern
- No requirement for users to have ASO knowledge
- Greater freedom to change API implementations which we've recently leveraged to transition between the older
  and newer Azure SDKs and to ASO.

#### Option 6: Generate CAPZ code equivalent to what's added manually today

For this option, a new code generation pipeline would be created to automatically scaffold the code that is
currently manually written to expose additional AKS API fields to CAPZ.

Once implemented, this method would drastically reduce the amount of developer effort to get started adding
new AKS features. There would continue to be some amount of ongoing cost, though, to identify and handle
nuances for each feature, which may include issues outside of CAPZ's control like AKS API quirks.

The main drawback of this approach from a developer perspective is the up-front effort to implement and
ongoing cost to maintain the code generation itself. The existing AKS feature set exposed by CAPZ provides a
decent foundation that can help identify regressions by testing the pipeline against features already
implemented and tested. ASO also already implements a full code generation pipeline to transform Azure API
specs into Kubernetes resource definitions, so some of that could possibly be reused by CAPZ.

#### Option 7: CAPZ resource defines patches to ASO resource

This method describes adding a new `spec.asoManagedClusterPatches` field to the existing
AzureManagedControlPlane:

```go
type AzureManagedControlPlaneClassSpec struct {
	...

	// ASOManagedClusterPatches defines patches to be applied to the generated ASO ManagedCluster resource.
	ASOManagedClusterPatches []string `json:"asoManagedClusterPatches,omitempty"`
}
```

After CAPZ calculates the ASO ManagedCluster for a given AzureManagedControlPlane during a reconciliation, it
will apply these patches in order to the ManagedCluster before ultimately submitting that to ASO. This allows
users to specify any AKS feature declaratively within the existing CAPZ spec. The exact format of the patches
is TBD, but likely one or all of the formats described here:
https://kubernetes.io/docs/tasks/manage-kubernetes-objects/update-api-object-kubectl-patch/

The main drawback of this approach is the fragility of the patches themselves, which can break underneath
users with a new version of ASO or CAPZ. There is also increased risk that a user could modify the ASO
resource in a way that breaks CAPZ's ability to reconcile it, though CAPZ could perhaps add some extra
validation that particularly sensitive fields do not get modified by a set of patches. Compared to other
options, the syntax for specifying a patch is also more cumbersome than if its equivalent had its own CAPZ API
field. Given the intent of setting patches is to enable niche use-cases for advanced users, these drawbacks
may be acceptable.

#### Option 8: Users bring-their-own ASO ManagedCluster resource

CAPZ can already incorporate existing ASO resources into Cluster API Cluster configurations. BYO ASO resources
are never modified or deleted by CAPZ directly but are still read and play into the status of CAPZ resources.

The flow for this method is roughly:
1. User creates an ASO ManagedCluster before or at the same time as the other Cluster API and CAPZ resources.
1. Any updates to the AKS cluster are done by the user through the ASO resource.
1. The user may control whether or not the ASO ManagedCluster gets deleted with the rest of the Cluster API
   Cluster by choosing whether or not to pre-create the ASO ResourceGroup.

Benefits of this approach:
- Minimal or zero additional cost for CAPZ to implement outside of a new test. Users can try this today with
  the latest version of CAPZ.
- Freedom for users to craft the ASO ManagedCluster and other resources exactly how they like, even a
  different API version (including preview).
- Protection from CAPZ being responsible for managing AKS configuration that it does not understand.

Drawbacks:
- Requires users to mimic CAPZ and create the ManagedCluster with `spec.agentPoolProfiles` as required by AKS,
  then remove them once created so as not to conflict with the corresponding AzureManagedMachinePool
  definitions, which can't be done in a single install operation.
- Requires users to rework templates to add ASO resources, but only if they want AKS features not exposed by
  CAPZ.
- Doesn't support ClusterClass.
- CAPZ may no longer be able to fulfill CAPI's contract by modifying fields like Kubernetes version or replica
  counts as defined by a MachinePool. Internal changes to CAPZ may be able to overcome this.

#### Decision

We are moving forward with [Option 7] for now as it requires the least amount of change to the CAPZ API and
does not introduce any significant usability issues while still allowing users to declaratively enable
features outside of those explicitly defined by CAPZ.

### Security Model

One possible concern is that with little oversight as to what fields defined in the AKS API ultimately become
exposed in the CAPZ API, bad actors may become able to modify certain sensitive configuration by way of CAPZ's
AzureManagedControlPlane which was not possible before.
However, CAPZ has historically not forbidden workload cluster administrators from modifying any such sensitive
configuration.

Overall, none of the approaches outlined above change what data is ultimately represented in the API, only the
higher-level shape of the API. That means there is no further transport or handling of secrets or other
sensitive information beyond what CAPZ already does.

### Risks and Mitigations

Increasing CAPZ's reliance on ASO and exposing ASO to users at the API level further increases the risk
[previously discussed](https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/ce3c130266b23a8b67aa5ef9a21f257ff9e6d63e/docs/proposals/20230123-azure-service-operator.md?plain=1#L169)
that since ASO has not yet been proven to be as much of a staple as the other projects that manage
infrastructure on Azure, ASO's lifespan may be more limited than others. If ASO were to sunset while CAPZ
still relies on it, CAPZ would have to rework its APIs. This risk is mitigated by the fact that no
announcements have yet been made regarding ASO's end-of-life and the project continues to be very active. And
since ASO's resource representations are mostly straightforward reflections of the Azure API spec, shifting
away from ASO to another Azure API abstraction should be mostly mechanical.

## Upgrade Strategy

Each of the first four [options above](#api-design-options) would exist in a new v2alpha1 CAPZ API version for
AzureManagedControlPlane and AzureManagedMachinePool. The existing v1beta1 types will continue to be served so
users do not need to take any action for their existing clusters to continue to function as they have been.

An alternative available with [Option 3] and [Option 4] is to introduce a new backwards-compatible CAPZ API
version for AzureManagedControlPlane and AzureManagedMachinePool, such as v1beta2.
Then, conversion webhooks can be implemented to convert between v1beta1 and v1beta2. This option isn't
possible with [Option 1] or [Option 2] because a conversion webhook would not be able to create the new
standalone ASO resources.

## Additional Details

### Test Plan

Existing end-to-end tests will verify that CAPZ's current behavior does not regress. New tests will verify
that the new API fields proposed here behave as expected.

### Graduation Criteria

Any new CAPZ API versions or added API fields would be available and enabled by default as soon as they are
functional and stable enough for users to try. Requirements for a v2alpha1 to v2beta1 graduation and
deprecation of the v1 API are TBD.

### Version Skew Strategy

With options 1-3 above, users are free to use any ASO API versions for AKS resources. CAPZ may internally
operate against a different API version and ASO's webhooks will transparently perform any necessary conversion
between versions.

## Implementation History

- [ ] 06/15/2023: Issue opened: https://github.com/kubernetes-sigs/cluster-api-provider-azure/issues/3629
- [ ] 11/22/2023: Iteration begins on this proposal document
- [ ] 11/28/2023: First complete draft of this document is made available for review

[Option 1]: #option-1-capz-resource-references-an-existing-aso-resource
[Option 2]: #option-2-capz-resource-references-a-non-functional-aso-template-resource
[Option 3]: #option-3-capz-resource-defines-an-entire-unstructured-aso-resource-inline
[Option 4]: #option-4-capz-resource-defines-an-entire-typed-aso-resource-inline
[Option 5]: #option-5-no-change-capz-resource-evolution-proceeds-the-way-it-currently-does
[Option 7]: #option-7-capz-resource-defines-patches-to-aso-resource
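
As a supplementary illustration of the patch flow chosen in the Decision (Option 7), the following minimal
sketch applies a list of patches, in order, to an unstructured resource. It assumes the patches are JSON merge
patches (RFC 7386), which is only one of the candidate formats since the proposal leaves the format TBD. The
`applyPatches` and `mergePatch` helpers, the simplified resource, and the `sku` values are hypothetical and do
not reflect CAPZ's actual implementation.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// mergePatch applies an RFC 7386 JSON merge patch to target and returns the
// result. Object targets are modified in place.
func mergePatch(target, patch interface{}) interface{} {
	patchObj, ok := patch.(map[string]interface{})
	if !ok {
		// A non-object patch replaces the target wholesale.
		return patch
	}
	targetObj, ok := target.(map[string]interface{})
	if !ok {
		targetObj = map[string]interface{}{}
	}
	for key, value := range patchObj {
		if value == nil {
			// A null value removes the key from the target.
			delete(targetObj, key)
			continue
		}
		targetObj[key] = mergePatch(targetObj[key], value)
	}
	return targetObj
}

// applyPatches applies each patch in order, so later patches win on conflict.
func applyPatches(resource map[string]interface{}, patches []string) (map[string]interface{}, error) {
	var current interface{} = resource
	for i, raw := range patches {
		var patch interface{}
		if err := json.Unmarshal([]byte(raw), &patch); err != nil {
			return nil, fmt.Errorf("patch %d is not valid JSON: %w", i, err)
		}
		current = mergePatch(current, patch)
	}
	result, ok := current.(map[string]interface{})
	if !ok {
		return nil, errors.New("patches must produce a JSON object")
	}
	return result, nil
}

func main() {
	// Simplified stand-in for the ManagedCluster that CAPZ would compute.
	managedCluster := map[string]interface{}{
		"spec": map[string]interface{}{"location": "eastus"},
	}
	patches := []string{
		`{"spec": {"sku": {"tier": "Standard"}}}`,
		`{"spec": {"sku": {"tier": "Premium"}}}`, // applied second, so it wins
	}
	patched, err := applyPatches(managedCluster, patches)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	out, _ := json.MarshalIndent(patched, "", "  ")
	fmt.Println(string(out))
}
```

Because the patches are applied in order, a later patch can override an earlier one, and a patch can also
remove or overwrite fields that CAPZ itself set. This is the fragility that the discussion of Option 7 calls
out.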