---
title: Cluster API Runtime SDK
authors:
  - "@fabriziopandini"
  - "@sbueringer"
  - "@vincepri"
reviewers:
  - "@CecileRobertMichon"
  - "@enxebre"
  - "@ykakarap"
  - "@killianmuldoon"
  - "@shysank"
  - "@devigned"
  - "@alexander-demichev"
creation-date: 2022-02-21
last-updated: 2022-04-01
status: implementable
see-also:
  -
replaces:
  -
superseded-by:
  -
---

# Cluster API Runtime SDK

## Table of Contents

<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->

- [Glossary](#glossary)
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
  - [Future Work](#future-work)
- [Proposal](#proposal)
  - [User Stories](#user-stories)
  - [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints)
    - [Cluster API Runtime Hooks vs Kubernetes admission webhooks](#cluster-api-runtime-hooks-vs-kubernetes-admission-webhooks)
    - [Runtime SDK rules](#runtime-sdk-rules)
- [Runtime Extensions developer guide](#runtime-extensions-developer-guide)
  - [Registering Runtime Extensions](#registering-runtime-extensions)
- [Runtime Hooks developer guide (CAPI internals)](#runtime-hooks-developer-guide-capi-internals)
  - [Runtime hook implementation](#runtime-hook-implementation)
  - [Discovering Runtime Extensions](#discovering-runtime-extensions)
  - [Calling Runtime Extensions](#calling-runtime-extensions)
- [Security Model](#security-model)
- [Risks and Mitigations](#risks-and-mitigations)
- [Alternatives](#alternatives)
- [Upgrade Strategy](#upgrade-strategy)
- [Additional Details](#additional-details)
  - [Test Plan](#test-plan)
  - [Graduation Criteria](#graduation-criteria)
  - [Version Skew Strategy](#version-skew-strategy)
- [Annex](#annex)
  - [Runtime SDK rules](#runtime-sdk-rules-1)
  - [Discovery hook](#discovery-hook)
- [Implementation History](#implementation-history)

<!-- END doctoc generated TOC please keep comment here to allow auto update -->

## Glossary

Refer to the [Cluster API Book Glossary](https://cluster-api.sigs.k8s.io/reference/glossary.html).

- **Cluster API Runtime**: identifies the Cluster API execution model, a set of controllers cooperating in managing the
  workload cluster’s lifecycle.
- **Runtime SDK**: a set of rules, recommendations and fundamental capabilities required to develop Runtime Hooks and
  Runtime Extensions.
- **Runtime Hook**: a single, well identified, extension point allowing applications built on top of Cluster API to hook
  into specific moments of the workload cluster’s lifecycle, e.g. `BeforeClusterUpgrade`, `BeforeMachineRemediation`.
- **Runtime Extension**: an external component which is part of a system/product built on top of Cluster API that can
  handle requests for a specific Runtime Hook.
- **Runtime Extension Provider**: a project that provides a runtime extension and the yaml for installing it as part of
  its release artefacts.

## Summary

This proposal introduces the Cluster API Runtime SDK, a set of rules, recommendations, and fundamental capabilities
required to implement a new extensibility mechanism that allows systems, products, and services built on top of
Cluster API to hook into a workload cluster’s lifecycle.

## Motivation

Extensibility is at the core of Cluster API.

CAPI extensibility originally was designed to allow infrastructure providers to offer their services via the Cluster API
declarative model; over time the same model has been extended to support bootstrap providers, control plane providers,
and more recently external remediation strategies.

But this is not enough anymore.

All the above extensibility points are about allowing plug-in, swappable “low-level” components required to
provision/manage a Kubernetes cluster with Cluster API.
Instead, with the growing adoption of Cluster API as a common layer to manage fleets of Kubernetes Clusters, there is
now a new category of systems, products and services built on top of Cluster API that require strict interactions
with the lifecycle of Clusters, but at the same time they do not want to replace any “low-level” components in
Cluster API, because they happily benefit from all the features available in the existing providers (built on top vs
plug-in/swap).

A common approach for this problem has been to watch for Cluster API resources; another approach has been to implement
API Server admission webhooks to alter CAPI resources, but both approaches are limited by the fact that the system
built on top of Cluster API is forced to treat it as an opaque system, and thus has limited visibility and an almost
total lack of control, e.g. you can watch a Machine being provisioned, but you cannot block the provisioning from
starting if a quota management system signals you have exhausted all the resources assigned to you.

A stop-gap solution to this problem has been introduced in Cluster API with the implementation of machine deletion
hooks, but this approach is tightly linked to the specific use case and it cannot be re-used in other contexts or in
other lifecycle moments.

This proposal aims to solve the above problem in a more structured and generic way, by introducing the Runtime SDK,
a set of rules, recommendations and fundamental capabilities required to implement a new extensibility mechanism
that will allow systems, products and services built on top of Cluster API to hook into the workload cluster’s
lifecycle.

The key elements of the above extensibility mechanism are Runtime Hooks and Runtime Extensions.

Runtime Hooks and Runtime Extensions are designed to be powerful and flexible, and _by opportunity_ it will also be
possible to use this capability to let the user hook into Cluster API reconcile loops at "low level", e.g.
by allowing a Runtime Extension providing external patches to be executed on every topology reconcile.

### Goals

To define the Runtime SDK and more specifically:

- To define the rules ensuring Runtime Hooks can evolve over time:
  - When/how to create a new version;
  - When/how to modify the current version;
  - When/how to deprecate an old version, as well as mechanisms to inform users about versions being deprecated;
  - When/how to drop an old version, as well as providing a mechanism to prevent users from upgrading Cluster API when
    doing so would break installed Runtime Extensions;
- To define the fundamental capabilities/tooling to be implemented in CAPI in order to allow the implementation of
  Runtime Hooks.
- To provide an initial set of guidelines for Runtime Extension developers.
- To define how external Runtime Extensions can be registered within the Cluster API Runtime.

### Non-Goals

- To identify or specify the list of Runtime Hooks that should be implemented; some examples of possible Runtime Hooks
  will be eventually provided, but it is not in the scope of this document to define them in detail;
- To replace controllers or any other component of the Cluster API Runtime (including infrastructure providers,
  bootstrap providers, control plane providers and the CRD/contract based extension mechanism they rely on).

### Future Work

- Identify and specify the list of Runtime Hooks to be implemented; this will be addressed iteratively by a set of
  future proposals, all of them building on top of the foundational capabilities introduced by this document;
- Eventually consider deprecation of machine deletion hooks and replacement with a Runtime Hook;
- Improve the Runtime Extension developer guide based on experience and feedback;
- Add metrics about Runtime Extension calls (usage, usage vs deprecated versions, duration, error rate etc.);
- Allow providers to use the same SDK to define their own hooks;
- Improve clusterctl to deploy and manage runtime extension providers.

## Proposal

### User Stories

- As a cluster operator I want to be able to execute a particular action in well-defined moments of the Workload
  Cluster’s lifecycle, e.g.
  - As a cluster operator I want to automatically install the external CPI addon Before Upgrading the Cluster.
  - As a cluster operator I want to automatically check my quota management systems Before Creating a cluster.
  - As a cluster operator I want to automatically run Kubernetes conformance tests After a Cluster upgrade completes.
  - As a cluster operator I want to automatically back up persistent volumes Before deleting a cluster.
- As a cluster operator I want to plug in a component that can provide externally generated patches while
  computing the Cluster topology (as a fully customizable alternative to inline JSON patches available in ClusterClass).

- As a developer building systems on top of Cluster API, I would like to have guarantees about the Runtime Extensions
  versions support, thus making it predictable and sustainable to keep up with new versions.

- As a developer building systems on top of Cluster API, I would like to implement a Runtime Extension in a
  simple way (simpler than writing controllers).

- As a developer building systems on top of Cluster API, I would like Runtime Extensions to provide a certain degree
  of control over the Cluster’s lifecycle, e.g. to block or defer the start of an operation (the exact definition of the
  kind of control each Runtime Extension can have must be part of the corresponding Runtime Hook definition).

- As a developer building systems on top of Cluster API using Golang as a development language, I would like to
  leverage sigs.k8s.io/cluster-api as a library to speed up/ensure consistency in the implementation of
  my Runtime Extensions.

- As a developer building systems on top of Cluster API, I would like to have a way to dynamically add/remove/replace
  my Runtime Extensions once they are deployed.

This proposal also considers a set of additional user stories from the PoV of the Cluster API project maintainers:

- As a Cluster API maintainer I would like to provide reliable guarantees about the Runtime Hooks version support,
  thus making it possible for the project to continue to evolve in a way that is predictable for the developers
  implementing Runtime Extensions.

- As a Cluster API maintainer I would like to have a set of tools, utilities and conventions making it possible
  to implement new Runtime Hooks quickly and consistently across the code base.

### Implementation Details/Notes/Constraints

The proposed solution is designed with the intent to make developing Runtime Extensions as simple as possible, because
the success of this feature depends on its speed/rate of adoption in the ecosystem.

Accordingly, the proposed solution relies on a well-known, battle tested integration pattern, RESTful APIs.
A nice side effect of this choice is the possibility to leverage a set of api-machinery tooling and practices the
Cluster API maintainers are well used to.

It is also important to note that the model based on Runtime Hooks and Runtime Extensions implies two separate
personas being involved, each one with its own responsibilities in the process:

The Runtime SDK rules defined in this document are a critical element of the above split of responsibilities,
defining expectations for each of the above personas.

#### Cluster API Runtime Hooks vs Kubernetes admission webhooks

Runtime Hooks are inspired by Kubernetes admission webhooks, but there is one key difference that sets them apart:

- Admission webhooks are strictly linked to Kubernetes API Server/etcd **CRUD operations** e.g. Create or Update
  Cluster in etcd.
- Runtime Hooks can be used to define **arbitrary operations**, e.g. `BeforeClusterUpgrade`, `BeforeMachineRemediation` etc.

In other words, Runtime Hooks are not concerned about “low-level” details of how Kubernetes handles objects in the
API Server/etcd; Runtime Hooks instead focus on “high-level” events of a Cluster’s lifecycle.

Please note that, no matter the similarities in some parts of the design, users should not make assumptions about
Runtime Hooks having properties or behaviors typical of Kubernetes admission webhooks unless they are explicitly
defined in the following paragraphs.

#### Runtime SDK rules

As this proposal is based on RESTful APIs, we are using [OpenAPI Specification v3.0.0](https://swagger.io/specification/) [1]
to document Runtime Hooks supported by Cluster API.

More specifically, a single OpenAPI document providing the specification for all the Runtime Hooks supported by a
Cluster API release will be added to the release artifacts; users can rely on https://editor.swagger.io/ or similar
tools to view the specification, and during implementation we will consider adding a similar view to the Cluster API
book as well e.g.

Each Runtime Hook will be defined by one (or more) RESTful APIs implemented as a `POST` operation; each operation
is going to receive a request parameter as a request body, and return a response value as response body, both
`application/json` encoded and with a schema of arbitrary complexity that should be considered an integral part of
the Runtime Hook definition.

It is also worth noting that more than one version of the same Runtime Hook might be supported at the same time;
e.g. in the example above the `BeforeClusterUpgrade` Hook exists in version `v1alpha1` (old version)
and `v1alpha2` (current).

Supporting multiple versions at the same time is a requirement in order to:

- Allow Cluster API maintainers to continue to develop and evolve Runtime Hooks in a predictable way.
- Provide a well-defined set of guarantees to Runtime Extension implementers they can rely on while developing
  solutions on top of CAPI.

In their simplest form, the guarantees for Runtime Hook versions are:

- Once a Runtime Hook version has been published, breaking changes are not allowed without bumping to a new version
  (e.g. fields removal/renaming).
- Before removing a Runtime Hook version, a deprecation period should be respected, with a duration depending
  on the maturity of the API itself (12 Months/3 Versions GA, 6M/2V beta, 0 alpha).

The formal definition of the Runtime SDK rules, derived from https://kubernetes.io/docs/reference/using-api/deprecation-policy/,
can be found in the annex at the end of the document. Please note that during implementation we will consider
mechanisms to:

- Inform admins about Runtime Extensions using a deprecated version of a Runtime Hook (e.g. return a well known
  HTTP header, set a condition on the ExtensionConfig object defined in the following paragraphs,
  webhook warnings on ExtensionConfig create/update).
- Prevent upgrades to new Cluster API versions that would make configured Runtime Extensions non-functional due to
  the expiration of the deprecation period (e.g. implement a preflight check in the `clusterctl upgrade` command
  or a validation webhook, if possible).

[1] This is the most recent OpenAPI Specification supported by https://github.com/kubernetes/kube-openapi

## Runtime Extensions developer guide

The following sections have been moved to the Cluster API book to avoid duplication:

* [Implementing Runtime Extensions](../../docs/book/src/tasks/experimental-features/runtime-sdk/implement-extensions.md)
* [Deploying Runtime Extensions](../../docs/book/src/tasks/experimental-features/runtime-sdk/implement-extensions.md)

### Registering Runtime Extensions

_Important! Cluster administrators should carefully vet any Runtime Extension registration, thus preventing
malicious components from being added to the system._

_Creating ExtensionConfigs will be allowed only if the Runtime Extension feature flag is set to true._

By registering a Runtime Extension the Cluster API Runtime becomes aware of a Runtime Extension implementing a
Runtime Hook, and as a consequence the runtime starts calling the extension at well-defined moments of the
workload cluster’s lifecycle.

This process has many similarities with registering dynamic webhooks in Kubernetes, but some specific
behavior is introduced by this proposal.

The Cluster administrator is required to register each available Runtime Extension server using the following CR:

```yaml
apiVersion: runtime.cluster.x-k8s.io/v1alpha1
kind: ExtensionConfig
metadata:
  name: "my-amazing-extensions"
spec:
  clientConfig:
    # `url` gives the location of the RuntimeExtension, in standard URL form (`scheme://host:port/path`). Exactly one of `url` or `service` must be specified.
    url: "..."
    service:
      namespace: "example-namespace"
      name: "example-service"
      # `path` is an optional URL path which will be sent in any request to this service.
      path: "runtime-extensions/"
      # If specified, the port on the service that hosts the RuntimeExtension. Defaults to 443. `port` should be a valid port number (1-65535, inclusive).
      port: 8082
    caBundle: "..."
  # NamespaceSelector decides whether to run the webhook on a Cluster based on whether the namespace for that Cluster matches the selector.
  # If not specified, the WebHook runs for all the namespaces.
  namespaceSelector: {}
  # settings is a map[string]string which is sent with each request to a Runtime Extension. These settings can be used
  # to modify the behaviour of a Runtime Extension.
  settings: {}
```

Once the extension is registered the [discovery hook](#discovery-hook) is called and the above CR is updated with the list
of the Runtime Extensions supported by the server. The ExtensionConfig is Cluster scoped, meaning it has no namespace.
The `namespaceSelector` will enable targeting of a subset of Clusters.

```yaml

apiVersion: runtime.cluster.x-k8s.io/v1alpha1
kind: ExtensionConfig
metadata:
  name: "my-amazing-extensions"
spec:
  ...
status:
  handlers: ## Details of supported Runtime Extensions
    - name: "http-proxy.my-amazing-extensions" # unique name, computed
      requestHook:
        apiVersion: "hooks.runtime.cluster.x-k8s.io/v1alpha1"
        hook: "generatePatches"
      timeoutSeconds: 5 # Timeout to be used when calling the extension. Max timeout allowed 10s.
      failurePolicy: Fail # FailurePolicy defines how unrecognized errors from the admission endpoint are handled - allowed values are Ignore or Fail. Defaults to Fail.
    - ...
  conditions:
    ...
```

As you can see, each Runtime Extension is given a unique identifier that can be used to reference it from other
parts of the system, e.g. from ClusterClass. Additionally, the status documents the exact reference to the hook/version
the Runtime Extension is implementing as well as the failurePolicy and the timeout the system should use when
calling the extension.

If consensus is reached, or in a follow-up iteration, we will consider adding support for defining
Runtime Extensions that apply to a subset of Clusters/objects only, by adding to the CR used for registration the
following field:

```yaml
  # ObjectSelector decides whether to run the webhook on objects (e.g. Clusters) based on whether the Cluster object matches the selector.
  # If not specified, the WebHook runs for all the objects.
  objectSelector:
    ...

```

Instead, unless there's a strong and evident need for it, we are not considering adding support for defining
dependencies among Runtime Extensions, whether modeled with something similar to
[systemd unit options](https://www.freedesktop.org/software/systemd/man/systemd.unit.html) or with alternative approaches.

The main reason behind that is that this type of feature introduces complexity and creates "pet" like relations across
components, making the overall system more fragile. This is also consistent with the [avoid dependencies](#avoid-dependencies)
recommendation above.
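
The constraints stated in the comments of the `ExtensionConfig` example above (exactly one of `url` or `service`, a
service port between 1 and 65535 that defaults to 443) could be checked along the following lines. This is only a
minimal sketch: `ClientConfig`, `ServiceReference` and `validateClientConfig` are hypothetical names introduced for
illustration and are not the actual Cluster API types or validation code.

```go
package main

import (
	"errors"
	"fmt"
)

// ServiceReference and ClientConfig are simplified, hypothetical stand-ins for the
// clientConfig section of the ExtensionConfig CR; they are not the real API types.
type ServiceReference struct {
	Namespace string
	Name      string
	Path      string
	Port      int32 // 0 means "not set"
}

type ClientConfig struct {
	URL     string
	Service *ServiceReference
}

// validateClientConfig enforces the constraints stated in the CR comments:
// exactly one of url or service, and a service port in 1-65535 (defaulting to 443).
func validateClientConfig(c *ClientConfig) error {
	hasURL := c.URL != ""
	hasService := c.Service != nil
	if hasURL == hasService {
		return errors.New("exactly one of url or service must be specified")
	}
	if hasService {
		if c.Service.Port == 0 {
			c.Service.Port = 443 // documented default
		}
		if c.Service.Port < 1 || c.Service.Port > 65535 {
			return fmt.Errorf("port %d is out of range 1-65535", c.Service.Port)
		}
	}
	return nil
}

func main() {
	cfg := &ClientConfig{Service: &ServiceReference{Namespace: "example-namespace", Name: "example-service"}}
	fmt.Println(validateClientConfig(cfg), cfg.Service.Port)
}
```

In a real implementation this kind of check would more naturally live in a validating webhook for the CR; the sketch
only illustrates the rules themselves.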

## Runtime Hooks developer guide (CAPI internals)

_The following notes provide details about how Runtime Hooks will be implemented in the Cluster API codebase;
if you are not interested in CAPI internals you can skip this section._

### Runtime hook implementation

The process of implementing new Runtime Hooks is intentionally designed to mimic the steps currently
used to define API types, thus providing a familiar experience to the maintainers and to people used to looking at the
Cluster API codebase. More specifically:

- Runtime Hooks versions must be defined under the `/exp/runtime/hooks/api` folder.
- There must be one folder per apiVersion, e.g. `/v1alpha1`, `/v1alpha2` etc.

```
/exp/runtime/hooks/api
├── v1alpha1
└── v1alpha2
```

Each version folder must:

- Define a group version
- Provide type definitions for the Runtime Hook and its request and response parameters.

```
/exp/runtime/hooks/api/v1alpha1
├── groupversion_info.go
└── lifecyclehooks_types.go
```

Type definitions are standard Golang type definitions with Golang JSON tags and a set of additional k8s/kubebuilder
markers triggering code generators for:

- DeepCopy functions, so that request and response parameter types satisfy the `runtime.Object` interface.
- Conversion functions from older apiVersions of the Runtime Hook request and response parameter types to the latest one.
- OpenAPI schema definitions for each type.

```go
// BeforeClusterUpgradeRequest is the request of the BeforeClusterUpgrade hook.
// +k8s:openapi-gen=true
// +kubebuilder:object:generate=true
// +kubebuilder:object:root=true
type BeforeClusterUpgradeRequest struct {
	metav1.TypeMeta `json:",inline"`

	...
}

// BeforeClusterUpgradeResponse is the response of the BeforeClusterUpgrade hook.
// +k8s:openapi-gen=true
// +kubebuilder:object:generate=true
// +kubebuilder:object:root=true
type BeforeClusterUpgradeResponse struct {
	metav1.TypeMeta `json:",inline"`

	...
}

// BeforeClusterUpgrade is the hook that will be called after a Cluster.spec.version is upgraded and
// before the updated version is propagated to the underlying objects.
func BeforeClusterUpgrade(*BeforeClusterUpgradeRequest, *BeforeClusterUpgradeResponse) {}
```

The code generators are https://github.com/kubernetes-sigs/controller-tools and https://github.com/kubernetes/kube-openapi;
the expected output will be similar to:

```
/runtime/contract/cluster/v1alpha1
├── groupversion_info.go
├── lifecyclehooks_types.go
├── zz_generated.conversion.go
├── zz_generated.deepcopy.go
└── zz_generated.openapi.go
```

Similarly to what happens for API types and api-machinery schema, the type definitions inside every version folder
have to be added to a `Catalog`, but with a few notable differences:

- The `Catalog` tracks the mapping between a group/version/hook and its own corresponding request/response types
  (group/version/request-GVK and group/version/response-GVK).
- Type conversions are allowed between objects with the same group/hook (instead of being in a “flat type-space”
  like in the api-machinery schema).

`groupversion_info.go`:
```go
var (
	// GroupVersion is the group version identifying Runtime Hooks defined in this package
	// and their request and response types.
	GroupVersion = schema.GroupVersion{Group: "hooks.runtime.cluster.x-k8s.io", Version: "v1alpha1"}

	// catalogBuilder is used to add Runtime Hooks and their request and response types
	// to a Catalog.
	catalogBuilder = &runtimecatalog.Builder{GroupVersion: GroupVersion}

	// AddToCatalog adds Runtime Hooks defined in this package and their request and
	// response types to a catalog.
	AddToCatalog = catalogBuilder.AddToCatalog

	// localSchemeBuilder provides access to the SchemeBuilder used for managing Runtime Hooks
	// and their request and response types defined in this package.
	// NOTE: This object is required to allow registration of automatically generated
	// conversion funcs.
	localSchemeBuilder = catalogBuilder
)

func init() {
	// Add Open API definitions for RuntimeHooks request and response types in this package
	// NOTE: the GetOpenAPIDefinitions func is automatically generated by openapi-gen.
	catalogBuilder.RegisterOpenAPIDefinitions(GetOpenAPIDefinitions)
}
```

`lifecyclehooks_types.go`:
```go
func init() {
	// Register Runtime Hooks defined in this package.
	catalogBuilder.RegisterHook(BeforeClusterUpgrade, &runtimecatalog.HookMeta{
		Tags:        []string{"Lifecycle Hooks"},
		Summary:     "Called before the Cluster is upgraded.",
		Description: "This blocking hook is called after the Cluster object has been updated with a new spec.topology.version by the user, and immediately before the new version is propagated to the Control Plane.",
	})
}
```

Given the above definitions, a catalog can finally be created as follows:

```go
var c = catalog.NewCatalog()

func init() {
	v1alpha1.AddToCatalog(c)
	v1alpha2.AddToCatalog(c)
	v1alpha3.AddToCatalog(c)
}
```

The catalog provides the core knowledge required to manage all the Runtime Hooks supported by Cluster API;
the first application of such knowledge will be to retrieve all the info required to generate the OpenAPI specification
for Runtime Hooks with a dedicated tool under `hack/tools`.
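
To make the role of the `Catalog` more tangible, the following toy sketch shows one possible way to map a
group/version/hook to its request and response types and to instantiate a request from that mapping. It is an
illustration under simplifying assumptions, not the `runtimecatalog` package: `GroupVersionHook`, `AddHook` and
`NewRequest` are hypothetical names, and conversions and OpenAPI definitions are left out entirely.

```go
package main

import (
	"fmt"
	"reflect"
)

// GroupVersionHook identifies a Runtime Hook. It is a simplified, hypothetical key type.
type GroupVersionHook struct {
	Group, Version, Hook string
}

// hookTypes records the request and response types registered for one hook.
type hookTypes struct {
	request, response reflect.Type
}

// Catalog is a toy stand-in for the Catalog described above.
type Catalog struct {
	hooks map[GroupVersionHook]hookTypes
}

// NewCatalog returns an empty Catalog.
func NewCatalog() *Catalog {
	return &Catalog{hooks: map[GroupVersionHook]hookTypes{}}
}

// AddHook registers a hook with its request/response types.
// req and resp must be pointers to structs.
func (c *Catalog) AddHook(gvh GroupVersionHook, req, resp any) error {
	if _, exists := c.hooks[gvh]; exists {
		return fmt.Errorf("hook %v is already registered", gvh)
	}
	c.hooks[gvh] = hookTypes{
		request:  reflect.TypeOf(req).Elem(),
		response: reflect.TypeOf(resp).Elem(),
	}
	return nil
}

// NewRequest returns a pointer to a new zero-valued request object for the given hook.
func (c *Catalog) NewRequest(gvh GroupVersionHook) (any, error) {
	t, ok := c.hooks[gvh]
	if !ok {
		return nil, fmt.Errorf("hook %v is not registered", gvh)
	}
	return reflect.New(t.request).Interface(), nil
}

// Hypothetical, heavily reduced request/response types.
type BeforeClusterUpgradeRequest struct{ FromVersion, ToVersion string }
type BeforeClusterUpgradeResponse struct{ Message string }

func main() {
	c := NewCatalog()
	gvh := GroupVersionHook{Group: "hooks.runtime.cluster.x-k8s.io", Version: "v1alpha1", Hook: "BeforeClusterUpgrade"}
	if err := c.AddHook(gvh, &BeforeClusterUpgradeRequest{}, &BeforeClusterUpgradeResponse{}); err != nil {
		panic(err)
	}
	req, _ := c.NewRequest(gvh)
	fmt.Printf("%T\n", req)
}
```

Keying the map on group, version and hook together (rather than on a flat type name) mirrors the point made above:
request and response types are only meaningful in the context of the hook they belong to.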

### Discovering Runtime Extensions

_Note: the controller described in this paragraph will be executed only if the Runtime Extension feature flag is set to true._

Cluster API is going to implement a new controller that looks at Runtime Extension Configurations; the main
responsibility of this controller should be to maintain an internal, shared **registry** of available extensions
at a given time.

Please note that the Runtime Extensions registry also provides a single point to centralize a set of common behaviors
supporting interaction with those external components, thus making the adoption of this feature scalable -
in the sense of being used for an increasing number of use cases in Cluster API - while operating consistently
across the board.

A first behavior that falls into this category is the implementation of exponential backoff mechanisms
in case of errors, thus preventing Cluster API from creating pressure on HTTP Servers recovering from or with
ongoing operational issues.

Another cross-cutting concern is about ensuring that Runtime Extensions, which are external components triggered
in the middle of Cluster API controllers logic, do not block the reconciliation process indefinitely
(e.g. by enforcing a maximum timeout for all the Runtime Extensions calls).

### Calling Runtime Extensions

_Note: the code described in this paragraph will be executed only if the Runtime Extension feature flag is set to true._

Cluster API is going to implement calls to registered Runtime Extensions at well-known moments of the Cluster’s lifecycle.

The two key elements that make the implementation of runtime extension calls simple and consistent across
the codebase are:

- The catalog, providing the info about all the defined Runtime Hooks, supported versions and
  corresponding request/response types;
- The client, implementing the call to a Runtime Extension.

Given these two elements, the code for calling a Runtime Extension is:

`main.go`:
```go
var (
	// Create a Catalog.
	catalog = runtimecatalog.New()
	...
)

func init() {
	...
	// Register the RuntimeHook types into the catalog.
	_ = runtimehooksv1.AddToCatalog(catalog)
	...
}

func setupReconcilers(ctx context.Context, mgr ctrl.Manager) {
	...
	// Setup the runtime client.
	runtimeClient = runtimeclient.New(runtimeclient.Options{
		Catalog:  catalog,
		Registry: runtimeregistry.New(),
		Client:   mgr.GetClient(),
	})
	...
	// Pass the runtime client to a reconciler.
	if err := (&controllers.ClusterTopologyReconciler{
		Client:                    mgr.GetClient(),
		APIReader:                 mgr.GetAPIReader(),
		RuntimeClient:             runtimeClient,
		UnstructuredCachingClient: unstructuredCachingClient,
		WatchFilterValue:          watchFilterValue,
	}).SetupWithManager(ctx, mgr, concurrency(clusterTopologyConcurrency)); err != nil {
		setupLog.Error(err, "unable to create controller", "controller", "ClusterTopology")
		os.Exit(1)
	}
	...
}
```

`cluster_controller.go`:
```go
// Call BeforeClusterCreate Runtime Extensions.
hookRequest := &runtimehooksv1.BeforeClusterCreateRequest{
	Cluster: *s.Current.Cluster,
}
hookResponse := &runtimehooksv1.BeforeClusterCreateResponse{}
if err := r.RuntimeClient.CallAllExtensions(ctx, runtimehooksv1.BeforeClusterCreate, s.Current.Cluster, hookRequest, hookResponse); err != nil {
	return ctrl.Result{}, err
}
```

A couple of elements are worth noting:

- `CallAllExtensions` will call all registered Runtime Extensions of the corresponding group and hook.
  This will also include Runtime Extensions implementing older versions of the same Runtime Hook.
- The call is implemented using the latest version of the Runtime Hook/request/response types; the
  `CallAllExtensions` function will take care of version conversions, if required.

## Security Model

The following threats were considered:

- Malicious Runtime Extensions being registered

  Mitigation: The same mitigations used for avoiding malicious dynamic webhooks in Kubernetes apply
  (defining RBAC rules for the ExtensionConfig assigning this responsibility to cluster admin only).

- Privilege escalation of HTTP Servers running Runtime Extensions

  Mitigation: The same mitigations used for any HTTP server deployed in Kubernetes apply
  (use distroless base image, do not use privileged pods etc.).

- Tampering of the communication channel between Cluster API controllers and HTTP Servers implementing Runtime Extensions.

  Mitigation: The same mitigations used for any HTTP server deployed in Kubernetes apply (use SSL, Network policies etc.).

## Risks and Mitigations

- Building Runtime SDK, Runtime Hooks and Runtime Extensions in sequential steps might lead to reworks.

  This is an accepted risk, given the importance of defining a robust SDK before external developers start relying
  extensively on this feature.

## Alternatives

- Using in-process plugins vs calling external components

  Plugins have been considered (golang native plugins, grpc plugins with https://github.com/hashicorp/go-plugin
  and also webassembly) but the option has been discarded given that this approach could introduce instability,
  due to external components running alongside Cluster API components, and also has a more complex threat model,
  given that those components could potentially inherit and exploit the permissions given to Cluster API components.

- Using grpc instead of RESTful APIs.

  Even if grpc could provide some advantages in terms of performance, the option has been discarded given that
  with RESTful APIs it is easier to implement a framework that mimics Kubernetes APIs (do not reinvent the wheel,
  leverage api-machinery, controller tools, kube-openapi, provide a familiar developer experience).

## Upgrade Strategy

This proposal does not affect Cluster API providers or Cluster API cluster’s upgrade strategy or version skew.
However, rules for evolving Runtime Hooks across Cluster API versions are introduced.

## Additional Details

### Test Plan

While in alpha phase it is expected that the Runtime SDK will have unit tests covering all the main components:
catalog, discovery controller, tooling.

With the increasing adoption of this feature, we expect more unit tests, integration tests and E2E tests
to be added covering specific Runtime Hooks.

### Graduation Criteria

The main criterion for graduating this feature is adoption; further detail about graduation criteria will be added
in future iterations of this document.

### Version Skew Strategy

See upgrade strategy.

## Annex

### Runtime SDK rules

**Rule #1: Runtime Hooks and request/response parameter elements may only be removed by incrementing the version of the
Runtime Hook.**

Once a Runtime Hook or a Runtime Hook request/response parameter element has been added to a particular version,
it cannot be removed from that version or have its behavior significantly changed.

**Rule #2: A Runtime Hook's request parameters must be down-convertible, and its response parameters must be up-convertible.
More specifically:**

- request parameters must be able to be down-converted from the latest version to previous versions of the same
  Runtime Hook; this might imply information loss, but the behavior of the previous version of the Runtime Hook
  must not be affected by this.
- response parameters must be able to be up-converted from previous versions to the current version of the same
  Runtime Hook; this means that new information should be nullable or have defaults.

For example, assume that we have a `BeforeClusterUpgrade` Runtime Hook with versions `v1alpha1` and `v1alpha2`.
In order to avoid duplicating code, Cluster API internally will always work at the latest version, `v1alpha2`
in the example, but there could still be a deployed Runtime Extension on `v1alpha1`.

This rule makes it possible to call the Runtime Extensions still using `v1alpha1` by ensuring it is possible
to down-convert the request parameter of the `v1alpha2` call implemented in CAPI, make the call, and then
up-convert the `v1alpha1` response parameter to the `v1alpha2` version CAPI expects.

**Rule #3: A Runtime Hook version in a given track may not be deprecated until a new version at least as stable
is released.**

GA Runtime Hook versions can replace GA or beta Runtime Hook versions; beta Runtime Hook versions may not replace
GA Runtime Hook versions, etc.
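
As an illustration of Rule #2 above, the following is a minimal sketch of the down-convert/call/up-convert round trip. All type names, fields and functions here are hypothetical and exist only for this example; they are not the Runtime SDK's actual types or conversion machinery. The field added in the newer response version is a pointer, so a response from an older extension up-converts cleanly with the field left unset.

```go
package main

import "fmt"

// Hypothetical request types for two versions of the same Runtime Hook.
type RequestV1Alpha1 struct {
	ClusterName string
}

type RequestV1Alpha2 struct {
	ClusterName string
	Topology    string // hypothetical field added in v1alpha2
}

// Hypothetical response types for the same two versions.
type ResponseV1Alpha1 struct {
	Status string
}

type ResponseV1Alpha2 struct {
	Status string
	// Hypothetical field added in v1alpha2; nullable so that
	// v1alpha1 responses can be up-converted without inventing a value.
	RetryAfterSeconds *int32
}

// downConvertRequest drops information unknown to v1alpha1.
// The loss is acceptable because v1alpha1 behavior never depended on it.
func downConvertRequest(in RequestV1Alpha2) RequestV1Alpha1 {
	return RequestV1Alpha1{ClusterName: in.ClusterName}
}

// upConvertResponse leaves the new, nullable field unset.
func upConvertResponse(in ResponseV1Alpha1) ResponseV1Alpha2 {
	return ResponseV1Alpha2{Status: in.Status}
}

// legacyExtension stands in for a deployed Runtime Extension
// that only understands v1alpha1.
func legacyExtension(req RequestV1Alpha1) ResponseV1Alpha1 {
	if req.ClusterName == "" {
		return ResponseV1Alpha1{Status: "Failure"}
	}
	return ResponseV1Alpha1{Status: "Success"}
}

// callLegacy lets a caller working at v1alpha2 reach the v1alpha1 extension.
func callLegacy(req RequestV1Alpha2) ResponseV1Alpha2 {
	return upConvertResponse(legacyExtension(downConvertRequest(req)))
}

func main() {
	resp := callLegacy(RequestV1Alpha2{ClusterName: "my-cluster", Topology: "quick-start"})
	fmt.Println(resp.Status, resp.RetryAfterSeconds == nil)
}
```

In this sketch the caller only ever handles `v1alpha2` values, which mirrors the idea that Cluster API internally works at the latest version while older extensions remain callable.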

**Rule #4: Other than the most recent Runtime Hook versions in each track, older Runtime Hook versions must
be supported after their announced deprecation for a duration of no less than:**

- GA: 12 months or 3 releases (whichever is longer)
- Beta: 6 months or 2 releases (whichever is longer)
- Alpha: 0 releases

### Discovery hook

The Discovery hook must be implemented by all Runtime Extension servers, and it is responsible for
informing the system about the Runtime Extensions the server implements.

When invoked, the Discovery hook is expected to provide the following answer:

```yaml
status: Success # or Failure
message: "error message if status == Failure"
handlers: # Info about implemented runtime extensions
- name: http-proxy # Unique name identifying the runtime extension
  requestHook:
    apiVersion: "hook.runtime.cluster.x-k8s.io/v1alpha1"
    hook: "generatePatches"
  timeoutSeconds: 5 # Default value suggested by the RuntimeExtension developers
  failurePolicy: Fail # Default value suggested by the RuntimeExtension developers
- ...
```

Please note that the above struct supports defining more than one Runtime Extension for the same hook, e.g.
defining more than one "generatePatches" extension.

## Implementation History

- [x] 2021-08-30: Proposed idea in an [issue](https://github.com/kubernetes-sigs/cluster-api/issues/5175)
- [x] 2022-02-08: Compile a [Google Doc](https://docs.google.com/document/d/15USA_Gxv3nWYa7bB_2JAtv4tODBNTrFHumg3lMG8WqI/edit?usp=sharing) following the CAEP template.
- [x] 2022-02-09: Present proposal at a [community meeting]
- [x] 2022-02-21: Open proposal PR

<!-- Links -->
[community meeting]: https://docs.google.com/document/d/1ushaVqAKYnZ2VN_aa3GyKlS4kEd6bSug13xaXOakAQI/edit#heading=h.pxsq37pzkbdq