---
title: Single Controller Multitenancy
authors:
  - "@randomvariable"
  - "@andrewmyhre"
reviewers:
  - "@ashoksekar07"
  - "@bagnaram"
  - "@CecileRobertMichon"
  - "@detiber"
  - "@devigned"
  - "@nader-ziada"
  - "@ncdc"
  - "@richardcase"
  - "@rudoi"
  - "@sethp-nr"
creation-date: 2020-05-05
last-updated: 2020-05-05
status: implementable
see-also:
  - https://github.com/kubernetes-sigs/cluster-api-provider-azure/issues/586
replaces: []
superseded-by: []
---

# Single Controller Multitenancy

## Table of Contents

- [Single Controller Multitenancy](#single-controller-multitenancy)
  - [Table of Contents](#table-of-contents)
  - [Glossary](#glossary)
  - [Summary](#summary)
  - [Motivation](#motivation)
    - [Goals](#goals)
    - [Non-Goals/Future Work](#non-goalsfuture-work)
  - [Proposal](#proposal)
    - [User Stories](#user-stories)
      - [Story 1](#story-1)
      - [Story 2](#story-2)
      - [Story 3](#story-3)
  - [Requirements](#requirements)
    - [Functional](#functional)
    - [Non-Functional](#non-functional)
    - [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints)
      - [Proposed Changes](#proposed-changes)
        - [Cluster API Provider AWS v1alpha3 types](#cluster-api-provider-aws-v1alpha3-types)
    - [Controller Changes](#controller-changes)
    - [Clusterctl changes](#clusterctl-changes)
      - [Validating webhook changes](#validating-webhook-changes)
      - [Identity Type Credential Provider Behaviour](#identity-type-credential-provider-behaviour)
    - [Security Model](#security-model)
      - [Roles](#roles)
    - [RBAC](#rbac)
        - [Write Permissions](#write-permissions)
    - [Namespace Restrictions](#namespace-restrictions)
    - [CAPA Controller Requirements](#capa-controller-requirements)
  - [Alternative Approaches Considered](#alternative-approaches-considered)
    - [Using only secrets or RoleARN field on AWSCluster](#using-only-secrets-or-rolearn-field-on-awscluster)
    - [1:1 mapping one namespace to one AWSIdentity](#11-mapping-one-namespace-to-one-awsidentity)
  - [Risks and Mitigations](#risks-and-mitigations)
    - [Network assumptions are made explicit](#network-assumptions-are-made-explicit)
    - [Caching and handling refresh of credentials](#caching-and-handling-refresh-of-credentials)
    - [AWS Cloud Provider behaviour with regards to cluster names](#aws-cloud-provider-behaviour-with-regards-to-cluster-names)
  - [Upgrade Strategy](#upgrade-strategy)
  - [Additional Details](#additional-details)
    - [Test Plan](#test-plan)
    - [Graduation Criteria](#graduation-criteria)
      - [Alpha](#alpha)
      - [Beta](#beta)
      - [Stable](#stable)
    - [Version Skew Strategy](#version-skew-strategy)
  - [Implementation History](#implementation-history)

## Glossary

* Identity Type - One of several ways to provide a form of identity that is ultimately resolved to an AWS access key ID,
  secret access key and optional session token tuple.
* Credential Provider - An implementation of the interface specified in the [AWS SDK for
  Go][aws-sdk-go-credential-provider].
* CAPA - An abbreviation of Cluster API Provider AWS.

## Summary

The CAPA operator is able to manage cloud infrastructure within the permission scope of the AWS principal it is
initialized with. It is expected that the CAPA operator will be provided credentials via the deployment, either
explicitly via environment variables or implicitly via the default SDK credential provider chain, including EC2 instance
profiles or metadata proxies such as [kiam][kiam].

Currently, without a custom deployment, role assumption in CAPA can only take place using a metadata proxy such as
[kiam][kiam], or using OIDC-based service account token volume projection, which requires Kubernetes 1.16 and special
setup of certificates with an appropriate OIDC trust.

It is technically possible to provide an IAM role via a shared configuration file, but the UX has been found to be
complicated, and additional breaking code changes have been suggested to resolve these issues.

In addition, where role assumption is configured, that role is used for the entire lifetime of the CAPA deployment. This
also means that an AWSCluster could be broken if the instance of CAPA that created it is misconfigured for another set
of credentials.

This proposal outlines new capabilities for CAPA to use IAM role assumption to assume a permission set in a different
AWS account, at runtime, on a per-cluster basis. The proposed changes would be fully backwards compatible and maintain
the existing behavior with no changes to user configuration required.

## Motivation

For large organizations, especially highly-regulated organizations, there is a need to be able to perform separate
duties at various levels of infrastructure - permissions, networks and accounts. VPC sharing is a model which provides
separation at the AWS account level. Within this model it is appropriate for tooling running within the 'management' account
to manage infrastructure within the 'workload' accounts, which requires an identity in the management account which can
assume an identity within a workload account. For CAPA to be most useful within these organizations it will need to
support multi-account models.

Some organizations may also delegate the management of clusters to a third party.
In that case, the boundary
between organizations needs to be secured, and issues such as confused deputy scenarios need to be mitigated.

Because a single deployment of the CAPA operator may reconcile many different clusters in its lifetime, it is necessary
to modify the CAPA operator to scope its AWS client instances to within the reconciliation process.

AWS provides a number of mechanisms for assuming roles across account boundaries:

* Role assumption using credentials provided in environment variables
* Role assumption using credentials provided by a metadata service, e.g. IMDS or ECS
* Role assumption using an OIDC JWT token
* Role assumption using explicitly provided credentials

The first three methods are supported per instance of the CAPA controller today. Of special note is OIDC, which was
introduced in 2019 as a way to leverage projected service account tokens from Kubernetes to assume a role. This allows
for fine-grained specification of IAM roles using Kubernetes primitives, but is still limited to a single role assumption
per instance of CAPA.

### Goals

1. To enable reconciliation of AWSCluster resources across AWS account boundaries
2. To maintain backwards compatibility and cause no impact for users who don't intend to make use of this capability

### Non-Goals/Future Work

- To enable Machines to be provisioned in AWS accounts different than their control planes. This would require adding
  various AWS infrastructure specifications to the AWSMachineTemplate type, which do not currently exist.
- To enable control plane and worker machines to be provisioned in separate accounts.

## Proposal

### User Stories

#### Story 1

Alex is an engineer in a large organization which has a strict AWS account architecture. This architecture dictates that
Kubernetes clusters must be hosted in dedicated AWS accounts.
The organization has adopted Cluster API in order to
manage Kubernetes infrastructure, and expects 'management' clusters running the Cluster API controllers to manage 'workload'
clusters in dedicated AWS accounts.

The following configuration exists:

AWS Account 'management':
* VPC and subnets shared with the 'workload' AWS Account
* EC2 instance profile linked to the IAM role 'ClusterAPI-Mgmt'
* A management Kubernetes cluster running Cluster API Provider AWS controllers using the 'ClusterAPI-Owner' IAM identity.

AWS Account 'workload':
* VPC and subnets provided by the 'Owner' AWS Account
* IAM Role 'ClusterAPI-Participant' which trusts IAM role 'ClusterAPI-Owner'

Alex can provision a new cluster in the 'workload' AWS Account by creating new Cluster API resources in the management cluster.
Cluster 'Owner' Alex specifies the IAM role 'ClusterAPI-Participant' in the AWSCluster spec. The CAPA controller in the
management cluster assumes the role 'ClusterAPI-Participant' when reconciling the AWSCluster so that it can
create/use/destroy resources in the 'workload' AWS Account.

#### Story 2

Dascha is an engineer in a smaller, less strict organization with a few AWS accounts intended to host all
infrastructure. There is a single AWS account named 'dev', and Dascha wants to provision a new cluster in this account.
An existing Kubernetes cluster is already running the Cluster API operators and managing resources in the dev account.
Dascha can provision a new cluster by creating Cluster API resources in the existing cluster, omitting the IAMRoleARN
field in the AWSCluster spec. The CAPA operator will not attempt to assume an IAM role and instead will use the AWS
credentials provided in its deployment template (using kiam, environment variables or some other method of obtaining
credentials).

#### Story 3

ACME Industries is offering Kubernetes as a service to other organizations, and follows AWS guidelines for
Software-as-a-Service. This means they want to use cross-account role assumptions for access to customer systems. ACME
Industries also wants to prevent cross-organisation attacks, so they use “external IDs” to prevent confused deputy
scenarios. ACME Industries wants to minimise the memory footprint of managing many clusters, and wants to move to having
a single instance of CAPA to cover multiple organisations.

## Requirements

### Functional

<a name="FR1">FR1.</a> CAPA MUST support IAM role assumption using the STS::AssumeRole API.

<a name="FR2">FR2.</a> CAPA MUST support session names and external ID to prevent confused deputy attacks.

<a name="FR3">FR3.</a> CAPA MUST support static credentials.

<a name="FR4">FR4.</a> CAPA MUST prevent privilege escalation that would allow users to create clusters in AWS accounts
they should not be able to.

<a name="FR5">FR5.</a> CAPA SHOULD support credential refreshing when identity data is modified.

<a name="FR6">FR6.</a> CAPA SHOULD provide validation for identity data submitted by users.

<a name="FR7">FR7.</a> CAPA COULD support role assumption using OIDC projected volume service account tokens.

<a name="FR9">FR9.</a> CAPA MUST support clusterctl move scenarios.

### Non-Functional

<a name="NFR8">NFR8.</a> Each instance of CAPA SHOULD be able to support 200 clusters using role assumption.

<a name="NFR9">NFR9.</a> CAPA MUST call STS APIs only when necessary to prevent rate limiting.

<a name="NFR10">NFR10.</a> Unit tests MUST exist for all credential provider code.

<a name="NFR11">NFR11.</a> e2e tests SHOULD exist for all credential provider code.

<a name="NFR12">NFR12.</a> Credential provider code COULD be audited by security engineers.

### Implementation Details/Notes/Constraints

The current implementation of CAPA creates new AWS EC2 and ELB service clients per cluster and per machine and stores
these in fields on the ClusterScope struct. ClusterScopes are reference values which are created per-reconciliation:

```go
type ClusterScope struct {
	logr.Logger
	client      client.Client
	patchHelper *patch.Helper

	AWSClients
	Cluster    *clusterv1.Cluster
	AWSCluster *infrav1.AWSCluster
}
```

The field AWSClients holds the actual AWS service clients, and is defined like so:

```go
type AWSClients struct {
	EC2             ec2iface.EC2API
	ELB             elbiface.ELBAPI
	SecretsManager  secretsmanageriface.SecretsManagerAPI
	ResourceTagging resourcegroupstaggingapiiface.ResourceGroupsTaggingAPIAPI
}
```

The signatures for the functions which create these instances are as follows:

```go
func NewClusterScope(params ClusterScopeParams) (*ClusterScope, error) {
	...
	return &ClusterScope{
		...
	}, nil
}
```

#### Proposed Changes

This proposal borrows heavily from the Service APIs specification.

In the initial implementation, all new resources will be scoped at the cluster level. This is to enable delegation of
AWS accounts whilst preventing privilege escalation as per [FR4](#FR4). Reasons for this are documented in the
[alternatives](#alternative-approaches-considered) section.

##### Cluster API Provider AWS v1alpha3 types

<strong><em>Changed Resources</em></strong>
* `AWSCluster`

<strong><em>New Resources</em></strong>

<em>Cluster scoped resources</em>

* `AWSClusterControllerIdentity` represents an intent to use Cluster API Provider AWS Controller credentials for the management cluster.
* `AWSClusterStaticIdentity` represents a static AWS tuple of credentials.
* `AWSClusterRoleIdentity` represents an intent to assume an AWS role for cluster management.

<em>Namespace scoped resources</em>

* `AWSServiceAccountIdentity` represents the use of a Kubernetes service account for access
  to AWS using federated identity.

<strong><em>Changes to AWSCluster</em></strong>

A new field is added to the `AWSClusterSpec` to reference an identity. We intend to use `corev1.LocalObjectReference` in
order to ensure that the only objects that can be referenced are either in the same namespace or are scoped to the
entire cluster.

```go
// AWSIdentityKind defines allowed AWS identity types
type AWSIdentityKind string

type AWSIdentityRef struct {
	Kind AWSIdentityKind `json:"kind"`
	Name string          `json:"name"`
}

type AWSClusterSpec struct {
	...
	// +optional
	IdentityRef *AWSIdentityRef `json:"identityRef,omitempty"`
	// AccountID is the AWS Account ID for this cluster
	// +optional
	AccountID *string `json:"accountID,omitempty"`
}
```

An example usage would be:

```yaml
---
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
kind: AWSCluster
metadata:
  name: "test"
  namespace: "test"
spec:
  region: "eu-west-1"
  identityRef:
    kind: AWSClusterRoleIdentity
    name: test-account-role
---
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
kind: AWSClusterRoleIdentity
metadata:
  name: "test-account-role"
spec:
...
```

The `IdentityRef` field will be mutable in order to support `clusterctl move`
scenarios where a user instantiates a cluster on their laptop and then makes
the cluster self-managed.
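
As a minimal sketch of how an `identityRef` might be resolved and checked, assuming the defaulting behaviour described under [Controller Changes](#controller-changes) (a `nil` reference falls back to the `AWSClusterControllerIdentity`) and the identity name `default` used in the examples in this proposal. The kind constants and the `defaultIdentityRef` and `validateIdentityRef` helpers are hypothetical names for illustration only, not part of the proposed API.

```go
package main

import "fmt"

// AWSIdentityKind defines allowed AWS identity types.
type AWSIdentityKind string

// Hypothetical constants for the kinds listed in this proposal.
const (
	ControllerIdentityKind     AWSIdentityKind = "AWSClusterControllerIdentity"
	ClusterStaticIdentityKind  AWSIdentityKind = "AWSClusterStaticIdentity"
	ClusterRoleIdentityKind    AWSIdentityKind = "AWSClusterRoleIdentity"
	ServiceAccountIdentityKind AWSIdentityKind = "AWSServiceAccountIdentity"
)

// AWSIdentityRef mirrors the reference type proposed for AWSClusterSpec.
type AWSIdentityRef struct {
	Kind AWSIdentityKind `json:"kind"`
	Name string          `json:"name"`
}

// defaultIdentityRef (hypothetical) returns ref unchanged when it is set,
// and otherwise a reference to the controller identity named "default".
func defaultIdentityRef(ref *AWSIdentityRef) *AWSIdentityRef {
	if ref != nil {
		return ref
	}
	return &AWSIdentityRef{Kind: ControllerIdentityKind, Name: "default"}
}

// validateIdentityRef (hypothetical) rejects unknown kinds and empty names.
func validateIdentityRef(ref AWSIdentityRef) error {
	switch ref.Kind {
	case ControllerIdentityKind, ClusterStaticIdentityKind,
		ClusterRoleIdentityKind, ServiceAccountIdentityKind:
	default:
		return fmt.Errorf("unsupported identity kind %q", ref.Kind)
	}
	if ref.Name == "" {
		return fmt.Errorf("identity name must not be empty")
	}
	return nil
}

func main() {
	ref := defaultIdentityRef(nil)
	fmt.Println(ref.Kind, ref.Name, validateIdentityRef(*ref) == nil)
}
```

Because an explicitly set reference is returned unchanged, AWSClusters that already name an identity are unaffected by the defaulting step.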

<strong><em>Identity CRDs</em></strong>

<em>Common elements</em>

All AWSCluster*Identity types will have Spec structs with an `AllowedNamespaces`
field as follows:

```go

type AWSClusterIdentitySpec struct {
	// AllowedNamespaces is used to identify which namespaces are allowed to use the identity from.
	// Namespaces can be selected either using an array of namespaces or with label selector.
	// An empty allowedNamespaces object indicates that AWSClusters can use this identity from any namespace.
	// If this object is nil, no namespaces will be allowed (default behaviour, if this field is not provided)
	// A namespace should be either in the NamespaceList or match with Selector to use the identity.
	//
	// +optional
	AllowedNamespaces *AllowedNamespaces `json:"allowedNamespaces"`
}

type AllowedNamespaces struct {
	// An nil or empty list indicates that AWSClusters cannot use the identity from any namespace.
	//
	// +optional
	// +nullable
	NamespaceList []string `json:"list"`

	// AllowedNamespaces is a selector of namespaces that AWSClusters can
	// use this ClusterIdentity from. This is a standard Kubernetes LabelSelector,
	// a label query over a set of resources. The result of matchLabels and
	// matchExpressions are ANDed.
	//
	// An empty selector indicates that AWSClusters cannot use this
	// AWSClusterIdentity from any namespace.
	// +optional
	Selector metav1.LabelSelector `json:"selector"`
}

```

All identities based around AWS roles will have the following fields in their
spec, as per [FR1](#FR1):

```go
type AWSRoleSpec struct {
	// The Amazon Resource Name (ARN) of the role to assume.
	// +kubebuilder:validation:Pattern:=[\u0009\u000A\u000D\u0020-\u007E\u0085\u00A0-\uD7FF\uE000-\uFFFD\u10000-\u10FFFF]+
	RoleArn string `json:"roleARN"`
	// An identifier for the assumed role session
	// +kubebuilder:validation:Pattern:=[\w+=,.@-]*
	SessionName string `json:"sessionName,omitempty"`
	// The duration, in seconds, of the role session before it is renewed.
	// +kubebuilder:validation:Minimum:=900
	// +kubebuilder:validation:Maximum:=43200
	DurationSeconds uint `json:"durationSeconds,omitempty"`
	// An IAM policy in JSON format that you want to use as an inline session policy.
	// https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.html
	// +kubebuilder:validation:Pattern:=[\u0009\u000A\u000D\u0020-\u00FF]+
	InlinePolicy string `json:"inlinePolicy,omitempty"`

	// The Amazon Resource Names (ARNs) of the IAM managed policies that you want
	// to use as managed session policies.
	// The policies must exist in the same account as the role.
	PolicyARNs []string `json:"policyARNs,omitempty"`
}
```

<em>AWSClusterControllerIdentity</em>

Supports [FR4](#FR4) by restricting the use of controller credentials to `allowedNamespaces`.
The `AWSClusterControllerIdentity` resource will be a singleton, and this will be enforced by OpenAPI checks.
Creation of this instance is automated by a controller so that existing AWSClusters are not affected. For details, see
[Upgrade Strategy](#upgrade-strategy).

```go

// AWSClusterControllerIdentity represents an intent to use Cluster API Provider AWS Controller credentials for management cluster
// and restricts the usage of it by namespaces.

type AWSClusterControllerIdentity struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	// Spec for this AWSClusterControllerIdentity.
	Spec AWSClusterControllerIdentitySpec `json:"spec,omitempty"`
}

type AWSClusterControllerIdentitySpec struct {
	AWSClusterIdentitySpec
}
```

Example:

```yaml
---
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
kind: AWSCluster
metadata:
  name: "test"
  namespace: "test"
spec:
  region: "eu-west-1"
  identityRef:
    kind: AWSClusterControllerIdentity
    name: default
---
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
kind: AWSClusterControllerIdentity
metadata:
  name: "default"
spec:
  allowedNamespaces: {}  # matches all namespaces
```

<em>AWSClusterStaticIdentity</em>

Supporting [FR3](#FR3).

```go

// AWSClusterStaticIdentity represents a reference to an AWS access key ID and
// secret access key, stored in a secret.
type AWSClusterStaticIdentity struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	// Spec for this AWSClusterStaticIdentity.
	Spec AWSClusterStaticIdentitySpec `json:"spec,omitempty"`
}

type AWSClusterSecretReference struct {
	// Namespace is where the Secret is located
	Namespace string `json:"namespace"`
	// Name is the resource name of the secret
	Name string `json:"name"`
}

type AWSClusterStaticIdentitySpec struct {
	AWSClusterIdentitySpec
	// Reference to a secret containing the credentials.
	// The secret should contain the following data keys:
	//  AccessKeyID: AKIAIOSFODNN7EXAMPLE
	//  SecretAccessKey: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
	//  SessionToken: Optional
	SecretRef AWSClusterSecretReference `json:"secretRef"`
}
```

Example:

```yaml
---
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
kind: AWSCluster
metadata:
  name: "test"
  namespace: "test"
spec:
  region: "eu-west-1"
  identityRef:
    kind: AWSClusterStaticIdentity
    name: test-account
---
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
kind: AWSClusterStaticIdentity
metadata:
  name: "test-account"
spec:
  secretRef:
    name: test-account-creds
    namespace: capa-system
  allowedNamespaces:
    selector:
      matchLabels:
        cluster.x-k8s.io/ns: "testlabel"
---
apiVersion: v1
kind: Namespace
metadata:
  labels:
    cluster.x-k8s.io/ns: "testlabel"
  name: "test"
---
apiVersion: v1
kind: Secret
metadata:
  name: "test-account-creds"
  namespace: capa-system
stringData:
  AccessKeyID: AKIAIOSFODNN7EXAMPLE
  SecretAccessKey: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
```

<em>AWSClusterRoleIdentity</em>

`AWSClusterRoleIdentity` allows CAPA to assume a role either in the same
or another AWS account, using the STS::AssumeRole API, meeting [FR1](#FR1).

```go
type AWSClusterRoleIdentity struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	// Spec for this AWSClusterRoleIdentity.
	Spec AWSClusterRoleIdentitySpec `json:"spec,omitempty"`
}

// AWSClusterIdentityKind defines allowed cluster-scoped AWS identity types
type AWSClusterIdentityKind AWSIdentityKind

type AWSClusterIdentityReference struct {
	Kind AWSClusterIdentityKind `json:"kind"`
	Name string                 `json:"name"`
}

type AWSClusterRoleIdentitySpec struct {
	AWSClusterIdentitySpec
	AWSRoleSpec
	// A unique identifier that might be required when you assume a role in another account.
	// If the administrator of the account to which the role belongs provided you with an
	// external ID, then provide that value in the ExternalId parameter. This value can be
	// any string, such as a passphrase or account number. A cross-account role is usually
	// set up to trust everyone in an account. Therefore, the administrator of the trusting
	// account might send an external ID to the administrator of the trusted account. That
	// way, only someone with the ID can assume the role, rather than everyone in the
	// account. For more information about the external ID, see How to Use an External ID
	// When Granting Access to Your AWS Resources to a Third Party in the IAM User Guide.
	// +optional
	// +kubebuilder:validation:Pattern:=[\w+=,.@:\/-]*
	ExternalID *string `json:"externalID,omitempty"`

	// SourceIdentityRef is a reference to another identity which will be chained to do
	// role assumption.
	SourceIdentityRef AWSClusterIdentityReference `json:"sourceIdentityRef,omitempty"`
}

```

Implementation of `ExternalID` supports [FR2](#FR2).
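
The kubebuilder markers on `AWSRoleSpec` and `ExternalID` imply a set of checks that could also be applied in Go, for instance when providing the validation called for by [FR6](#FR6). The following is a sketch only: the `roleParams` struct and `validateRoleParams` function are hypothetical, the patterns are copied from the markers above but anchored with `^` and `$` (an assumption, so that the whole value must match), and a zero `DurationSeconds` is treated as "not set".

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

var (
	// Character sets taken from the kubebuilder Pattern markers in this proposal.
	sessionNameRE = regexp.MustCompile(`^[\w+=,.@-]*$`)
	externalIDRE  = regexp.MustCompile(`^[\w+=,.@:/-]*$`)
)

// roleParams is a hypothetical, trimmed-down view of the role identity fields.
type roleParams struct {
	RoleARN         string
	SessionName     string
	DurationSeconds uint
	ExternalID      *string
}

// validateRoleParams applies the bounds and patterns declared on the spec types.
func validateRoleParams(p roleParams) error {
	if p.RoleARN == "" {
		return errors.New("roleARN is required")
	}
	if !sessionNameRE.MatchString(p.SessionName) {
		return fmt.Errorf("sessionName %q contains unsupported characters", p.SessionName)
	}
	// Zero means the optional field was omitted; otherwise enforce 900..43200 seconds.
	if p.DurationSeconds != 0 && (p.DurationSeconds < 900 || p.DurationSeconds > 43200) {
		return fmt.Errorf("durationSeconds %d is outside 900..43200", p.DurationSeconds)
	}
	if p.ExternalID != nil && !externalIDRE.MatchString(*p.ExternalID) {
		return fmt.Errorf("externalID %q contains unsupported characters", *p.ExternalID)
	}
	return nil
}

func main() {
	externalID := "7a5b816a-7743-4377-a382-2d695bf1f172"
	err := validateRoleParams(roleParams{
		RoleARN:         "arn:aws:iam::123456789:role/CAPARole",
		SessionName:     "cluster-spinner.acme.com",
		DurationSeconds: 3600,
		ExternalID:      &externalID,
	})
	fmt.Println("valid:", err == nil)
}
```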

Example:

```yaml
---
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
kind: AWSCluster
metadata:
  name: "test"
  namespace: "test"
spec:
  region: "eu-west-1"
  identityRef:
    kind: AWSClusterRoleIdentity
    name: test-account-role
---
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
kind: AWSClusterRoleIdentity
metadata:
  name: "test-account-role"
spec:
  allowedNamespaces:
    list: # allows only "test" namespace to use this identity
    - "test"
  roleARN: "arn:aws:iam::123456789:role/CAPARole"
  # Optional settings
  sessionName: "cluster-spinner.acme.com"
  externalID: "7a5b816a-7743-4377-a382-2d695bf1f172"
  policyARNs:
  - "RestrictedCAPAPolicy"
  inlinePolicy: '{"Version": "2012-10-17","Statement":[...]}'
  tags: '["company": "acme industries", "product": "cluster-spinner"]'
  transitiveTags: '["company", "product"]'
  sourceIdentityRef:
    kind: AWSClusterStaticIdentity
    name: test-account-creds
```

<em>Future implementation: AWSServiceAccountIdentity</em>

This would not be implemented in the first instance, but opens the possibility to use Kubernetes service accounts
together with `STS::AssumeRoleWithWebIdentity`, supporting [FR7](#FR7).

Definition:

```go
type AWSServiceAccountIdentity struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	// Spec for this AWSServiceAccountIdentity.
	Spec AWSServiceAccountIdentitySpec `json:"spec,omitempty"`
}

type AWSServiceAccountIdentitySpec struct {
	AWSRoleSpec

	// Audience is the intended audience of the token. A recipient of a token
	// must identify itself with an identifier specified in the audience of the
	// token, and otherwise should reject the token.
	// The audience defaults to sts.amazonaws.com
	// +default="[sts.amazonaws.com]"
	Audiences []string `json:"audiences,omitempty"`

	// ExpirationSeconds is the requested duration of validity of the request. The
	// token issuer may return a token with a different validity duration so a
	// client needs to check the 'expiration' field in a response.
	// +optional
	// +default=86400
	ExpirationSeconds int `json:"expirationSeconds"`
}
```

Because service account subjects are necessarily scoped to a namespace within the Kubernetes RBAC model, it's therefore
ideal to leverage a namespace-scoped CRD for permissions.

Example:

The account owner would be expected to set up an appropriate IAM role with the following trust policy:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "<Provider ARN for the management cluster OIDC configuration>"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "sub": "system:serviceaccount:test:test-service-account",
          "audience": "sts.amazonaws.com"
        }
      }
    }
  ]
}
```

and then apply the following:

```yaml
---
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
kind: AWSCluster
metadata:
  name: "test"
  namespace: "test"
spec:
  region: "eu-west-1"
  identityRef:
    kind: AWSServiceAccountIdentity
    name: test-service-account
---
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
kind: AWSServiceAccountIdentity
metadata:
  name: "test-service-account"
  namespace: "test"
spec:
  audiences:
  - "sts.amazonaws.com"
  roleARN: "arn:aws:iam::123456789:role/CAPARole"
```

With the CAPA controller running with the appropriate permissions to create service accounts in the `"test"` namespace,
the controller would then issue a POST request to
`/api/v1/namespaces/test/serviceaccounts/test-service-account/token`
with the requisite parameters and receive a new JWT token to use with `STS::AssumeRoleWithWebIdentity`.

### Controller Changes

* If identityRef is specified, the CRD is fetched and unmarshalled into an AWS SDK `credentials.Provider` for the identity type.
* The controller will compare the hash of the credential provider against the same secret’s provider in a cache ([NFR8](#NFR8)).
* The controller will take the newer of the two and instantiate AWSClients with the selected credential provider.
* The controller will set the identity resource as one of the OwnerReferences of the AWSCluster.
* The controller and defaulting webhook will default a `nil` `identityRef` field in AWSClusters to `AWSClusterControllerIdentity`.

This flow is shown below:

[Figure: controller credential provider flow]

### Clusterctl changes

Today, `clusterctl move` operates by tracking object references within the same namespace. Since we are now proposing to
use cluster-scoped resources, we will need to add requisite support to clusterctl's object graph to track ownerReferences
pointing at cluster-scoped resources, and ensure they are moved. We will naively not delete cluster-scoped resources
during a move, as they may be referenced across namespaces. Because we will be tracking the AWS Account ID for a given
identity, this is expected to provide sufficient protection, for this proposal, against the possibility of a
cluster-scoped identity remaining in place after one move and being copied again.

#### Validating webhook changes

A validating webhook could potentially handle some of the cross-resource validation necessary for the [security
model](#security-model) and provide more immediate feedback to end users. However, it would be imperfect. For example, a
change to an `AWSCluster*Identity` could affect the validity of the corresponding AWSCluster.
The singleton `AWSClusterControllerIdentity` resource will be immutable to avoid any unwanted overrides to the allowed
namespaces, especially during cluster upgrades.

#### Identity Type Credential Provider Behaviour

Implementations for all identity types will implement the `credentials.Provider` interface in the AWS SDK to support
[FR5](#FR5), as well as an additional function signature to support caching:

```go
type Provider interface {
	// Retrieve returns nil if it successfully retrieved the value.
	// Error is returned if the value were not obtainable, or empty.
	Retrieve() (Value, error)

	// IsExpired returns if the credentials are no longer valid, and need
	// to be retrieved.
	IsExpired() bool
}

type AWSIdentityTypeProvider interface {
	credentials.Provider
	// Hash returns a unique hash of the data forming the credentials
	// for this identity
	Hash() (string, error)
}
```

AWS client sessions are structs implementing the provider interface. Every SDK call will call `IsExpired()` on the
credential provider. If `IsExpired()` returns true, `Retrieve()` is called to refresh the credential. Where the controller
is using a temporary credential, the “real” credential provider will be nested, such that `IsExpired()` calls `IsExpired()`
on the nested provider. This also allows the end user to chain role identities together if needed.

The controller will maintain a cache of all the custom credential providers that are referenced by an AWSCluster,
similar to what is done for sessions at present.

In the first instance, the controller will refresh secrets on each reconciliation run, and store a hash of the secret
contents in the cache.
If the hash changes, then the credential will be invalidated, forcing `IsExpired()` to return
true, and a consuming EC2, ELB or SecretsManager client will therefore issue a `Retrieve()` on the credential provider,
forcing a refresh.

The authors have implemented a similar mechanism [based on the Ruby AWS SDK][aws-assume-role].

This could be further optimised by having the controller maintain a watch on all Secrets matching the identity types.
Upon receiving an update event, the controller will look up the key in the cache and update the relevant provider.
This may be implemented as its own interface. Mutexes will ensure in-flight updates are completed before SDK calls are
made. This would require changes to RBAC, and maintaining a watch on secrets of a specific type will require further
investigation as to feasibility.

### Security Model

The intended RBAC model mirrors that for Service APIs:

#### Roles

For the purposes of this security model, 3 common roles have been identified:

* **Infrastructure provider**: The infrastructure provider (infra) is responsible for the overall environment that
  the cluster(s) are operating in or the PaaS provider in a company.

* **Management cluster operator**: The cluster operator (ops) is responsible for
  administration of the Cluster API management cluster. They manage policies, network access,
  application permissions.

* **Workload cluster operator**: The workload cluster operator (dev) is responsible for
  management of the cluster relevant to their particular applications.

There are two primary components to the Service APIs security model: RBAC and namespace restrictions.

### RBAC
RBAC (role-based access control) is the standard used for Kubernetes
authorization. This allows users to configure who can perform actions on
resources in specific scopes.
RBAC can be used to enable each of the roles
defined above. In most cases, it will be desirable to have all resources be
readable by most roles, so instead we'll focus on write access for this model.

##### Write Permissions
|                              | `AWSCluster*Identity` | `AWSServiceAccountIdentity` | AWS IAM API | Cluster |
| ---------------------------- | --------------------- | --------------------------- | ----------- | ------- |
| Infrastructure Provider      | Yes                   | Yes                         | Yes         | Yes     |
| Management Cluster Operator  | Yes                   | Yes                         | Yes         | Yes     |
| Workload Cluster Operator    | No                    | Yes                         | No          | Yes     |

Because Kubernetes service accounts necessarily encode the namespace into the JWT subject, we can allow workload cluster
operators to create their own `AWSServiceAccountIdentities`. Whether they have actual permissions on AWS IAM to set up
the trust policy and the management cluster as the federated identity provider then becomes a problem external to
Kubernetes.

#### Namespace Restrictions
The extra configuration options are not possible to control with RBAC. Instead,
they will be controlled with configuration fields on `AWSCluster*Identity` resources:

* **allowedNamespaces**: This field is a selector of namespaces that
  AWSClusters can use this `AWSCluster*Identity` from. This is a standard Kubernetes
  LabelSelector, a label query over a set of resources. The results of
  matchLabels and matchExpressions are ANDed. CAPA will not support
  AWSClusters in namespaces outside this selector. An empty selector (default)
  indicates that AWSClusters can use this `AWSCluster*Identity` from any namespace. This
  field is intentionally not a pointer because the nil behavior (no namespaces)
  is undesirable here.


#### CAPA Controller Requirements
The CAPA controller will need to:

* Populate condition fields on AWSClusters and indicate if it is
  compatible with `AWS*Identity`.
* Not implement invalid configuration.
  For example, if an `AWSCluster*Identity` is referenced in
  an invalid namespace for it, it should be ignored.
* Respond to changes in `AWS*Identity` configuration.

### Alternative Approaches Considered

#### Using only secrets or RoleARN field on AWSCluster

In an earlier iteration of this proposal, it was proposed to only use secretReferences in the `AWSCluster` object, and
then use the `Type` field on the Secret to disambiguate between the different principal types.

**Benefits**

* Re-using secrets ensures encryption by default where KMS encryption is used, and additionally provides a clear UX
  signal to end users that the data is meant to be kept secure.

<em>Mitigations for current proposal</em>

This is traded off in this proposal with ensuring static credentials are stored in secrets, whilst
allowing role ARNs and other parameters to be stored in CRDs.

Where cluster operators are using encryption providers for the Kubernetes API Server, they can
optionally specify that these new resources are encrypted at rest through configuration of the
API server KMS.

**Downsides**

By allowing workload cluster operators to create the various identities, or reference them directly in the
AWSCluster as a field, they could potentially escalate privilege when assuming roles across AWS accounts, as the CAPA
controller itself may be running with privileged trust.

#### 1:1 mapping one namespace to one AWSIdentity

The mapping of a singular AWSIdentity to a single namespace such that there is a 1:1 mapping
was considered, via either some implicitly named secret or other metadata on the namespace.

**Benefits**

* Can potentially resolve an issue where multiple clusters with the same name in an AWS account cause problems for the [AWS Cloud Provider](#aws-cloud-provider-behaviour-with-regards-to-cluster-names).

**Downsides**

* An implicit link between credentials and clusters is harder to introspect.
* Doesn't necessarily guarantee uniqueness of cluster names in an account.

### Risks and Mitigations

#### Network assumptions are made explicit

This change maintains but makes explicit the expectation that there is network access from the account that the CAPI
operator resides in to an API server endpoint in an account where a reconciled cluster resides. Existing pre-flight
checks would not confirm this. The existing pattern for the CAPA operator to create security groups if they do not exist
may need to account for this network access requirement. Currently, when CAPA creates a security group for the cluster
control plane load balancer, it allows ingress from any CIDR block. However, the security groups constraining the CAPI
operator would require appropriate egress rules to be able to access load balancers in other AWS accounts. The extent to
which CAPA can solve for this needs to be determined. However, this risk is already present and exposed when doing role
assumption using another method such as Kiam.

#### Caching and handling refresh of credentials

For handling many accounts, the number of calls to the STS service must be minimised. This is currently implemented as a
cache on cluster key in Cluster API Provider AWS.

#### AWS Cloud Provider behaviour with regards to cluster names

At present, multiple clusters deployed into the same AWS account with the same
cluster name cause the AWS Cloud Provider to malfunction due to the way
tags are looked up. This has driven some concern about allowing multiple
namespaces access to the same AWS account.

However, since it is possible to configure multiple credentials against the same
account, there's no easy way to guarantee uniqueness.
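
To make the collision concrete, the sketch below groups clusters by AWS account ID and cluster name rather than by namespace or credential. It assumes the account ID behind each identity has already been resolved by some means; `clusterRef`, `accountAndName`, and `findNameCollisions` are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// clusterRef is a hypothetical, simplified view of an AWSCluster.
type clusterRef struct {
	Namespace string
	Name      string
	AccountID string // assumed to have been resolved from the identity
}

// accountAndName is the key under which the cloud provider's tag lookups
// would effectively see clusters.
type accountAndName struct {
	AccountID string
	Name      string
}

// findNameCollisions returns, for every (account, name) pair used by more
// than one cluster, the sorted list of namespace/name references.
func findNameCollisions(clusters []clusterRef) map[accountAndName][]string {
	groups := map[accountAndName][]string{}
	for _, c := range clusters {
		key := accountAndName{AccountID: c.AccountID, Name: c.Name}
		groups[key] = append(groups[key], c.Namespace+"/"+c.Name)
	}
	for key, members := range groups {
		if len(members) < 2 {
			delete(groups, key) // unique within its account, no collision
			continue
		}
		sort.Strings(members)
	}
	return groups
}

func main() {
	clusters := []clusterRef{
		{Namespace: "team-a", Name: "prod", AccountID: "111111111111"},
		{Namespace: "team-b", Name: "prod", AccountID: "111111111111"},
		{Namespace: "team-c", Name: "prod", AccountID: "222222222222"},
	}
	for key, members := range findNameCollisions(clusters) {
		fmt.Printf("account %s, cluster name %q: %v\n", key.AccountID, key.Name, members)
	}
}
```

In this example the clusters in `team-a` and `team-b` collide even if they use different credentials, because both credentials resolve to the same account.
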

The cluster-name collision could be mitigated by having CAPA call `STS::GetCallerIdentity` on entry
to the cluster, and then comparing against all current clusters declared across
the management cluster. We would probably need to store the AccountID in
the AWSCluster object and index the field.

A better approach may be to modify the cloud provider to support customisation
of tag lookups such that multiple clusters with the same name, but in
different namespaces, do not break the AWS Cloud Provider.

## Upgrade Strategy

The data changes are additive and optional, except for `AWSClusterControllerIdentity`.
The `AWSClusterControllerIdentity` singleton instance restricts the usage of controller credentials to `allowedNamespaces`.
AWSClusters that do not have an assigned `IdentityRef` are defaulted to use `AWSClusterControllerIdentity`, hence existing clusters need to have an
`AWSClusterControllerIdentity` instance. To allow existing AWSClusters to continue reconciling as before, a new controller is added as an experimental feature,
gated by the feature gate `AutoControllerIdentityCreator=true`. By default, this feature is enabled. This controller creates the `AWSClusterControllerIdentity` singleton instance (if missing), which allows all namespaces to use the identity.
During v1alpha4 releases, since breaking changes will be allowed, this feature will become obsolete.

## Additional Details

### Test Plan

* Unit tests to validate that the cluster controller can reconcile an AWSCluster when the IAMRoleARN field is nil, or provided.
* Unit tests to ensure pre-flight checks are performed relating to IAM role assumption when IAMRoleARN is provided.
  * Propose performing an initial sts:AssumeRole call and failing pre-flight if this fails.
* e2e test for role assumption ([NFR12](#NFR12)).
* If it can be supported in the Prow environment, an additional e2e test for OIDC-based role assumption.
* clusterctl e2e test with a move of a self-hosted cluster using an `identityRef`.

### Graduation Criteria

#### Alpha

* Support role assumption with external ID as an aws-sdk-go `credentials.Provider`.
* Ensure `clusterctl move` works with the mechanism.

#### Beta

* Support OIDC and chained role assumption.
* Admission controller validation for secrets of type.
* Full e2e coverage.

#### Stable

* Two releases since beta.

These may be defined in terms of API maturity, or as something else. Initial proposal should keep this high-level with a
focus on what signals will be looked at to determine graduation.

### Version Skew Strategy

Most of the version skew is contained within the secret type. We explicitly add a `v1alpha3` to the secret type, allowing
the format to change over time. However, we do not have the benefit of hub and spoke storage conversion that would exist
with a CRD. We propose that, in the event of a secret type signature change:

* The old version is accepted for 2 releases.
* For the n+3 and n+4 releases, the Cluster API Provider AWS controller will convert the storage version.
* On release n+5, remaining code for the old type signatures is deleted.

## Implementation History

- [ ] 2020/03/30: Initial proposal
- [ ] 2020/04/02: Revised proposal as [Google doc][google-doc]
- [ ] 2020/05/01: Presented proposal at meeting with stakeholders
- [ ] 2020/05/06: Open proposal PR

<!-- Links -->
[aws-sdk-go-credential-provider]: https://github.com/aws/aws-sdk-go/blob/master/aws/credentials/credentials.go#L35
[kiam]: https://github.com/uswitch/kiam
[aws-assume-role]: https://github.com/scalefactory/aws-assume-role
[google-doc]: https://docs.google.com/document/d/1vwjvWc-RIZfwDXb-3tW4gKBOaPCKhripfhXshOmB5z4/edit#heading=h.13bi7vgop9tn