# AWS Resource Handling

## Problem

Since the AWS APIs do not provide consistent support for tagging resources on creation or for idempotent operations, we need to give users as strong an external guarantee as we can. Otherwise we risk creating and orphaning resources in users' accounts, leading to extraneous costs and the potential to exhaust resource quotas prematurely.

## Existing solutions and drawbacks for our use

### kops

kops relies heavily on tagging of resources. This provides the benefit of recreating the state of the cluster resources by querying with filters. However, in many cases tagging requires a second API call, and some resources cannot be tagged at all. If we succeed in creating a resource but fail to tag it, we risk orphaning that resource for the user.

### Kubicorn

In contrast to the kops approach, Kubicorn mainly relies on recording the resource IDs as part of state. However, since we rely on an external API server for recording the state, it is still possible to create the resource and then fail to record its ID, which still risks orphaning that resource for the user.

## Summary of edge cases for creating an individual resource

1. resource create succeeds, but subsequent tagging fails
2. resource create succeeds, but update of cluster/machine object fails
3. attempting to delete the resource fails during a rollback that was triggered by a failure to record the ID of the created resource to the cluster/machine object, for resources that do not support tagging on create.
4. the controller/actuator dies after creating a resource but before tagging and/or recording the resource

## Misc TODOs

- Solicit feedback on whether aws-sdk-go and client-go retry defaults are sufficient:
  - aws-sdk-go
    - https://docs.aws.amazon.com/general/latest/gr/api-retries.html
    - https://docs.aws.amazon.com/sdk-for-go/api/aws/client/#DefaultRetryer
  - client-go
    - https://github.com/kubernetes/client-go/blob/master/rest/request.go#L659
- Identify which resources fall into which class of workflow based on tagging support, client token support, etc.
- Better define mutable/non-mutable attributes for objects during update

## Using client tokens

Where possible use [client tokens](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/Run_Instance_Idempotency.html) in the create request so that subsequent requests will return the same response.

## Tagging of resources

Resources handled by these components fall into one of three categories:

1. Fully-managed resources whose lifecycle is tied to the cluster. These resources should be tagged with `sigs.k8s.io/cluster-api-provider-aws/cluster/<name or id>=owned`, and the actuator is expected to keep these resources as closely in sync with the spec as possible.
2. Resources whose management is shared with the in-cluster aws cloud provider, such as a security group for load balancer ingress rules. These resources should be tagged with `sigs.k8s.io/cluster-api-provider-aws/cluster/<name or id>=owned` and `kubernetes.io/cluster/<name or id>=owned`, with the latter being the tag defined by the cloud provider. These resources are create/delete only: that is to say their ongoing management is "handed off" to the cloud provider.
3. Unmanaged resources that are provided by config (such as a common VPC). The provider will avoid changing these resources as much as is possible.
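
The three categories above map to tag sets fairly directly. The following is a minimal sketch of that mapping, not the provider's actual code: the type `ResourceCategory`, its constants, and the function `TagsFor` are hypothetical names introduced for illustration. Only the two tag key prefixes and the `owned` value come from the text above.

```go
package main

import "fmt"

// Tag key prefixes and value as given in the proposal text.
const (
	providerTagPrefix      = "sigs.k8s.io/cluster-api-provider-aws/cluster/"
	cloudProviderTagPrefix = "kubernetes.io/cluster/"
	ownedValue             = "owned"
)

// ResourceCategory is a hypothetical enum for the three categories.
type ResourceCategory int

const (
	FullyManaged ResourceCategory = iota
	SharedWithCloudProvider
	Unmanaged
)

// TagsFor returns the tags a resource in the given category should carry
// for the cluster identified by name or ID.
func TagsFor(category ResourceCategory, cluster string) map[string]string {
	tags := map[string]string{}
	switch category {
	case FullyManaged:
		tags[providerTagPrefix+cluster] = ownedValue
	case SharedWithCloudProvider:
		// Shared resources carry both the provider tag and the
		// cloud provider's own tag.
		tags[providerTagPrefix+cluster] = ownedValue
		tags[cloudProviderTagPrefix+cluster] = ownedValue
	case Unmanaged:
		// Unmanaged resources are left untouched, so no tags are added.
	}
	return tags
}

func main() {
	for k, v := range TagsFor(SharedWithCloudProvider, "test-cluster") {
		fmt.Printf("%s=%s\n", k, v)
	}
}
```

Keeping the mapping in one function means a query-by-tags filter and a create request can share the same tag set.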

TODO: Define additional tags that can be used to provide additional metadata about the resource configuration/usage by the actuator. This would allow us to rebuild status without polluting the object config.

## Handling of AWS API errors

Each resource has specific error codes that it will return, and these can be used to differentiate fatal errors from retryable errors. These errors are well documented in some cases (ELBv2 API) and poorly in others (EC2 API). We should make a best effort to [handle these errors properly](https://docs.aws.amazon.com/sdk-for-go/v1/developer-guide/handling-errors.html).

## Proposed workflows

### Resources that support tag on create (with or without client tokens)

#### Create

- Query resource by tags to determine if resource already exists
- Create the resource if it doesn't already exist
- Update the cluster/machine object config and status
  - If update fails return a retryable error to requeue the create
- Enqueue cluster/machine update if not already available/ready

TODO: flowchart

#### Update

- Query resource by ID
- Update object status
- Enqueue cluster/machine update if not already available/ready

#### Edge case coverage

1. Yes - tagging is handled on creation
2. Yes - since resources are tagged on creation, returning an error and requeueing the create will find the tagged resource and retry the object update.
3. Yes - there is no delete attempt since we can re-query the resource by tags.
4. Yes - the next attempt to create the resource will find the already created resource by tags.
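
The create steps for tag-on-create resources can be sketched as follows. This is an illustration under stated assumptions rather than the provider's implementation: `fakeCloud`, `Resource`, `reconcileCreate`, and `errRetryable` are hypothetical names, and the in-memory `fakeCloud` stands in for a real AWS API that accepts tags in the create request. The `record` callback represents the update of the cluster/machine object.

```go
package main

import (
	"errors"
	"fmt"
)

// Resource is a hypothetical stand-in for an AWS resource.
type Resource struct {
	ID   string
	Tags map[string]string
}

// fakeCloud is an in-memory stand-in for an API that supports tag on create.
type fakeCloud struct {
	resources []Resource
}

// findByTags returns the first resource carrying all of the given tags.
func (c *fakeCloud) findByTags(tags map[string]string) (Resource, bool) {
	for _, r := range c.resources {
		match := true
		for k, v := range tags {
			if r.Tags[k] != v {
				match = false
				break
			}
		}
		if match {
			return r, true
		}
	}
	return Resource{}, false
}

// createWithTags creates a resource and tags it in the same call.
func (c *fakeCloud) createWithTags(tags map[string]string) Resource {
	r := Resource{ID: fmt.Sprintf("res-%d", len(c.resources)+1), Tags: tags}
	c.resources = append(c.resources, r)
	return r
}

// errRetryable marks errors that should requeue the create.
var errRetryable = errors.New("retryable: object update failed, requeue create")

// reconcileCreate queries by tags, creates only if nothing is found, then
// records the ID. A failed record returns a retryable error; the next
// attempt rediscovers the resource by its tags instead of creating another.
func reconcileCreate(c *fakeCloud, tags map[string]string, record func(id string) error) (string, error) {
	r, found := c.findByTags(tags)
	if !found {
		r = c.createWithTags(tags)
	}
	if err := record(r.ID); err != nil {
		return r.ID, fmt.Errorf("%w: %v", errRetryable, err)
	}
	return r.ID, nil
}

func main() {
	cloud := &fakeCloud{}
	tags := map[string]string{"sigs.k8s.io/cluster-api-provider-aws/cluster/test": "owned"}

	// First attempt: the resource is created but recording the ID fails.
	_, err := reconcileCreate(cloud, tags, func(string) error {
		return errors.New("api server unavailable")
	})
	fmt.Println("first attempt:", err)

	// Second attempt: the existing resource is found by tags and recorded.
	id, err := reconcileCreate(cloud, tags, func(string) error { return nil })
	fmt.Println("second attempt:", id, err, "resources:", len(cloud.resources))
}
```

Because the lookup by tags happens before every create, the requeued attempt covers edge cases 2 and 4 without any delete step.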

### Resources that support client tokens but require separate tagging

#### Create

- Create the resource using object uid as the client token
- Update the cluster/machine object config and status
  - If update fails return a retryable error to requeue the create
- Tag AWS resource
- Enqueue cluster/machine update if not already available/ready

TODO: flowchart

#### Update

- Query resource by ID
- Tag resource if missing tags
- Update object status
- Enqueue cluster/machine update if not already available/ready

#### Edge case coverage

1. Yes - If the object update succeeded, tagging is reconciled the next time the object is reconciled. If the update did not succeed, the next create call using the same client token returns the same resource as previously created.
2. Yes - since we are using a client token, subsequent requests will return the same result.
3. Yes - there is no delete attempt since we can repeat the request to create the resource safely.
4. Yes - the next attempt to create the resource will return the already created resource.

### Resources that require separate tagging without client token support

#### Create - option 1

- Create resource
- Update cluster/machine object config and status
  - If update fails attempt delete of created resource
    - If delete fails log delete failure and return non-retryable error
- Tag AWS resource
- Enqueue cluster/machine update if not already available/ready or tagging fails

TODO: flowchart

#### Update - option 1

- Query resource by ID
- Tag resource if missing tags
- Update object status
- Enqueue cluster/machine update if not already available/ready

#### Edge case coverage - option 1

1. Yes - Since the resource ID is already recorded, the update process will reconcile missing tags
2. Yes, with caveat - If the object update fails, we attempt to roll back the creation, but edge case 3 comes into play
3. Minor mitigation - If we fail to delete the resource, we will still orphan the resource, but output a log message for querying/followup and return a non-retryable error
4. Minor mitigation - If the process dies before recording the ID, the resource is orphaned. If the process dies after recording the ID but before tagging, it is reconciled through update.

#### Create - option 2

- Query resource by tags to determine if resource already exists
- Create the resource if it doesn't already exist
- Update cluster/machine object config and status
  - Note failure but do not return error yet
- Tag AWS resource if needed
  - If both update and tagging fail, delete resource
    - If delete fails log failure prominently, return non-retryable error
  - If only tagging fails, return retryable error
- If update failed, return retryable error
- Enqueue cluster/machine update if not already available/ready or tagging fails

TODO: flowchart

#### Update - option 2

- Query resource by ID
- Update object status
- Enqueue cluster/machine update if not already available/ready

#### Edge case coverage - option 2

1. Yes - If only the tagging fails, then we will reconcile tags on update. If the update also fails, then we attempt to delete the resource.
2. Yes, with caveat - If only the object update fails, then we return a retryable error that will requeue the create operation and attempt to update the object after discovering the existing resource. If the tagging fails as well, then we attempt to delete the resource and edge case 3 will still apply.
3. Minor mitigation - If we fail to delete the resource, we will still orphan the resource, but output a log message for querying/followup and return a non-retryable error
4. Minor mitigation - If the process dies before recording the ID, the resource is orphaned. If the process dies after recording the ID but before tagging, it is reconciled through update.

### Resources without tag support with client token support

#### Create

- Create the resource using the object uid as the client token
- Update cluster/machine object config and status
- Enqueue cluster/machine update if not already available/ready

TODO: flowchart

#### Update

- Query resource by ID
- Update object status
- Enqueue cluster/machine update if not already available/ready

#### Edge case coverage

1. Yes - There is no tagging
2. Yes - If the update fails, subsequent calls will return the same resource.
3. Yes - No delete is used
4. Yes - If the process dies before recording the ID, subsequent calls to create the resource return the same resource.

### Resources without tag support without client token support

#### Create

- Create the resource
- Update cluster/machine object config and status
  - If update fails, delete resource
    - If delete fails log failure prominently, return non-retryable error
- Enqueue cluster/machine update if not already available/ready

TODO: flowchart

#### Update

- Query resource by ID
- Update object status
- Enqueue cluster/machine update if not already available/ready

#### Edge case coverage

1. Yes - There is no tagging
2. Yes, with caveat - If the update fails, then we attempt to delete the resource and edge case 3 will still apply.
3. Minor mitigation - If we fail to delete the resource, we will still orphan the resource, but output a log message for querying/followup and return a non-retryable error
4. No - If the process dies before recording the ID, the resource is orphaned.
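
The create-then-rollback flow for resources with neither tag nor client token support can be sketched as below. This is a hedged illustration, not the provider's code: `cloudAPI`, `fakeAPI`, `createWithRollback`, `isNonRetryable`, and both sentinel errors are hypothetical names. The proposal does not say what to return when the rollback delete succeeds; this sketch assumes the original update error is returned so the caller can retry.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// cloudAPI is a hypothetical minimal interface for a resource type that
// supports neither tagging nor client tokens.
type cloudAPI interface {
	Create() (string, error)
	Delete(id string) error
}

var (
	errNonRetryable = errors.New("non-retryable: resource may be orphaned")
	errRecordFailed = errors.New("recording resource ID failed")
)

// isNonRetryable reports whether err wraps errNonRetryable.
func isNonRetryable(err error) bool { return errors.Is(err, errNonRetryable) }

// createWithRollback creates a resource and records its ID. If recording
// fails it deletes the resource; if that delete also fails it logs the
// orphaned ID prominently and returns a non-retryable error.
func createWithRollback(api cloudAPI, record func(id string) error) (string, error) {
	id, err := api.Create()
	if err != nil {
		return "", err
	}
	if recErr := record(id); recErr != nil {
		if delErr := api.Delete(id); delErr != nil {
			log.Printf("ORPHANED RESOURCE %s: record failed (%v), delete failed (%v)", id, recErr, delErr)
			return id, fmt.Errorf("%w: %s", errNonRetryable, id)
		}
		// Assumption: rollback succeeded, so surface the update error.
		return "", recErr
	}
	return id, nil
}

// fakeAPI is an in-memory stand-in used only for this sketch.
type fakeAPI struct {
	next       int
	live       map[string]bool
	failDelete bool
}

func (f *fakeAPI) Create() (string, error) {
	f.next++
	id := fmt.Sprintf("res-%d", f.next)
	f.live[id] = true
	return id, nil
}

func (f *fakeAPI) Delete(id string) error {
	if f.failDelete {
		return errors.New("delete failed")
	}
	delete(f.live, id)
	return nil
}

func main() {
	api := &fakeAPI{live: map[string]bool{}, failDelete: true}
	_, err := createWithRollback(api, func(string) error { return errRecordFailed })
	fmt.Println("non-retryable:", isNonRetryable(err), "orphans:", len(api.live))
}
```

The sketch cannot cover edge case 4: if the process dies between `Create` and `record`, nothing in this flow can rediscover the resource.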