sigs.k8s.io/cluster-api-provider-aws@v1.5.5/docs/proposal/20181010-aws-resource-handling.md

     1  # AWS Resource Handling
     2  
     3  ## Problem
     4  
     5  Because the AWS APIs do not consistently support tagging resources on creation or idempotent operations, we need to give users as strong a guarantee as we can that the resources we create are tracked. Otherwise we risk creating and then orphaning resources, which leads to extraneous costs for users and can exhaust their resource quotas prematurely.
     6  
     7  ## Existing solutions and drawbacks for our use
     8  
     9  ### kops
    10  
    11  kops relies heavily on tagging of resources. This has the benefit that the state of the cluster resources can be recreated by querying with filters. However, in many cases tagging requires a second API call, and some resources cannot be tagged at all. If we succeed in creating a resource but fail to tag it, we risk orphaning that resource for the user.
    12  
    13  ### Kubicorn
    14  
    15  In contrast to the kops approach, Kubicorn mainly relies on recording the resource IDs as part of state. However, since we rely on an external API server for recording the state, it is still possible to create the resource and then fail to record its ID, which again risks orphaning that resource for the user.
    16  
    17  ## Summary of edge cases for creating an individual resource
    18  
    19  1. resource create succeeds, but subsequent tagging fails
    20  2. resource create succeeds, but update of cluster/machine object fails
    21  3. for resources that do not support tagging on create, recording the ID of the created resource to the cluster/machine object fails, and the delete attempted as a rollback also fails
    22  4. the controller/actuator dies after creating a resource but before tagging and/or recording the resource
    23  
    24  ## Misc TODOs
    25  
    26  - Solicit feedback on whether aws-sdk-go and client-go retry defaults are sufficient:
    27    - aws-sdk-go
    28      - https://docs.aws.amazon.com/general/latest/gr/api-retries.html
    29      - https://docs.aws.amazon.com/sdk-for-go/api/aws/client/#DefaultRetryer
    30    - client-go
    31      - https://github.com/kubernetes/client-go/blob/master/rest/request.go#L659
    32  - Identify which resources fall into which class of workflow based on tagging support, client token support, etc.
    33  - Better define mutable/immutable attributes for objects during update
    34  
    35  ## Using client tokens
    36  
    37  Where possible use [client tokens](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/Run_Instance_Idempotency.html) in the create request so that subsequent requests will return the same response.
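
As a rough illustration of why a client token makes a create request safe to repeat, the sketch below uses an in-memory stand-in for an AWS API. The `fakeEC2` type and `createInstance` helper are hypothetical names invented for this example and are not part of aws-sdk-go; the only idea taken from the proposal is reusing a stable identifier (here the object UID) as the token.

```go
package main

import "fmt"

// fakeEC2 is a hypothetical in-memory stand-in for an AWS API that honors
// client tokens. It is not part of aws-sdk-go.
type fakeEC2 struct {
	byToken map[string]string // client token -> resource ID
	nextID  int
}

// Create returns the resource already associated with token, or creates a
// new one and remembers it under that token.
func (f *fakeEC2) Create(token string) string {
	if id, ok := f.byToken[token]; ok {
		return id
	}
	f.nextID++
	id := fmt.Sprintf("i-%04d", f.nextID)
	f.byToken[token] = id
	return id
}

// createInstance uses the object's UID as the client token, so a retried
// request maps to the same resource instead of creating a duplicate.
func createInstance(api *fakeEC2, objectUID string) string {
	return api.Create(objectUID)
}

func main() {
	api := &fakeEC2{byToken: map[string]string{}}
	first := createInstance(api, "uid-a")
	retry := createInstance(api, "uid-a")
	other := createInstance(api, "uid-b")
	fmt.Println(first == retry, first == other) // prints "true false"
}
```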
    38  
    39  ## Tagging of resources
    40  
    41  Resources handled by these components fall into one of three categories:
    42  
    43  1. Fully-managed resources whose lifecycle is tied to the cluster. These resources should be tagged with `sigs.k8s.io/cluster-api-provider-aws/cluster/<name or id>=owned`, and the actuator is expected to keep these resources as closely in sync with the spec as possible.
    44  2. Resources whose management is shared with the in-cluster aws cloud provider, such as a security group for load balancer ingress rules. These resources should be tagged with `sigs.k8s.io/cluster-api-provider-aws/cluster/<name or id>=owned` and `kubernetes.io/cluster/<name or id>=owned`, with the latter being the tag defined by the cloud provider. These resources are create/delete only: that is to say their ongoing management is "handed off" to the cloud provider.
    45  3. Unmanaged resources that are provided by config (such as a common VPC). The provider will avoid changing these resources as much as is possible.
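
The three categories above could map to tag sets roughly as follows. This is a minimal sketch: the tag keys and the `owned` value come from the list above, while the `category` type and the `tagsFor` helper are hypothetical names introduced only for illustration.

```go
package main

import "fmt"

const (
	// Tag key prefixes taken from the categories described above.
	providerTagPrefix      = "sigs.k8s.io/cluster-api-provider-aws/cluster/"
	cloudProviderTagPrefix = "kubernetes.io/cluster/"
	ownedValue             = "owned"
)

// category is a hypothetical enumeration of the three resource classes.
type category int

const (
	fullyManaged category = iota
	sharedWithCloudProvider
	unmanaged
)

// tagsFor returns the tags a resource of the given category would carry
// for the named cluster.
func tagsFor(c category, cluster string) map[string]string {
	tags := map[string]string{}
	switch c {
	case fullyManaged:
		tags[providerTagPrefix+cluster] = ownedValue
	case sharedWithCloudProvider:
		tags[providerTagPrefix+cluster] = ownedValue
		tags[cloudProviderTagPrefix+cluster] = ownedValue
	case unmanaged:
		// Unmanaged resources are provided by config and are left untagged.
	}
	return tags
}

func main() {
	for _, c := range []category{fullyManaged, sharedWithCloudProvider, unmanaged} {
		fmt.Println(len(tagsFor(c, "test")))
	}
}
```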
    46  
    47  TODO: Define additional tags that can be used to provide additional metadata about the resource configuration/usage by the actuator. This would allow us to rebuild status without polluting the object config.
    48  
    49  ## Handling of AWS API errors
    50  
    51  Each resource has specific error codes that it will return, and these can be used to differentiate fatal errors from retryable errors. These errors are well documented in some cases (ELBv2 API) and poorly in others (EC2 API). We should make a best effort to [handle these errors properly](https://docs.aws.amazon.com/sdk-for-go/v1/developer-guide/handling-errors.html).
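
One way to separate retryable from fatal errors is a lookup on the service error code. The sketch below is an assumption-laden illustration: `codedError` is a local interface that only mirrors the `Code()` method exposed by SDK errors, `apiError` is a hypothetical stand-in for a real API error, and the code table is illustrative rather than a vetted list for any particular service.

```go
package main

import (
	"errors"
	"fmt"
)

// codedError is a local, hypothetical interface for errors that carry a
// service error code.
type codedError interface {
	error
	Code() string
}

// apiError is a stand-in for an error returned by an AWS API call.
type apiError struct{ code, msg string }

func (e *apiError) Error() string { return e.code + ": " + e.msg }
func (e *apiError) Code() string  { return e.code }

// retryableCodes is an illustrative table; a real one would be built per
// service from that service's error documentation.
var retryableCodes = map[string]bool{
	"RequestLimitExceeded": true,
	"InternalError":        true,
	"Unavailable":          true,
}

// isRetryable reports whether err, or any error it wraps, carries a code
// from the retryable table. Errors without a code are treated as fatal.
func isRetryable(err error) bool {
	var ce codedError
	if errors.As(err, &ce) {
		return retryableCodes[ce.Code()]
	}
	return false
}

func main() {
	throttled := &apiError{code: "RequestLimitExceeded", msg: "slow down"}
	wrapped := fmt.Errorf("creating subnet: %w", throttled)
	invalid := &apiError{code: "InvalidParameterValue", msg: "bad CIDR"}
	fmt.Println(isRetryable(wrapped), isRetryable(invalid), isRetryable(errors.New("unknown")))
	// prints "true false false"
}
```

Treating unknown errors as fatal is a design choice made for this sketch; a controller could just as reasonably requeue on unknown errors with backoff.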
    52  
    53  ## Proposed workflows
    54  
    55  ### Resources that support tag on create (with or without client tokens)
    56  
    57  #### Create
    58  
    59  - Query resource by tags to determine if resource already exists
    60  - Create the resource if it doesn't already exist
    61  - Update the cluster/machine object config and status
    62    - If update fails return a retryable error to requeue the create
    63  - Enqueue cluster/machine update if not already available/ready
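
The create steps above can be sketched as follows, with an in-memory fake in place of AWS. Every name here (`cloud`, `object`, `reconcileCreate`, `errRetryable`) is hypothetical and chosen for the example; the point is only that a failed object update followed by a requeue finds the existing resource by tag instead of creating a second one.

```go
package main

import (
	"errors"
	"fmt"
)

// errRetryable marks errors that should cause the create to be requeued.
var errRetryable = errors.New("retryable")

// cloud is a hypothetical in-memory stand-in for an API that tags on create.
type cloud struct {
	byTag   map[string]string // tag key -> resource ID
	creates int
}

func (c *cloud) findByTag(tag string) (string, bool) {
	id, ok := c.byTag[tag]
	return id, ok
}

func (c *cloud) createTagged(tag string) string {
	c.creates++
	id := fmt.Sprintf("res-%d", c.creates)
	c.byTag[tag] = id
	return id
}

// object stands in for the cluster/machine object that records the ID.
type object struct{ resourceID string }

// reconcileCreate queries by tag, creates only if nothing is found, then
// records the ID. A failed object update is returned as a retryable error.
func reconcileCreate(c *cloud, tag string, obj *object, update func(*object, string) error) error {
	id, ok := c.findByTag(tag)
	if !ok {
		id = c.createTagged(tag)
	}
	if err := update(obj, id); err != nil {
		return fmt.Errorf("%w: recording %s: %v", errRetryable, id, err)
	}
	return nil
}

func main() {
	c := &cloud{byTag: map[string]string{}}
	obj := &object{}
	tag := "sigs.k8s.io/cluster-api-provider-aws/cluster/test"
	failing := func(*object, string) error { return errors.New("api server unavailable") }
	recording := func(o *object, id string) error { o.resourceID = id; return nil }

	err := reconcileCreate(c, tag, obj, failing)
	fmt.Println(errors.Is(err, errRetryable), c.creates) // prints "true 1"
	err = reconcileCreate(c, tag, obj, recording)
	fmt.Println(err, c.creates, obj.resourceID) // prints "<nil> 1 res-1"
}
```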
    64  
    65  TODO: flowchart
    66  
    67  #### Update
    68  
    69  - Query resource by ID
    70  - Update object status
    71  - Enqueue cluster/machine update if not already available/ready
    72  
    73  #### Edge case coverage
    74  
    75  1. Yes - tagging is handled on creation
    76  2. Yes - since resources are tagged on creation, returning an error and requeueing the create will find the tagged resource and attempt to retry the object update.
    77  3. Yes - there is no delete attempt since we can re-query the resource by tags.
    78  4. Yes - the next attempt to create the resource will find the already created resource by tags.
    79  
    80  ### Resources that support client tokens but require separate tagging
    81  
    82  #### Create
    83  
    84  - Create the resource using object uid as the client token
    85  - Update the cluster/machine object config and status
    86    - If update fails return a retryable error to requeue the create
    87  - Tag AWS resource
    88  - Enqueue cluster/machine update if not already available/ready
    89  
    90  TODO: flowchart
    91  
    92  #### Update
    93  
    94  - Query resource by ID
    95  - Tag resource if missing tags
    96  - Update object status
    97  - Enqueue cluster/machine update if not already available/ready
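
The "tag resource if missing tags" step in the update flow above amounts to a diff between desired and actual tags. A minimal sketch, assuming tags are plain string maps and using a hypothetical `missingTags` helper:

```go
package main

import "fmt"

// missingTags returns the desired tags that are absent from actual or
// present with a different value. Extra tags on the resource are ignored.
func missingTags(desired, actual map[string]string) map[string]string {
	missing := map[string]string{}
	for k, v := range desired {
		if got, ok := actual[k]; !ok || got != v {
			missing[k] = v
		}
	}
	return missing
}

func main() {
	desired := map[string]string{
		"sigs.k8s.io/cluster-api-provider-aws/cluster/test": "owned",
		"kubernetes.io/cluster/test":                        "owned",
	}
	// The resource was created and recorded, but only one tag was applied.
	actual := map[string]string{"kubernetes.io/cluster/test": "owned"}
	fmt.Println(missingTags(desired, actual))
}
```

Only the returned map would need to be sent in a tagging call, so an already fully tagged resource results in no API request.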
    98  
    99  #### Edge case coverage
   100  
   101  1. Yes - If the update was successful, tagging will be reconciled the next time the object is updated during reconciliation. If the update was not successful, the next call using the same client token will return the same object as previously created.
   102  2. Yes - since we are using a client token, subsequent requests will return the same result.
   103  3. Yes - there is no delete attempt since we can repeat the request to create the resource safely.
   104  4. Yes - the next attempt to create the resource will return the already created resource.
   105  
   106  ### Resources that require separate tagging without client token support
   107  
   108  #### Create - option 1
   109  
   110  - Create resource
   111  - Update cluster/machine object config and status
   112    - If update fails attempt delete of created resource
   113      - If delete fails log delete failure and return non-retryable error
   114  - Tag AWS resource
   115  - Enqueue cluster/machine update if not already available/ready or tagging fails
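
The option 1 create flow above might look like the following sketch. The `ops` struct, `createOption1`, and `errNonRetryable` are hypothetical names, and the AWS and API server calls are injected as plain functions so the control flow can be read in isolation. Returning the update error when the rollback delete succeeds is an assumption; the list above does not specify that case.

```go
package main

import (
	"errors"
	"fmt"
)

// errNonRetryable marks failures that must not be requeued.
var errNonRetryable = errors.New("non-retryable")

// ops bundles the external calls; all fields are hypothetical stand-ins.
type ops struct {
	create func() (string, error) // create the AWS resource
	record func(id string) error  // update cluster/machine object
	remove func(id string) error  // delete the AWS resource
	tag    func(id string) error  // tag the AWS resource
}

// createOption1 returns whether an update should be enqueued and any error.
func createOption1(o ops, logf func(string, ...interface{})) (bool, error) {
	id, err := o.create()
	if err != nil {
		return false, err
	}
	if recErr := o.record(id); recErr != nil {
		// The ID could not be recorded, so roll back the creation.
		if delErr := o.remove(id); delErr != nil {
			logf("orphaned resource %s: delete failed: %v", id, delErr)
			return false, fmt.Errorf("%w: resource %s orphaned", errNonRetryable, id)
		}
		return false, recErr // assumption: a clean rollback surfaces the update error
	}
	if tagErr := o.tag(id); tagErr != nil {
		return true, nil // the ID is recorded; update reconciles tags later
	}
	return false, nil
}

func main() {
	logf := func(format string, args ...interface{}) { fmt.Printf(format+"\n", args...) }
	o := ops{
		create: func() (string, error) { return "res-1", nil },
		record: func(string) error { return errors.New("update conflict") },
		remove: func(string) error { return errors.New("access denied") },
		tag:    func(string) error { return nil },
	}
	_, err := createOption1(o, logf)
	fmt.Println(errors.Is(err, errNonRetryable)) // prints "true"
}
```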
   116  
   117  TODO: flowchart
   118  
   119  #### Update - option 1
   120  
   121  - Query resource by ID
   122  - Tag resource if missing tags
   123  - Update object status
   124  - Enqueue cluster/machine update if not already available/ready
   125  
   126  #### Edge case coverage - option 1
   127  
   128  1. Yes - Since the resource ID is already recorded, the update process will reconcile missing tags
   129  2. Yes, with caveat - If the object update fails, we attempt to roll back the creation, but edge case 3 comes into play
   130  3. Minor mitigation - If we fail to delete the resource, we will still orphan the resource, but output a log message for querying/followup and return a non-retryable error
   131  4. Minor mitigation - If the process dies before recording the ID the resource is orphaned. If the process dies after recording the ID, but before tagging it is reconciled through update.
   132  
   133  #### Create - option 2
   134  
   135  - Query resource by tags to determine if resource already exists
   136  - Create the resource if it doesn't already exist
   137  - Update cluster/machine object config and status
   138    - Note failure but do not return error yet
   139  - Tag AWS resource if needed
   140    - If both update and tagging fails, delete resource
   141      - If delete fails log failure prominently, return non-retryable error
   142    - If only tagging fails, return retryable error
   143  - If update failed, return retryable error
   144  - Enqueue cluster/machine update if not already available/ready or tagging fails
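
The branching in the option 2 create flow depends only on which of the two follow-up steps failed, so it can be summarized as a small decision function. This is a sketch of that table, not an implementation of the flow; `outcome`, `decide`, and the sentinel errors are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// outcome is what the create step does after attempting update and tagging.
type outcome int

const (
	succeeded   outcome = iota // nothing failed
	retryCreate                // return a retryable error and requeue
	rollBack                   // delete the resource; non-retryable if delete fails
)

// Sentinel errors used only to drive the example.
var (
	errUpdate = errors.New("object update failed")
	errTag    = errors.New("tagging failed")
)

// decide maps the results of the object update and the tagging call to the
// next action described in the option 2 list above.
func decide(updateErr, tagErr error) outcome {
	switch {
	case updateErr != nil && tagErr != nil:
		// Neither the ID nor the tags were persisted, so the resource
		// could not be found again. Delete it now.
		return rollBack
	case tagErr != nil:
		return retryCreate
	case updateErr != nil:
		// The tags exist, so a requeued create will find the resource.
		return retryCreate
	default:
		return succeeded
	}
}

func main() {
	fmt.Println(decide(nil, nil) == succeeded)
	fmt.Println(decide(errUpdate, nil) == retryCreate)
	fmt.Println(decide(nil, errTag) == retryCreate)
	fmt.Println(decide(errUpdate, errTag) == rollBack)
}
```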
   145  
   146  TODO: flowchart
   147  
   148  #### Update - option 2
   149  
   150  - Query resource by ID
   151  - Update object status
   152  - Enqueue cluster/machine update if not already available/ready
   153  
   154  #### Edge case coverage - option 2
   155  
   156  1. Yes - If only the tagging fails, then we will reconcile tags on update. If the update also fails, then we attempt to delete the resource.
   157  2. Yes, with caveat - If only the object update fails, then we throw a retryable error that will requeue the create operation and attempt to update the object after discovering the existing resource. If the tagging fails as well, then we attempt to delete the resource and edge case 3 will still apply.
   158  3. Minor mitigation - If we fail to delete the resource, we will still orphan the resource, but output a log message for querying/followup and return a non-retryable error
   159  4. Minor mitigation - If the process dies before recording the ID the resource is orphaned. If the process dies after recording the ID, but before tagging it is reconciled through update.
   160  
   161  ### Resources without tag support with client token support
   162  
   163  #### Create
   164  
   165  - Create the resource using the object uid as the client token
   166  - Update cluster/machine object config and status
   167  - Enqueue cluster/machine update if not already available/ready
   168  
   169  TODO: flowchart
   170  
   171  #### Update
   172  
   173  - Query resource by ID
   174  - Update object status
   175  - Enqueue cluster/machine update if not already available/ready
   176  
   177  #### Edge case coverage
   178  
   179  1. Yes - There is no tagging
   180  2. Yes - If the update fails, subsequent calls will return the same resource.
   181  3. Yes - No delete is used
   182  4. Yes - If the process dies before recording the ID subsequent calls to create the resource return the same resource.
   183  
   184  ### Resources without tag support without client token support
   185  
   186  #### Create
   187  
   188  - Create the resource
   189  - Update cluster/machine object config and status
   190    - If update fails, delete resource
   191      - If delete fails log failure prominently, return non-retryable error
   192  - Enqueue cluster/machine update if not already available/ready
   193  
   194  TODO: flowchart
   195  
   196  #### Update
   197  
   198  - Query resource by ID
   199  - Update object status
   200  - Enqueue cluster/machine update if not already available/ready
   201  
   202  #### Edge case coverage
   203  
   204  1. Yes - There is no tagging
   205  2. Yes, with caveat - If the update fails, then we attempt to delete the resource and edge case 3 will still apply.
   206  3. Minor mitigation - If we fail to delete the resource, we will still orphan the resource, but output a log message for querying/followup and return a non-retryable error
   207  4. No - If the process dies before recording the ID the resource is orphaned.