sigs.k8s.io/cluster-api-provider-azure@v1.14.3/docs/proposals/20210716-async-azure-resource-creation-deletion.md

---
title: Async Azure Resource Creation and Deletion
authors:
    - @CecileRobertMichon
    - @devigned
reviewers:
    - TBD
creation-date: 2021-07-16
last-updated: 2021-07-26
status: implementable
see-also:
    - https://github.com/kubernetes-sigs/cluster-api-provider-azure/issues/1181
    - https://github.com/kubernetes-sigs/cluster-api-provider-azure/pull/1067
---

# Async Azure Resource Creation and Deletion

## <a name='TableofContents'></a>Table of Contents

<!-- vscode-markdown-toc -->
* [Table of Contents](#TableofContents)
* [Summary](#Summary)
* [Motivation](#Motivation)
	* [Goals](#Goals)
	* [Non-Goals / Future Work](#Non-GoalsFutureWork)
* [Proposal](#Proposal)
	* [User Stories](#UserStories)
		* [Story 1 - UX of creating an AzureCluster](#Story1-UXofcreatinganAzureCluster)
		* [Story 2 - Creating AzureMachines concurrently](#Story2-ScalingupaMachineDeployment)
		* [Story 3 - Deleting an individual Azure Machine Pool Machine](#Story3-DeletinganindividualAzureMachinePoolMachine)
	* [Implementation Details/Notes/Constraints](#ImplementationDetailsNotesConstraints)
	* [Proposed API Changes](#ProposedAPIChanges)
	* [Proposed Controller Changes](#ProposedControllerChanges)
		* [Context timeouts](#Contexttimeouts)
		* [Service Reconcile](#ServiceReconcile)
		* [Service Delete](#ServiceDelete)
		* [AzureCluster Reconcile](#AzureClusterReconcile)
	* [Proposed New Conditions](#ProposedNewConditions)
	* [Open Questions](#OpenQuestions)
		* [1. What should the timeout durations be?](#Whatshouldthetimeoutdurationsbe)
* [Alternatives](#Alternatives)
	* [Parallel reconciliation of Azure services](#ParallelreconciliationofAzureservices)
		* [Pros](#Pros)
		* [Cons](#Cons)
		* [Conclusion](#Conclusion)
* [Additional Details](#AdditionalDetails)
	* [Test Plan](#TestPlan)
* [Implementation History](#ImplementationHistory)

<!-- vscode-markdown-toc-config
	numbering=false
	autoSave=false
	/vscode-markdown-toc-config -->
<!-- /vscode-markdown-toc -->

## <a name='Summary'></a>Summary

CAPZ reconcilers currently call Azure and wait for each operation to complete before proceeding. We should create, update, and delete Azure resources asynchronously, especially for operations that take a long time to complete, such as Virtual Machine creation and deletion.

## <a name='Motivation'></a>Motivation

Blocking until an operation succeeds is sometimes the right thing to do. Most of the time, however, it is the equivalent of an app's UI freezing because data is being fetched on the UI thread, leaving the user to wonder why the software is unresponsive and when it will react. This proposal aims to make the CAPZ controller react drastically faster and, possibly, make it more resilient.

### <a name='Goals'></a>Goals

- Accelerate the feedback loop with the user so they can know that reconciliation is progressing without having to check the Azure resources in the portal, CLI, etc.
- Make the controller react to a change much faster
- Improve the resiliency of the controller by making it more fault tolerant
- Make it easier for the user to understand the state of each resource by adding more granular Conditions
- Apply the same asynchronous pattern to all resources

### <a name='Non-GoalsFutureWork'></a>Non-Goals / Future Work

- Increase or decrease overall duration of reconciliation
- Increase the number of API calls to Azure
- Start Azure operations for an AzureCluster, AzureMachine, or AzureMachinePool in parallel
- Predict how long each operation will take

## <a name='Proposal'></a>Proposal

### <a name='UserStories'></a>User Stories

#### <a name='Story1-UXofcreatinganAzureCluster'></a>Story 1 - UX of creating an AzureCluster

Blake is a Program Manager trying out Cluster API for the first time. Blake is following the quickstart documentation in the Cluster API book and using Azure to create a cluster. Blake applies the cluster template on the management cluster and describes the resulting AzureCluster resource. The AzureCluster is in "Creating" state and the Conditions get updated as Azure resources are created to show the progress.

#### <a name='Story2-ScalingupaMachineDeployment'></a>Story 2 - Creating AzureMachines concurrently

Alex is an engineer in a large organization which has a MachineDeployment running. Alex needs to scale up the MachineDeployment by 200 replicas and uses `kubectl` to do so. The AzureMachine controller in the management cluster is running with the default concurrency of 10. Ten new AzureMachines are created and their state quickly becomes "Creating". Shortly after, before the first ten machines are done creating, ten more start creating. The same thing happens until all 200 AzureMachines are in "Creating" state. Alex checks the Conditions on one of the creating AzureMachines and sees that the network interface was created successfully and that the VM is being created. Because the 200 new VMs are created concurrently, Alex can scale up quickly without having to increase the controller concurrency.

#### <a name='Story3-DeletinganindividualAzureMachinePoolMachine'></a>Story 3 - Deleting an individual Azure Machine Pool Machine

Kai is an engineer in a large organization which has a MachinePool running. Kai needs to delete the Machine Pool. Kai uses `kubectl` to delete the Machine Pool. After a few seconds, Kai checks the Conditions on the MachinePool and sees that the VM is being deleted.

### <a name='ImplementationDetailsNotesConstraints'></a>Implementation Details/Notes/Constraints

There is an existing implementation of asynchronous reconciliation for AzureMachinePools. The `AzureMachinePoolStatus` stores a single `LongRunningOperationState`, which holds the Future returned by VMSS long-running operations.

```go
// Future contains the data needed for an Azure long-running operation to continue across reconcile loops.
type Future struct {
    // Type describes the type of future, update, create, delete, etc
    Type string `json:"type"`
    // ResourceGroup is the Azure resource group for the resource
    // +optional
    ResourceGroup string `json:"resourceGroup,omitempty"`
    // Name is the name of the Azure resource
    // +optional
    Name string `json:"name,omitempty"`
    // FutureData is the base64 url encoded json Azure AutoRest Future
    FutureData string `json:"futureData,omitempty"`
}

// AzureMachinePoolStatus defines the observed state of AzureMachinePool
type AzureMachinePoolStatus struct {
    /*
      Other fields omitted for brevity
    */

    // LongRunningOperationState saves the state for an Azure long-running operation so it can be continued on the
    // next reconciliation loop.
    // +optional
    LongRunningOperationState *infrav1.Future `json:"longRunningOperationState,omitempty"`
}
```

### <a name='ProposedAPIChanges'></a>Proposed API Changes

The changes below apply to AzureCluster, AzureMachine, AzureMachinePool, and AzureMachinePoolMachine. The existing `LongRunningOperationState` field in AzureMachinePoolStatus will be pluralized to `LongRunningOperationStates` to store a list of Futures, following a pattern similar to that of Conditions, and will be extended to the other CAPZ CRDs. In addition, the `Name` field of the `Future` type will be made required, as it becomes the identifier for a Future. The `FutureData` field is also renamed to `Data`, and a required `ServiceName` field is added.

```go
// Future contains the data needed for an Azure long-running operation to continue across reconcile loops.
type Future struct {
    // Type describes the type of future, such as update, create, delete, etc
    Type string `json:"type"`
    // ResourceGroup is the Azure resource group for the resource.
    // +optional
    ResourceGroup string `json:"resourceGroup,omitempty"`
    // ServiceName is the name of the Azure service the resource belongs to.
    ServiceName string `json:"serviceName"`
    // Name is the name of the Azure resource.
    Name string `json:"name"`
    // Data is the base64 url encoded json Azure AutoRest Future.
    Data string `json:"data,omitempty"`
}

type Futures []Future

// AzureClusterStatus defines the observed state of AzureCluster.
type AzureClusterStatus struct {
    /*
      Other fields omitted for brevity
    */

    // LongRunningOperationStates saves the states for Azure long-running operations so they can be continued on the
    // next reconciliation loop.
    // +optional
    LongRunningOperationStates Futures `json:"longRunningOperationStates,omitempty"`
}

// AzureMachineStatus defines the observed state of AzureMachine.
type AzureMachineStatus struct {
    /*
      Other fields omitted for brevity
    */

    // LongRunningOperationStates saves the states for Azure long-running operations so they can be continued on the
    // next reconciliation loop.
    // +optional
    LongRunningOperationStates Futures `json:"longRunningOperationStates,omitempty"`
}

// AzureMachinePoolStatus defines the observed state of AzureMachinePool.
type AzureMachinePoolStatus struct {
    /*
      Other fields omitted for brevity
    */

    // LongRunningOperationStates saves the states for Azure long-running operations so they can be continued on the
    // next reconciliation loop.
    // +optional
    LongRunningOperationStates Futures `json:"longRunningOperationStates,omitempty"`
}

// AzureMachinePoolMachineStatus defines the observed state of AzureMachinePoolMachine.
type AzureMachinePoolMachineStatus struct {
    /*
      Other fields omitted for brevity
    */

    // LongRunningOperationStates saves the states for Azure long-running operations so they can be continued on the
    // next reconciliation loop.
    // +optional
    LongRunningOperationStates Futures `json:"longRunningOperationStates,omitempty"`
}
```
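
Because the list of Futures follows a pattern similar to Conditions, controllers would need a way to look up, upsert, and remove an entry. The following is a minimal sketch of such helpers, not the actual CAPZ implementation. The `Get`, `Set`, and `Delete` method names are hypothetical, and keying on `Name` plus `ServiceName` is an assumption made here for illustration. Struct tags are omitted to keep the sketch short.

```go
package main

import "fmt"

// Future mirrors the proposed type (JSON tags omitted for brevity).
type Future struct {
	Type          string
	ResourceGroup string
	ServiceName   string
	Name          string
	Data          string
}

// Futures is the list stored in LongRunningOperationStates.
type Futures []Future

// Get returns the Future for the given resource name and service, or nil.
// (Hypothetical helper; keying on name + service is an assumption.)
func (f Futures) Get(name, service string) *Future {
	for i := range f {
		if f[i].Name == name && f[i].ServiceName == service {
			return &f[i]
		}
	}
	return nil
}

// Set inserts the Future, replacing an existing entry with the same key.
func (f *Futures) Set(future Future) {
	if existing := f.Get(future.Name, future.ServiceName); existing != nil {
		*existing = future
		return
	}
	*f = append(*f, future)
}

// Delete removes the entry for the given key, if there is one.
func (f *Futures) Delete(name, service string) {
	kept := make(Futures, 0, len(*f))
	for _, future := range *f {
		if future.Name == name && future.ServiceName == service {
			continue
		}
		kept = append(kept, future)
	}
	*f = kept
}

func main() {
	var states Futures
	states.Set(Future{Type: "PUT", ServiceName: "virtualmachines", Name: "vm-0", Data: "abc"})
	// A second Set with the same key replaces the entry instead of appending.
	states.Set(Future{Type: "DELETE", ServiceName: "virtualmachines", Name: "vm-0", Data: "def"})
	fmt.Println(len(states), states.Get("vm-0", "virtualmachines").Type) // prints "1 DELETE"
	states.Delete("vm-0", "virtualmachines")
	fmt.Println(len(states)) // prints "0"
}
```

Replacing on `Set` keeps at most one in-flight operation per resource, which matches the idea of `Name` acting as the identifier of a Future.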

### <a name='ProposedControllerChanges'></a>Proposed Controller Changes

#### <a name='Contexttimeouts'></a>Context timeouts

* Reduce the global controller reconcile loop context timeout to 15 seconds (currently 90 minutes).
* For each Azure service reconcile, add a local context timeout of 5 seconds.
* Add an `AzureClientTimeout`, the duration after which an Azure operation is considered a long-running operation that should be handled asynchronously. The proposed starting value is 2 seconds.
* For each Azure API call that returns a Future, wait up to `AzureClientTimeout` for the operation to complete. If the operation does not complete within that duration, store the Future for that resource in `LongRunningOperationStates` with the marshalled future data.
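
The last bullet can be sketched with a derived context, under some assumptions. The `poller` interface below is a hypothetical stand-in for an Azure SDK future, not a real SDK type, and `waitOrSave`, `fakeOp`, and `runOnce` are illustrative names. The sketch waits up to a client timeout and, if only that short timeout fires, returns the marshalled data so the caller could store it in `LongRunningOperationStates`.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// poller is a hypothetical stand-in for an Azure SDK future.
type poller interface {
	// Wait blocks until the operation completes or ctx is done.
	Wait(ctx context.Context) error
	// Marshal returns opaque data that can be used to resume the operation.
	Marshal() (string, error)
}

// waitOrSave waits up to clientTimeout for op to finish. If op is still
// running when the timeout fires, it returns the marshalled future data and
// done=false so the caller can persist it and check again on the next loop.
func waitOrSave(ctx context.Context, op poller, clientTimeout time.Duration) (data string, done bool, err error) {
	waitCtx, cancel := context.WithTimeout(ctx, clientTimeout)
	defer cancel()

	err = op.Wait(waitCtx)
	if err == nil {
		return "", true, nil
	}
	// Treat the operation as long-running only when the short client timeout
	// fired while the caller's own context is still alive.
	if errors.Is(err, context.DeadlineExceeded) && ctx.Err() == nil {
		data, err = op.Marshal()
		return data, false, err
	}
	return "", false, err
}

// fakeOp simulates an operation that completes after a fixed duration.
type fakeOp struct{ duration time.Duration }

func (f fakeOp) Wait(ctx context.Context) error {
	select {
	case <-time.After(f.duration):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func (f fakeOp) Marshal() (string, error) { return "opaque-future-data", nil }

// runOnce is a small demo helper around waitOrSave.
func runOnce(opDuration, clientTimeout time.Duration) (string, bool, error) {
	return waitOrSave(context.Background(), fakeOp{duration: opDuration}, clientTimeout)
}

func main() {
	// Short operation: completes within the client timeout.
	data, done, err := runOnce(10*time.Millisecond, 200*time.Millisecond)
	fmt.Println(data, done, err)
	// Long operation: the timeout fires and the future data is returned.
	data, done, err = runOnce(2*time.Second, 200*time.Millisecond)
	fmt.Println(data, done, err)
}
```

The check on the parent context separates "this operation is slow" from "the whole reconcile ran out of time", which the sketch returns as an ordinary error.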

For each Azure service, this is what the new asynchronous reconcile and delete flows will look like:

#### <a name='ServiceReconcile'></a>Service Reconcile

![Figure 1](./images/async-reconcile.png)

#### <a name='ServiceDelete'></a>Service Delete

![Figure 2](./images/async-delete.png)

The diagram below illustrates an end-to-end flow of the proposed AzureCluster Reconcile.

#### <a name='AzureClusterReconcile'></a>AzureCluster Reconcile

![Figure 3](./images/azure-cluster-reconcile.png)

* Note 1: This represents an example AzureCluster reconcile loop. Some additional services may be called for some AzureClusters. Similar concepts apply to the other controllers (e.g. AzureCluster Delete, AzureMachine Reconcile, etc.).
* Note 2: The Resource Group and VNet services can each have only one resource. The other services may have one or more resources to create. For services which have multiple resources to create, the controller will be able to kick off multiple asynchronous operations to create or delete the resources of the same type, assuming they all get started within the local and global context timeouts. This is based on the assumption that no two resources of the same type depend on each other. For example, if there are 3 load balancers to be deleted, all 3 delete operations will be started in the same reconcile loop, even if one or more of the calls doesn't complete within the `AzureClientTimeout`.
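
Note 2 can be illustrated with a loop that keeps going when an individual operation is still in progress. This is a sketch under assumptions, not the CAPZ code: `errNotDone`, `deleteFunc`, and `deleteAll` are hypothetical names, and the callback stands in for a service client that starts or resumes a delete.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotDone is a hypothetical sentinel meaning the operation was started
// (or resumed) but did not complete within the client timeout.
var errNotDone = errors.New("operation not done")

// deleteFunc starts or resumes the deletion of a single resource.
type deleteFunc func(name string) error

// deleteAll starts a delete for every resource of one type. An in-progress
// operation does not stop the loop, so all deletes are started in the same
// reconcile. It returns errNotDone if any operation is still running.
func deleteAll(names []string, del deleteFunc) (inProgress []string, err error) {
	for _, name := range names {
		switch derr := del(name); {
		case derr == nil:
			// Deleted within the timeout, nothing to track.
		case errors.Is(derr, errNotDone):
			inProgress = append(inProgress, name)
		default:
			return inProgress, fmt.Errorf("failed to delete %s: %w", name, derr)
		}
	}
	if len(inProgress) > 0 {
		return inProgress, errNotDone
	}
	return nil, nil
}

func main() {
	slow := map[string]bool{"lb-public": true, "lb-internal": true}
	var started []string
	pending, err := deleteAll([]string{"lb-public", "lb-internal", "lb-outbound"}, func(name string) error {
		started = append(started, name)
		if slow[name] {
			return errNotDone
		}
		return nil
	})
	// All three deletes were started even though two are still running.
	fmt.Println(started, pending, err)
}
```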

### <a name='ProposedNewConditions'></a>Proposed New Conditions

* Set conditions at the end of each controller loop that describe the current state of the object and its associated Azure resources.

The existing conditions before this proposal can be seen [here](https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/v0.5.0/api/v1alpha4/conditions_consts.go). Note that these existing conditions will be left unchanged, and are purposefully left out below.

Part of the proposed changes is to add new conditions for Azure CRDs. More granular conditions, paired with more responsive controllers, will allow for better visibility into the state of each resource. Initially, the following conditions will be added:

```go
// Azure Services Conditions and Reasons.
const (
    // ResourceGroupReadyCondition means the resource group exists and is ready to be used.
    ResourceGroupReadyCondition clusterv1.ConditionType = "ResourceGroupReady"
    // VNetReadyCondition means the virtual network exists and is ready to be used.
    VNetReadyCondition clusterv1.ConditionType = "VNetReady"
    // SecurityGroupsReadyCondition means the security groups exist and are ready to be used.
    SecurityGroupsReadyCondition clusterv1.ConditionType = "SecurityGroupsReady"
    // RouteTablesReadyCondition means the route tables exist and are ready to be used.
    RouteTablesReadyCondition clusterv1.ConditionType = "RouteTablesReady"
    // PublicIPsReadyCondition means the public IPs exist and are ready to be used.
    PublicIPsReadyCondition clusterv1.ConditionType = "PublicIPsReady"
    // NATGatewaysReadyCondition means the NAT gateways exist and are ready to be used.
    NATGatewaysReadyCondition clusterv1.ConditionType = "NATGatewaysReady"
    // SubnetsReadyCondition means the subnets exist and are ready to be used.
    SubnetsReadyCondition clusterv1.ConditionType = "SubnetsReady"
    // LoadBalancersReadyCondition means the load balancers exist and are ready to be used.
    LoadBalancersReadyCondition clusterv1.ConditionType = "LoadBalancersReady"
    // PrivateDNSReadyCondition means the private DNS exists and is ready to be used.
    PrivateDNSReadyCondition clusterv1.ConditionType = "PrivateDNSReady"
    // BastionHostReadyCondition means the bastion host exists and is ready to be used.
    BastionHostReadyCondition clusterv1.ConditionType = "BastionHostReady"
    // InboundNATRulesReadyCondition means the inbound NAT rules exist and are ready to be used.
    InboundNATRulesReadyCondition clusterv1.ConditionType = "InboundNATRulesReady"
    // AvailabilitySetReadyCondition means the availability set exists and is ready to be used.
    AvailabilitySetReadyCondition clusterv1.ConditionType = "AvailabilitySetReady"
    // RoleAssignmentReadyCondition means the role assignment exists and is ready to be used.
    RoleAssignmentReadyCondition clusterv1.ConditionType = "RoleAssignmentReady"

    // CreatingReason means the resource is being created.
    CreatingReason = "Creating"
    // FailedReason means the resource failed to be created.
    FailedReason = "Failed"
    // DeletingReason means the resource is being deleted.
    DeletingReason = "Deleting"
    // DeletedReason means the resource was deleted.
    DeletedReason = "Deleted"
    // DeletionFailedReason means the resource failed to be deleted.
    DeletionFailedReason = "DeletionFailed"
)
```
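
One way these reasons could map onto the outcome of a service reconcile or delete is sketched below. This is an illustration only. `Condition` and `ConditionType` are simplified stand-ins for the Cluster API types, `errInProgress` and `conditionFor` are hypothetical names, and the exact mapping (for example, reporting a deleted resource with status "False") is an assumption rather than something this proposal specifies.

```go
package main

import (
	"errors"
	"fmt"
)

// ConditionType and Condition are simplified stand-ins for the Cluster API
// condition types; they are not the real clusterv1 definitions.
type ConditionType string

type Condition struct {
	Type   ConditionType
	Status string // "True" or "False"
	Reason string
}

const (
	VNetReadyCondition ConditionType = "VNetReady"

	CreatingReason       = "Creating"
	FailedReason         = "Failed"
	DeletingReason       = "Deleting"
	DeletedReason        = "Deleted"
	DeletionFailedReason = "DeletionFailed"
)

// errInProgress is a hypothetical sentinel for an operation that was started
// but has not finished yet.
var errInProgress = errors.New("operation in progress")

// conditionFor maps the result of a service reconcile (deleting=false) or
// delete (deleting=true) to a condition for that service.
func conditionFor(t ConditionType, deleting bool, err error) Condition {
	c := Condition{Type: t, Status: "False"}
	switch {
	case errors.Is(err, errInProgress) && deleting:
		c.Reason = DeletingReason
	case errors.Is(err, errInProgress):
		c.Reason = CreatingReason
	case err != nil && deleting:
		c.Reason = DeletionFailedReason
	case err != nil:
		c.Reason = FailedReason
	case deleting:
		c.Reason = DeletedReason
	default:
		// The resource exists and is ready to be used.
		c.Status = "True"
	}
	return c
}

func main() {
	fmt.Println(conditionFor(VNetReadyCondition, false, errInProgress))
	fmt.Println(conditionFor(VNetReadyCondition, false, errors.New("quota exceeded")))
	fmt.Println(conditionFor(VNetReadyCondition, false, nil))
	fmt.Println(conditionFor(VNetReadyCondition, true, nil))
}
```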

### <a name='OpenQuestions'></a>Open Questions

#### <a name='Whatshouldthetimeoutdurationsbe'></a>1. What should the timeout durations be?

The specific numbers are not set in stone, and should be revised after doing some performance testing with different values and calculating the P99 expected completion time of operations that are not long-running.

The other question is whether we should use the same timeout value for all operations (the 5s) or tune it per operation. For simplicity, the proposal is to start with a single value. Later on, we might want to optimize by calculating a dynamic timeout value for each operation based on heuristics. That would be better than statically defining artificial timeout durations for each operation, which might vary over time and might differ by region, subscription, etc.

#### 2. How to handle transient errors in logs?

The idea of short-circuiting the Reconcile loop when a long-running operation is in progress involves returning an error when an operation is not done. This means that the Reconcile loop will end in an error every time an operation is in progress. This is necessary because we need to requeue so that the reconcile loop can run again to check on the progress of the operation, but it also means that the user will see the error message in the logs. How can we handle these transient errors without spamming the logs and drowning out actual reconcile errors?
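
To make the trade-off concrete, here is one option that could be explored. It is not a decision made by this proposal. The sketch gives the "not done" case its own error type so the controller can turn it into a plain requeue instead of a logged error. `operationNotDoneError`, `result`, and `handle` are hypothetical names, and `result` is only a stand-in for the controller-runtime reconcile result.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// operationNotDoneError is a hypothetical error type returned by a service
// when a long-running operation is still in progress.
type operationNotDoneError struct{ resource string }

func (e operationNotDoneError) Error() string {
	return fmt.Sprintf("operation on %s is not done", e.resource)
}

// errQuota is an example of a real failure that should still be surfaced.
var errQuota = errors.New("quota exceeded")

// result is a stand-in for the controller-runtime reconcile result.
type result struct{ requeueAfter time.Duration }

// handle translates the error from a reconcile pass. An in-progress operation
// becomes a requeue with a nil error, so it is not reported as a failure.
// Any other error is passed through unchanged.
func handle(err error, requeueAfter time.Duration) (result, error) {
	var notDone operationNotDoneError
	if errors.As(err, &notDone) {
		return result{requeueAfter: requeueAfter}, nil
	}
	return result{}, err
}

func main() {
	// errors.As also matches when the error has been wrapped with %w.
	wrapped := fmt.Errorf("reconciling VM: %w", operationNotDoneError{resource: "vm-0"})
	fmt.Println(handle(wrapped, 15*time.Second))
	fmt.Println(handle(errQuota, 15*time.Second))
}
```

Whether a silent requeue is preferable to a visible error is exactly the open question. The sketch only shows that the two cases can be told apart.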

## <a name='Alternatives'></a>Alternatives

### <a name='ParallelreconciliationofAzureservices'></a>Parallel reconciliation of Azure services

The idea would be to start multiple Azure operations in parallel. This could be done either by defining a dependency graph or by starting all operations in parallel and retrying the ones that fail until they all succeed.

#### <a name='Pros'></a>Pros

- Reduces the overall time it takes to do a full reconcile

#### <a name='Cons'></a>Cons

- Most of the resources have dependencies on one another which means they have to be created and deleted serially, so the actual gain we get from parallelizing is minimal.
- Added complexity and maintenance of the dependency graph.
- If not using a dependency graph, sending bad requests to Azure would increase the number of API calls and possibly cause a busy signal from the Azure APIs.

#### <a name='Conclusion'></a>Conclusion

This is not mutually exclusive with the proposal above. In fact, it might be a good idea to do both in the long run. However, the gains from parallelizing the operations are minimal compared to what we can get by not blocking on long-running operations, so we should first make resource creation and deletion async, then evaluate whether we need further performance improvements.

## <a name='AdditionalDetails'></a>Additional Details

### <a name='TestPlan'></a>Test Plan

* Unit tests to validate the proper handling of Futures in the various CRD Status fields.
* Existing e2e tests for create, upgrade, scale down / up, and delete.

## <a name='ImplementationHistory'></a>Implementation History

- 2020/12/04: Initial POC [PR](https://github.com/kubernetes-sigs/cluster-api-provider-azure/pull/1067) for AzureMachinePool opened
- 2021/07/16: Initial proposal