sigs.k8s.io/cluster-api-provider-azure@v1.14.3/docs/proposals/20201214-bootstrap-failure-detection.md

     1  ---
     2  title: Bootstrap failure detection
     3  authors:
     4    - "@CecileRobertMichon"
     5    - "@jackfrancis"
     6  reviewers:
     7    - "@devigned"
     8    - "@nader-ziada"
     9  creation-date: 2020-07-28
    10  last-updated: 2020-12-14
    11  status: implementable
    12  see-also:
    13  - https://github.com/kubernetes-sigs/cluster-api/issues/2554
    14  - https://github.com/kubernetes-sigs/cluster-api-provider-azure/issues/603
    15  ---
    16  
    17  
    18  # Bootstrap failure detection
    19  
    20  
    21  ## Summary
    22  
    23  The status of VM bootstrap operations (cloud-init, kubeadm) is opaque from the perspective of cluster-api resources that represent those VMs.
    24  
    25  ## Motivation
    26  
    27  ### Goals
    28  - Assist programmatic consumers and end users in determining when and why bootstrapping failed
    29  - Enable management clusters to determine bootstrapping status
    30  - Solve this for the Azure provider by following the generic bootstrap status interfaces defined by cluster-api
    31  - Be compatible with all bootstrap providers
    32  - Work for both Linux and Windows
    33  
    34  ### Non-Goals / Future Work
    35  - Implement full cloud-init/kubeadm log stream data as a part of the cluster-api/capz resource.
    36  
    37  ## Available options for Cluster API Provider Azure
    38  
    39  ### Option 1: Enable VM boot diagnostics
    40  Azure VMs and VMSS support a boot diagnostics feature that streams cloud-init logs and boot-time output into a storage account. This would allow log collection for some aspects of bootstrapping (at least the cloud-init logs).
    41  See https://learn.microsoft.com/azure/virtual-machines/boot-diagnostics and https://github.com/kubernetes-sigs/cluster-api-provider-azure/issues/606
    42  
    43  #### Enable VM Boot Diagnostics Pros:
    44  - Low effort (this VM feature gets us basic logs for free once we enable it)
    45  - All details of VM and OS bootstrapping are persisted
    46  - Available for Linux and Windows
    47  
    48  #### Enable VM Boot Diagnostics Cons:
    49  - No control over how the log output is rendered
    50  - Requires a storage account
    51  - Additional IaaS cost
    52  - Does not expose diagnostics to cluster components; they have to be consumed directly from the VM
    53  - Cannot be used on its own to meet the “determine bootstrap status” goal; it would have to be paired with additional controller logic that interprets the boot diagnostics output
    54  
    55  ### Option 2: Pub/Sub model using Azure Service Bus
    56  Similar to the "Simple Notification Service & Simple Queue Service" solution for AWS.
    57  For more info see https://learn.microsoft.com/azure/service-bus-messaging/ and https://github.com/Azure/azure-service-bus-go
    58  
    59  #### Pub Sub Pros:
    60  - Extensible, can support lots of client patterns, e.g.:
    61    - Bootstrap process publishes logs and the management cluster consumes them
    62    - AzureMachine reconciliation can detect and possibly deal with certain failure conditions
    63  
    64  #### Pub Sub Cons:
    65  - Complicated, lots of additional moving pieces:
    66    - Might have to write our own OS-specific tools to consume bootstrap logs and publish them, and those tools would be installed by additional VM Extensions or additional cloud-init configuration
    67  - Additional IaaS cost
    68  
    69  ### Option 3: Azure Custom Script Extensions
    70  The Custom Script Extension downloads and executes scripts on Azure virtual machines (and VMSS instances). We could leverage the extension in one of two ways. 1) Run the kubeadm init/join commands from it (i.e., move the "runcmd" content from cloud-init to a custom script extension). This is useful because you can control the exit code for VM Extensions, which allows for better error reporting than cloud-init. The max script size is also larger: 256 KB (vs 64 KB for user data). This does not collect logs, but part of the extension could export the logs externally (to a storage account, for example). 2) Use the extension purely for checking bootstrapping status (i.e., cloud-init runcmd still runs the init/join) and exporting logs.
    71  https://learn.microsoft.com/azure/virtual-machines/extensions/custom-script-linux
    72  https://learn.microsoft.com/azure/virtual-machines/extensions/custom-script-windows
    73  
    74  #### Custom Script Extension Pros:
    75  - Generic and flexible for both Linux and Windows, can basically execute any arbitrary code (so long as it’s under the size limits defined above)
    76  
    77  #### Custom Script Extension Cons:
    78  - You may only have one Custom Script Extension per VM, so using this interface to implement bootstrap failure detection means we are not able to expose the Custom Script Extension VM feature to users as a “general purpose” script interface
    79  
    80  ### Option 4: VM run command
    81  https://learn.microsoft.com/azure/virtual-machines/linux/run-command
    82  This is similar to the above idea of using a custom script extension but instead of deploying an additional VM extension resource, the Run Command feature uses the virtual machine (VM) agent to run shell scripts within an Azure VM. This works without requiring RDP/SSH access to the VM.
    83  
    84  #### VM runcmd Pros:
    85  - Relatively simple: we already have mature Azure SDK patterns in the AzureMachine controller implementation, so it would not be much additional work to incorporate a runcmd operation against node VMs
    86  
    87  #### VM runcmd Cons:
    88  - Limited stdout from runcmd request output
    89  - As with the VM Boot Diagnostics solution, this requires the AzureMachine controller to actually synthesize the runcmd (or multiple runcmd) result(s) into a terminal state outcome
    90  
    91  ### Option 5: Custom capz-specific Azure VM Extension (recommended)
    92  A custom Azure VM Extension is a small, purpose-built component that performs a narrowly scoped set of tasks on a VM as it bootstraps itself. We could use this to implement capz-focused bootstrap failure reporting, to support both investigation and remediation.
    93  
    94  #### Custom capz VM Extension Pros:
    95  - The same generic flexibility as the Custom Script Extension option above (although rendered as a concrete solution, not at runtime), while keeping the Custom Script Extension available for future, user-configurable solutions
    96  - Exposes a convenient binary success/failure as part of the Azure VM resource itself
    97  - Easy to query from the AzureMachine controller
    98  - Easy for users to introspect via Azure APIs, CLI, portal
    99  - Named property allows for convenient disambiguation from generic Azure (or other non-capz-related) errors
   100  
   101  #### Custom capz VM Extension Cons:
   102  - Non-trivial (though one-time) administrative overhead to create a named Azure VM Extension
   103  - Ongoing maintenance requires a separate release workflow compared to capz (in other words, we don’t simply ship changes to this *with* capz releases)
   104  
   105  ### Option 6: postKubeadmCommand
   106  This is an option that could be used for kubeadm bootstrap-provided solutions only, which exposes an array interface to execute a set of arbitrary, serialized shell statements (essentially a thin wrapper around cloud-init’s runcmd interface) after kubeadm finishes.
   107  
   108  #### postKubeadmCommand Pros:
   109  - The interface is already present, assuming we only want to solve this for the kubeadm bootstrap provider
   110  
   111  #### postKubeadmCommand Cons:
   112  - *only* works for kubeadm bootstrap provider
   113  - Sort of breaks the UX contract for postKubeadmCommand, which is intended to be a user-configurable interface and not reserved for use by cluster-api controllers
   114  
   115  ## Conclusions
   116  A few conclusions surfaced when exploring these options:
   117  
   118  1. Evaluating simple success/failure of VM bootstrapping is most easily done on the VM itself, because in every scenario at least some of the relevant input data has to be sourced from the VM. And because we can’t avoid establishing a connection to the VM’s filesystem, it simplifies things greatly to do that locally via a process/daemon running on the VM.
   119  2. The actual implementation that determines “did I bootstrap successfully?” should be defined by each bootstrap provider, as each provider has its own files/operational conditions to validate. The validation on the Azure side should be as minimal as possible and delegate all responsibility of running checks to the bootstrap provider.
   120  3. We need to support Linux and Windows, and though there is one convenience (VM Boot Diagnostics) that may allow us to get a common result across both OSes “for free”, in practice there is enough heterogeneity at all layers (VM, OS, potentially even capi) that we should expect to have to maintain a discrete set of implementations for each platform. So we want to choose a solution that makes supporting both Linux and Windows distinctly natural.
   121  
   122  The most sensible solution would be to reuse the existing CustomScriptExtension interface that can be attached to both Windows and Linux VMs. But the fact that VMs may only support a single CustomScriptExtension is a non-trivial problem, as it removes that configuration vector for users. That vector can be a powerful configuration option (paired with custom OS images) to deliver regular runtime functionality to the underlying Azure VM running as a Kubernetes node. In particular, during emergency scenarios, being able to “patch” your node’s Azure VM implementation quickly using this interface can save many hours for a user who would otherwise have to wait for a new OS image, or worse, a new VHD publication.
   123  
   124  So, given that we don’t want to “reserve” the CustomScriptExtension VM interface for capz, thus preventing users from using it more generically and flexibly (as it’s intended to be used), we want to propose curating a capz-specific Azure VM Extension dedicated to running on the VM during provisioning and evaluating the success/fail state of its bootstrap operation(s) towards joining a capz-enabled Kubernetes cluster.
   125  
   126  At a very high level, this is what we want our capz-named Azure VM Extension to do:
   127  
   128  - Wait for a configurable time duration to validate the minimum necessary to determine bootstrap success/fail
   129    - This would require updating the CAPI bootstrap provider contract to include a signal (such as a sentinel file) on the VM to indicate that all bootstrap operations have finished successfully
   130  - When terminal state has been reached, return an appropriate exit code to the Azure VM Extension itself
   131    - At a minimum we will return a binary (e.g., 0 for success, 1 for failure) exit state
   132  - If a terminal state has not been reached before the configurable timeout expires, return an appropriate failure exit code
   133    - Again, we assume using a common exit code for all failure states is acceptable for the initial scope of this work
   134  - Set appropriate AzureMachine (and possibly Machine?) conditions
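
The wait-and-report behavior listed above can be sketched as follows. This is only an illustration under stated assumptions, not the extension's actual code: the function name, its parameters, and the polling interval are hypothetical, and the proposal does not fix a sentinel path, so the path is left to the caller.

```python
import os
import time

# Binary exit states suggested in the list above (0 = success, 1 = failure).
EXIT_SUCCESS = 0
EXIT_FAILURE = 1


def wait_for_sentinel(path, timeout_s, poll_s=1.0):
    """Poll for a sentinel file until it exists or the timeout expires.

    `path`, `timeout_s`, and `poll_s` are hypothetical parameters; the
    proposal only says the wait duration should be configurable and that
    the bootstrap provider signals success (for example, with a file).
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if os.path.exists(path):
            # The bootstrap provider reported success.
            return EXIT_SUCCESS
        if time.monotonic() >= deadline:
            # No terminal state before the timeout: use the common failure code.
            return EXIT_FAILURE
        # Sleep for the poll interval, but never past the deadline.
        time.sleep(min(poll_s, max(0.0, deadline - time.monotonic())))
```

An extension entry point could pass the returned value to `sys.exit` so that the Azure VM Extension itself reports success or failure.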
   135  
   136  VM Boot Diagnostics should be used in conjunction with the extension. The VM extension provides a simple pass/fail signal that can be used by CAPZ to set conditions and indicate bootstrap status. Boot Diagnostics can give the user a quick look at what went wrong by displaying cloud-init logs without needing to SSH into the VM. In the future, boot diagnostics might even be used to stream logs programmatically at the AzureMachine level.
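
As a rough illustration of how a pass/fail signal could be turned into a condition, the sketch below maps an extension's provisioning state to a condition-like record. The condition type and reason strings are hypothetical names chosen for this example, and the `Condition` class is a stand-in, not a cluster-api type. The sketch assumes the provisioning state is reported as a string such as `Succeeded` or `Failed`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    """Stand-in for a Kubernetes-style condition (not a cluster-api type)."""
    type: str
    status: str  # "True" or "False"
    reason: str = ""


def bootstrap_condition(provisioning_state: str) -> Condition:
    """Map an extension provisioning state to a bootstrap condition.

    The names "BootstrapSucceeded", "BootstrapFailed", and
    "BootstrapInProgress" are hypothetical and used for illustration only.
    """
    state = provisioning_state.strip().lower()
    if state == "succeeded":
        return Condition("BootstrapSucceeded", "True")
    if state == "failed":
        return Condition("BootstrapSucceeded", "False", "BootstrapFailed")
    # Any other state is treated as "not finished yet" in this sketch.
    return Condition("BootstrapSucceeded", "False", "BootstrapInProgress")
```

Treating every non-terminal state as "in progress" keeps the mapping binary at the terminal states, in line with the single failure code assumed for the initial scope.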
   137  
   138  
   139  ## Questions
   140  - Can the custom Azure VM Extension be overloaded to solve for both the Windows and Linux case at runtime? In other words, can we publish a single Extension that will be able to easily choose the Windows or Linux path depending upon the OS type of the VM it’s attached to?
   141      - No, a separate extension has to be published for Linux and Windows.
   142  - Will the extension code be open-source?
   143      - Yes, the CAPZ extension will be a clone of the [custom script extension](https://github.com/Azure/custom-script-extension-linux).
   144  - Will the extension need to be republished often?
   145      - No. Once the extension is published, we don't expect to have to republish it unless code defects are found in the extension itself. The script run by the extension will live in the cluster-api-provider-azure repository and can be updated without changing the extension itself.
   146  - Will the extension be available in all Azure regions and clouds?
   147      - Yes. At first, the extension will be available in all Azure Public Cloud regions. Shortly after, it will be published in other clouds.
   148  - Does this proposed solution work for both VMs and VMSS?
   149      - Yes. Scale sets can have a common extension that runs on all instances.
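
Because a separate extension has to be published for Linux and Windows, whatever attaches the extension would need to pick one based on the VM's OS type. A minimal sketch of that selection follows; the publisher and extension names are made-up placeholders, not the names of any published extension.

```python
# Hypothetical placeholder names; the proposal does not name the extensions.
_EXTENSIONS = {
    "linux": {"publisher": "Example.Publisher", "name": "Example.Linux.Bootstrapping"},
    "windows": {"publisher": "Example.Publisher", "name": "Example.Windows.Bootstrapping"},
}


def bootstrap_extension_for(os_type: str) -> dict:
    """Return the (placeholder) extension spec for the given OS type.

    Raises ValueError for OS types other than Linux or Windows, since the
    proposal only covers those two platforms.
    """
    key = os_type.strip().lower()
    if key not in _EXTENSIONS:
        raise ValueError(f"unsupported OS type: {os_type!r}")
    # Return a copy so callers cannot mutate the shared table.
    return dict(_EXTENSIONS[key])
```

The same lookup would apply to scale sets, where the chosen extension is attached once and runs on all instances.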