---
title: Bootstrap failure detection
authors:
  - "@CecileRobertMichon"
  - "@jackfrancis"
reviewers:
  - "@devigned"
  - "@nader-ziada"
creation-date: 2020-07-28
last-updated: 2020-12-14
status: implementable
see-also:
  - https://github.com/kubernetes-sigs/cluster-api/issues/2554
  - https://github.com/kubernetes-sigs/cluster-api-provider-azure/issues/603
---

# Bootstrap failure detection

## Summary

The status of VM bootstrap operations (cloud-init, kubeadm) is opaque from the perspective of the cluster-api resources that represent those VMs.

## Motivation

### Goals
- Assist programmatic consumers and end users in determining when and why bootstrapping failed
- Enable management clusters to determine bootstrapping status
- Solve this for the Azure provider, following the generic bootstrap status interfaces defined by cluster-api
- Be compatible with all bootstrap providers
- Work for both Linux and Windows

### Non-Goals / Future Work
- Implement full cloud-init/kubeadm log stream data as a part of the cluster-api/capz resource.

## Available options for Cluster API Provider Azure

### Option 1: Enable VM boot diagnostics
Azure VMs and VMSS support a boot diagnostics feature which streams cloud-init logs and boot-time output into a storage account. This would allow log collection for some aspects of bootstrapping (at least cloud-init logs).
See https://learn.microsoft.com/azure/virtual-machines/boot-diagnostics and https://github.com/kubernetes-sigs/cluster-api-provider-azure/issues/606

#### Enable VM Boot Diagnostics Pros:
- Low effort (this VM feature gets us basic logs for free once we enable it)
- All details of VM and OS bootstrapping are persisted
- Available for Linux and Windows

#### Enable VM Boot Diagnostics Cons:
- No control over the way the log output is rendered
- Requires a storage account
- Additional IaaS cost
- Would not expose diagnostics to cluster components; they have to be consumed directly from the VM
- Can’t be used exclusively to solve the “determine bootstrap status” goal; it would have to be used in concert with an additional controller implementation that synthesizes the boot diagnostics output

### Option 2: Pub/Sub model using Azure Service Bus
Similar to the "Simple Notification Service & Simple Queue Service" solution for AWS above.
For more info see https://learn.microsoft.com/azure/service-bus-messaging/ and https://github.com/Azure/azure-service-bus-go

#### Pub Sub Pros:
- Extensible, can support lots of client patterns, e.g.:
  - Bootstrap process publishes logs and the management cluster consumes them
  - AzureMachine reconciliation can detect and possibly deal with certain failure conditions

#### Pub Sub Cons:
- Complicated, lots of additional moving pieces:
  - Might have to write our own OS-specific tools to consume bootstrap logs and publish them, and those tools would be installed by additional VM Extensions or additional cloud-init configuration
- Additional IaaS cost

### Option 3: Azure Custom Script Extensions
The Custom Script Extension downloads and executes scripts on Azure virtual machines (and VMSS instances). We could leverage extensions to run kubeadm init/join commands (i.e., move the "runcmd" content from cloud-init to a custom script extension).
This is useful because you can control the exit code for VM Extensions, which allows for better error reporting than cloud-init. The max script size is also 256 KB (vs 64 KB for user data). This does not collect logs, but part of the extension could be to export the logs externally (to a storage account, for example). The extension could also be used purely for checking bootstrapping status (i.e., cloud-init runcmd still runs the init/join) and exporting logs.
https://learn.microsoft.com/azure/virtual-machines/extensions/custom-script-linux
https://learn.microsoft.com/azure/virtual-machines/extensions/custom-script-windows

#### Custom Script Extension Pros:
- Generic and flexible for both Linux and Windows; can execute essentially any arbitrary code (so long as it’s under the size limits defined above)

#### Custom Script Extension Cons:
- You may only have one Custom Script Extension per VM, so using this interface to implement bootstrap failure detection means we are not able to expose the Custom Script Extension VM feature to users as a “general purpose” script interface

### Option 4: VM run command
https://learn.microsoft.com/azure/virtual-machines/linux/run-command
This is similar to the above idea of using a custom script extension, but instead of deploying an additional VM extension resource, the Run Command feature uses the virtual machine (VM) agent to run shell scripts within an Azure VM. This works without requiring RDP/SSH access to the VM.

#### VM runcmd Pros:
- Relatively simple; we already have mature Azure SDK patterns in the AzureMachine controller implementation, so it would not be much additional work to incorporate a runcmd operation against node VMs

#### VM runcmd Cons:
- Limited stdout from runcmd request output
- As with the VM Boot Diagnostics solution, this requires the AzureMachine controller to actually synthesize the runcmd (or multiple runcmd) result(s) into a terminal state outcome

### Option 5: Custom capz-specific Azure VM Extension (recommended)
A custom Azure VM Extension is a small, purpose-built component that performs a narrowly scoped set of tasks on a VM as it bootstraps. We could use this to implement capz-focused bootstrap failure reporting, to support both investigation and remediation.

#### Custom capz VM Extension Pros:
- The same generic flexibility (although rendered as a concrete solution, not at runtime) as the Custom Script Extension option above, while keeping the Custom Script Extension available for future, user-configurable solutions
- Exposes a convenient binary success/failure as part of the Azure VM resource itself
- Easy to query from the AzureMachine controller
- Easy for users to introspect via Azure APIs, CLI, portal
- Named property allows for convenient disambiguation from generic Azure (or other non-capz-related) errors

#### Custom capz VM Extension Cons:
- Non-trivial (though one-time) administrative overhead to create a named Azure VM Extension
- Ongoing maintenance requires a separate release workflow compared to capz (in other words, we don’t simply ship changes to this *with* capz releases)

### Option 6: postKubeadmCommand
This is an option that could be used for kubeadm bootstrap-provided solutions only, which exposes an array interface to execute a set of arbitrary, serialized shell statements (essentially a thin wrapper around cloud-init’s runcmd interface) after
kubeadm finishes.

#### postKubeadmCommand Pros:
- The interface is already present, assuming we only want to solve this for the kubeadm bootstrap provider

#### postKubeadmCommand Cons:
- *only* works for kubeadm bootstrap provider
- Sort of breaks the UX contract for postKubeadmCommand, which is intended to be a user-configurable interface and not reserved for use by cluster-api controllers

## Conclusions
A few conclusions surfaced when exploring these options:

1. Evaluating simple success/failure of VM bootstrapping is most easily done on the VM itself, because under no scenarios is there an option *not* to source some of the relevant input data from the VM. And because we can’t avoid establishing a connection to the VM’s filesystem, it simplifies things greatly to do that locally via a process/daemon running on the VM.
2. The actual implementation that determines “did I bootstrap successfully?” should be defined by each bootstrap provider, as each provider has its own files/operational conditions to validate. The validation on the Azure side should be as minimal as possible and delegate all responsibility of running checks to the bootstrap provider.
3. We need to support Linux and Windows, and though there is one convenience (VM Boot Diagnostics) that may allow us to get a common result across both OSes “for free”, in practice there is enough heterogeneity at all layers (VM, OS, potentially even capi) that we should expect to have to maintain a discrete set of implementations for each platform. So we want to choose a solution that makes supporting both Linux and Windows distinctly natural.

The most sensible solution would be to reuse the existing CustomScriptExtension interface that can be attached to both Windows and Linux VMs. But the fact that VMs may only support a single CustomScriptExtension is a non-trivial problem, as it removes that configuration vector for users.
That vector can be a powerful configuration option (especially when paired with custom OS images) for delivering regular runtime functionality to the underlying Azure VM running as a Kubernetes node. In emergency scenarios in particular, being able to quickly “patch” a node’s Azure VM implementation through this interface can save users many hours they would otherwise spend waiting for a new OS image or, worse, a new VHD publication.

So, given that we don’t want to “reserve” the CustomScriptExtension VM interface for capz, thus preventing users from using it more generically and flexibly (as it’s intended to be used), we propose curating a capz-specific Azure VM Extension dedicated to running on the VM during provisioning and evaluating the success/fail state of its bootstrap operation(s) towards joining a capz-enabled Kubernetes cluster.

At a very high level, this is what we want our capz-named Azure VM Extension to do:

- Wait for a configurable time duration to validate the minimum necessary to determine bootstrap success/fail
  - This would require updating the CAPI bootstrap provider contract to include a signal (such as a sentinel file) on the VM to indicate that all bootstrap operations have finished successfully
- When a terminal state has been reached, return an appropriate exit code to the Azure VM Extension itself
  - At a minimum we will return a binary (e.g., 0 for success, 1 for failure) exit state
- If a terminal state has not been reached before the configurable timeout expires, return an appropriate failure exit code
  - Again, we assume using a common exit code for all failure states is acceptable for the initial scope of this work
- Set appropriate AzureMachine (and possibly Machine?) conditions

VM Boot Diagnostics should be used in conjunction with the extension. The VM extension provides a simple pass/fail signal that can be used by CAPZ to set conditions and indicate bootstrap status.
Boot Diagnostics can give the user a quick look at what went wrong by displaying cloud-init logs without needing to SSH into the VM. In the future, boot diagnostics might even be used to stream logs programmatically at the AzureMachine level.

## Questions
- Can the custom Azure VM Extension be overloaded to solve for both the Windows and Linux case at runtime? In other words, can we publish a single Extension that will be able to easily choose the Windows or Linux path depending upon the OS type of the VM it’s attached to?
  - No, a separate extension has to be published for Linux and Windows.
- Will the extension code be open-source?
  - Yes, the CAPZ extension will be a clone of the [custom script extension](https://github.com/Azure/custom-script-extension-linux).
- Will the extension need to be republished often?
  - No. Once the extension has been published, we don't expect to have to republish it unless code defects are found in the extension itself. The script run by the extension will live in the cluster-api-provider-azure repository and can be updated without changing the extension itself.
- Will the extension be available in all Azure regions and clouds?
  - Yes. At first, the extension will be available in all Azure Public Cloud regions. Shortly after, it will be published in other clouds.
- Does this proposed solution work for both VMs and VMSS?
  - Yes. Scale sets can have a common extension that runs on all instances.
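
The extension behavior outlined in the Conclusions (wait up to a configurable duration for a bootstrap-provider signal such as a sentinel file, then return a binary exit code) might be sketched roughly as follows. This is a minimal Python illustration, not the CAPZ extension's actual script: the function name `wait_for_sentinel`, the polling interval, and whatever sentinel path is passed in are assumptions, since this proposal does not fix any of them.

```python
import os
import time

EXIT_SUCCESS = 0  # sentinel found: bootstrap reported success
EXIT_FAILURE = 1  # single failure code for all failure states (initial scope)


def wait_for_sentinel(path, timeout_s, poll_interval_s=5.0):
    """Poll for a sentinel file until it exists or timeout_s elapses.

    `path` is a hypothetical location written by the bootstrap provider
    once all bootstrap operations have finished successfully.
    Returns EXIT_SUCCESS if the file appears in time, else EXIT_FAILURE.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if os.path.exists(path):
            return EXIT_SUCCESS
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            # No terminal state was reached before the configurable timeout.
            return EXIT_FAILURE
        # Never sleep past the deadline.
        time.sleep(min(poll_interval_s, remaining))
```

An extension entry point could pass the function's return value to `sys.exit`, so that the extension's own result reflects the bootstrap outcome and the AzureMachine controller can translate it into a condition. All failure states share one exit code here, matching the initial scope described in the Conclusions.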