Machine Birthing
================
Richard Gooch
-------------

Background
==========

Growing machine capacity in a datacentre environment is often done by rolling in multiple racks of machines, wiring them in and powering them up. Once powered up, it is common for operations staff to use scripts and other automation tools to *birth* (install and configure) the machines. These automation tools typically build on top of other tools which were designed to birth a single machine (i.e. boot from an installation CD/ISO image). The layers of tools can make the birthing process less reliable and efficient, and leave the machine in a state where it is ready for further configuration rather than being ready for actual useful work. These tools often neglect other aspects of the machine life-cycle, such as automated repairs.

This document describes the design of a fully automated, robust, reliable and efficient architecture for (re)birthing machines at large scale. The design target is that 100 racks of machines can be turned on and within an hour all the new machines are available for real work, *without any further human intervention* or any preparatory software configuration. The software system that will implement this architecture is called the **Birther**.

The **Birther** system depends on the [**Dominator**](../Dominator/README.md) system, which is likely to be the limiting factor in how quickly machines can be made available for real work. A more focussed design target for the **Birther** system is that it can **sub**ject machines to **Domination** at a higher rate (machines per second) than the **Dominator** can complete full system updates.

High-level Design
=================

The system comprises the following components:

- a **M**achine **D**ata**B**ase (**MDB**) which lists all the machines in the fleet and their properties

- a **Birther** machine which responds to PXE boot requests from machines

- a **Boot Server** (containing a DHCP and TFTP server), which is used to install a tiny **Bootstrap Image**

- a **Bootstrap Image** which configures the machine, enters it into the **MDB** and enables the machine for **Domination**

- a [**Dominator**](../Dominator/README.md) system which is used to install fully configured, workload-ready images

The following diagram shows how these components are connected:

The MDB
-------

The **MDB** is the sole source of truth which defines the intended state of the fleet. It lists all the known machines in the fleet and records the name, IP address, MAC address, *required* system image, repair state and so on.

The Birther
-----------

The **Birther** listens for PXE boot requests from any machine, and consults the **MDB** to determine what kind of response to send. In all cases, a response is sent. The following **MDB** states are defined:

- *unknown*: the system is not yet known in the **MDB**

- *birth*: a temporary private IP address is assigned and the PXE response instructs the machine to load and boot the **Bootstrap Image**

- *healthy*: the system is known in the **MDB** and is considered healthy. The PXE response instructs the machine to boot from local media. This is an optimisation that dramatically decreases system reboot time, as the machine does not have to wait for the PXE boot timeout before booting from local media

- *rebirth*: the system is known in the **MDB** and is in need of a software repair (a **rebirth**). The permanent IP address is assigned and the PXE response instructs the machine to load and boot the **Bootstrap Image**

- *clean*: the system is known in the **MDB** and needs to be cleaned (old data removed). The permanent IP address is assigned and the PXE response instructs the machine to load and boot the **Fast Bootstrap Image**

The **Birther** stores PXE boot request and response statistics in the **MDB** so that persistently failing machines can be detected.

The Boot Server
---------------

The **Boot Server** contains a DHCP and TFTP server and serves requests for the **Bootstrap Image**. It will respond to requests on the private IP network used for temporary addresses as well as requests on the main IP network for permanent addresses. The Hypervisor in SmallStack (part of the [**Dominator**](../Dominator/README.md) ecosystem) contains a **Boot Server** which is integrated with the ecosystem (including image building and distribution). Consult your favourite search engine for generic implementations.

The Bootstrap Image
-------------------

The **Bootstrap Image** contains:

- a generic kernel

- a small compressed file-system which contains:

    - a configuration tool, which is run as the *init* process

    - a copy of **subd** from the **Dominator** system and a Certificate Authority file

The configuration tool performs initial setup and then hands the machine over to the **Dominator**.

### The Fast Bootstrap Image

This is the same as the **Bootstrap Image** except that a burn-in test is not performed.

The Miracle of Birth
====================

Consider the first power on of a machine.
The following sequence will ensue:

- the machine will broadcast a PXE boot request

- the **Birther** system will consult the **MDB** and see that the machine is *unknown*

- the **Birther** will assign a temporary private IP address and create a new machine entry with state *birth* in the **MDB**, recording the MAC address and the assigned IP address. It will then send a PXE response to instruct the machine to load and boot the **Bootstrap Image**

- the machine will boot the kernel in the image

- the kernel will probe the machine hardware and then start the configuration tool, which will:

    - start a watchdog process that talks to a hardware watchdog device

    - run a burn-in stress and performance test

    - probe the network (using an LLDP query tool or similar) to determine its physical position in the rack and use this information to compute its hostname and permanent IP address

    - scan the machine hardware and compute a preferred image based on burn-in test results, storage capacity, memory and number of CPUs. Examples of the image types that may be selected are:

        - compute node

        - storage node

        - debug image (if the burn-in test failed)

    - generate random encryption keys for the storage media and store them in NVRAM (discarding any old keys stored there, effectively wiping the media of any old data)

    - partition storage devices

    - create file-systems

    - set up a boot loader

    - mount and populate the new root file-system with system configuration data (/etc/fstab, hostname, network configuration, etc.)

    - copy **subd** and the Certificate Authority file to the root file-system

    - issue a request to the **MDB** to update its entry with the hostname, IP address and system image, and to set its state to *healthy*

    - if the **MDB** change is successful, change the network configuration to the permanent IP address, change to the new root directory and transfer control to **subd**. At this point the machine is fully **sub**jugated

- the **Dominator** will see the new **sub** appear in the **MDB** and will install the system image. The **Dominator** will see that the **sub** is essentially empty and will direct the **sub** to fetch files at maximum speed

- the **sub** will see that the kernel is being updated (since there are no kernel files currently on the system) and will reboot once the update is complete

- the **Birther** will see a PXE boot request from the machine, will see that the machine is listed in the **MDB** and is *healthy*, and will instruct the machine to boot from local media

- the **sub** will boot its image. Assuming the image is appropriately configured, it is now ready to perform work

Repairing (rebirthing) Machines
-------------------------------

If a machine is found to be persistently failing (e.g. stuck in a reboot loop), a separate automated system may decide that a **rebirthing** is required. If so, that system will set the state of the machine in the **MDB** to *rebirth* and on the next reboot the **Birther** will send a PXE boot response to boot the **Bootstrap Image**. The flow is almost the same as above for **birthing** machines, with the following exception:

- the permanent IP address is used in the PXE response

The means of detecting unhealthy machines, determining how sick they are and deciding the steps required to heal them is the topic of another paper about **Machine Lifecycle Management**. The **Birther** and the **Dominator** are foundational components in a larger system.

Cleaning Machines
-----------------

**Cleaning** a machine is almost identical to **rebirthing**, except that the burn-in test is not performed. This is useful when a machine is re-assigned to a different owner, as it ensures that any potentially sensitive data are removed before the machine is available to the new owner. The burn-in test is not needed (the machine is *healthy*), so it is best to avoid that step (which can take many minutes or even hours, depending on how exhaustive the test is). A fast re-assignment facilitates building a responsive Metal as a Service system, if so desired.

In the simplest case, data can be “cleaned” by re-making the file-systems. This limits the potential for data exfiltration to more advanced attackers. If the secure encryption features of the storage media are used, throwing away the old encryption keys is a fast and effective way to erase the storage media.

Calculating Performance Targets
===============================

One of the limitations on birthing machines is how quickly they can fetch the **Bootstrap Image** from the **Boot Server**. Consider the following environment:

- 1 GB/s (10 Gb/s) network

- 10 MB **Bootstrap Image**

- 1 GB system image

In this environment, the **Boot Server** should be able to service 100 fetches per second. This is much faster than the rate at which the **Dominator** can perform full system updates (its limit is 1 machine per second, assuming it does not have any peer-to-peer enhancements). Clearly, optimising the **Birther** system would be premature, and will probably never be needed.
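
The arithmetic behind these figures can be sketched in a few lines of Python. This is only a back-of-the-envelope estimate under stated assumptions: decimal units (1 MB = 10^6 bytes), a single server link as the sole bottleneck, and no protocol overhead. The function and constant names are hypothetical, and reading the **Dominator** limit as one full system image per second over the same link is an interpretation of the figures above, not a measured result.

```python
# Back-of-the-envelope throughput estimate using the figures from the text.
# Assumptions: decimal units, the server's network link is the only
# bottleneck, and protocol overhead is ignored.
NETWORK_BYTES_PER_SECOND = 1_000_000_000  # 1 GB/s (10 Gb/s) network
BOOTSTRAP_IMAGE_BYTES = 10_000_000        # 10 MB Bootstrap Image
SYSTEM_IMAGE_BYTES = 1_000_000_000        # 1 GB system image
SECONDS_PER_HOUR = 3600


def transfers_per_second(link_bytes_per_second: int, image_bytes: int) -> float:
    """Upper bound on complete image transfers per second over one link."""
    return link_bytes_per_second / image_bytes


# Rate at which the Boot Server can hand out the Bootstrap Image.
boot_server_rate = transfers_per_second(NETWORK_BYTES_PER_SECOND, BOOTSTRAP_IMAGE_BYTES)

# Rate at which full system images can be pushed (no peer-to-peer distribution).
dominator_rate = transfers_per_second(NETWORK_BYTES_PER_SECOND, SYSTEM_IMAGE_BYTES)

# Machines that could receive a full system image within the one-hour target.
machines_per_hour = dominator_rate * SECONDS_PER_HOUR

print(f"Boot Server: {boot_server_rate:.0f} fetches per second")
print(f"Dominator:   {dominator_rate:.0f} full updates per second")
print(f"Headroom:    {boot_server_rate / dominator_rate:.0f}x")
print(f"Full updates possible in one hour: {machines_per_hour:.0f}")
```

Under these assumptions the **Boot Server** has two orders of magnitude of headroom over the **Dominator**, so the full system image transfer, not the **Bootstrap Image** fetch, bounds how many machines can be made ready within the hour.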