
Machine Birthing
================
Richard Gooch
-------------

Background
==========

Growing machine capacity in a datacentre environment is often done by rolling in multiple racks of machines, wiring them in and powering them up. Once the machines are powered up, operations staff commonly use scripts and other automation tools to *birth* (install and configure) them. These automation tools typically build on top of other tools which were designed to birth a single machine (i.e. by booting from an installation CD/ISO image). The layers of tools can make the birthing process less reliable and less efficient, and leave the machine in a state where it is ready for further configuration rather than ready for actual useful work. These tools often neglect other aspects of the machine life-cycle, such as automated repairs.

This document describes the design of a fully automated, robust, reliable and efficient architecture for (re)birthing machines at large scale. The design target is that 100 racks of machines can be turned on and within an hour all the new machines are available for real work, *without any further human intervention* and without any preparatory software configuration. The software system that will implement this architecture is called the **Birther**.

The **Birther** system depends on the [**Dominator**](../Dominator/README.md) system, which is likely to be the limiting factor in how quickly machines can be made available for real work. A more focussed design target for the **Birther** system is that it can **sub**ject machines to **Domination** at a higher rate (machines per second) than the **Dominator** can complete full system updates.

High-level Design
=================

The system comprises the following components:

-   a **M**achine **D**ata**B**ase (**MDB**) which lists all the machines in the fleet and their properties

-   a **Birther** machine which responds to PXE boot requests from machines

-   a **Boot Server** (containing a DHCP and TFTP server), which serves a tiny **Bootstrap Image** to booting machines

-   a **Bootstrap Image** which configures the machine, enters it into the **MDB** and enables the machine for **Domination**

-   a [**Dominator**](../Dominator/README.md) system which is used to install fully configured, workload-ready images

The following diagram shows how these components are connected:
![BirtherSystem Components image](../pictures/BirtherSystemComponents.svg)

The MDB
-------

The **MDB** is the sole source of truth which defines the intended state of the fleet. It lists all the known machines in the fleet and records, for each machine, its name, IP address, MAC address, *required* system image, repair state and so on.
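As a rough illustration of the kind of record described above, the following sketch models an **MDB** entry and a toy in-memory store. The class names, field names and the choice to key entries by MAC address are assumptions made for this example; the document does not define a schema or storage back-end.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Machine:
    """One MDB entry. Field names are illustrative, not a real schema."""
    name: str
    ip_address: str
    mac_address: str
    required_image: str
    state: str  # repair state, e.g. "healthy"


class MDB:
    """Toy in-memory MDB, keyed by MAC address (an assumption of this sketch)."""

    def __init__(self) -> None:
        self._by_mac: Dict[str, Machine] = {}

    def put(self, machine: Machine) -> None:
        """Create or overwrite the entry for this machine."""
        self._by_mac[machine.mac_address.lower()] = machine

    def get(self, mac_address: str) -> Optional[Machine]:
        """Return the entry for a MAC address, or None if it is not known."""
        return self._by_mac.get(mac_address.lower())


mdb = MDB()
mdb.put(Machine("r012-s07", "10.0.12.7", "52:54:00:AB:CD:EF",
                "compute-node", "healthy"))
```

Keying by MAC address is convenient here because a MAC address is what a PXE boot request carries; a real **MDB** may well be indexed differently.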

The Birther
-----------

The **Birther** listens for PXE boot requests from any machine, and consults the **MDB** to determine what kind of response to send. In all cases, a response is sent. The following **MDB** states are defined:

-   *unknown*: the system is not yet known in the **MDB**. A new entry is created with state *birth* and the request is handled accordingly

-   *birth*: a temporary private IP address is assigned and the PXE response instructs the machine to load and boot the **Bootstrap Image**

-   *healthy*: the system is known in the **MDB** and is considered healthy. The PXE response instructs the machine to boot from local media. This is an optimisation that dramatically decreases system reboot time, as the machine does not have to wait for the PXE boot timeout before booting from local media

-   *rebirth*: the system is known in the **MDB** and is in need of a software repair (a **rebirth**). The permanent IP address is assigned and the PXE response instructs the machine to load and boot the **Bootstrap Image**

-   *clean*: the system is known in the **MDB** and needs to be cleaned (old data removed). The permanent IP address is assigned and the PXE response instructs the machine to load and boot the **Fast Bootstrap Image**
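The state list above amounts to a small lookup table. The sketch below transcribes it, assuming hypothetical names (`State`, `Response`, `respond`) and string labels for the address kind and boot target. It treats an *unknown* machine as *birth*, following the birthing sequence described later in this document, and leaves the address for *healthy* machines unset because the text does not state it.

```python
from enum import Enum
from typing import NamedTuple, Optional


class State(Enum):
    UNKNOWN = "unknown"
    BIRTH = "birth"
    HEALTHY = "healthy"
    REBIRTH = "rebirth"
    CLEAN = "clean"


class Response(NamedTuple):
    address: Optional[str]  # "temporary", "permanent", or None if not stated
    boot: str               # "bootstrap", "fast-bootstrap" or "local"


# One row per state, transcribed from the list above.
RESPONSES = {
    State.BIRTH: Response("temporary", "bootstrap"),
    State.HEALTHY: Response(None, "local"),
    State.REBIRTH: Response("permanent", "bootstrap"),
    State.CLEAN: Response("permanent", "fast-bootstrap"),
}


def respond(state: State) -> Response:
    """Return the PXE response for a machine in the given MDB state."""
    if state is State.UNKNOWN:
        # An unknown machine is entered into the MDB as *birth* and is then
        # answered like any other machine in that state.
        state = State.BIRTH
    return RESPONSES[state]
```

Every state maps to exactly one response, which matches the statement that a response is sent in all cases.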

The **Birther** stores PXE boot request and response statistics in the **MDB** so that persistently failing machines can be detected.
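One way such statistics could be used is to count PXE boot requests per machine within a sliding time window. The sketch below is only an illustration: the class name, the threshold of 5 boots and the one-hour window are hypothetical, and the counters are kept in memory rather than in the **MDB**.

```python
from collections import deque
from typing import Deque, Dict


class BootStats:
    """Sliding-window count of PXE boot requests per MAC address."""

    def __init__(self, max_boots: int = 5, window_seconds: float = 3600.0) -> None:
        self.max_boots = max_boots          # hypothetical threshold
        self.window_seconds = window_seconds  # hypothetical window
        self._requests: Dict[str, Deque[float]] = {}

    def record(self, mac_address: str, now: float) -> None:
        """Record a PXE request seen at time `now` (seconds)."""
        times = self._requests.setdefault(mac_address, deque())
        times.append(now)
        # Forget requests that have fallen out of the window.
        while now - times[0] > self.window_seconds:
            times.popleft()

    def is_persistently_failing(self, mac_address: str) -> bool:
        """True if the machine booted more than max_boots times in the window."""
        return len(self._requests.get(mac_address, ())) > self.max_boots


stats = BootStats()
for minute in range(6):  # six boots in six minutes: likely a reboot loop
    stats.record("52:54:00:ab:cd:ef", now=minute * 60.0)
stats.record("52:54:00:00:00:01", now=0.0)
stats.record("52:54:00:00:00:01", now=10_000.0)  # two boots, far apart
```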

The Boot Server
---------------

The **Boot Server** contains a DHCP and TFTP server and serves requests for the **Bootstrap Image**. It will respond to requests on the private IP network used for temporary addresses as well as requests on the main IP network for permanent addresses. The Hypervisor in SmallStack (part of the [**Dominator**](../Dominator/README.md) ecosystem) contains a **Boot Server** which is integrated with the ecosystem (including image building and distribution). Generic DHCP and TFTP server implementations are also widely available.

The Bootstrap Image
-------------------

The **Bootstrap Image** contains:

-   a generic kernel

-   a small compressed file-system which contains:

    -   a configuration tool, which is run as the *init* process

    -   a copy of **subd** from the **Dominator** system and a Certificate Authority file

The configuration tool performs initial setup and then hands the machine over to the **Dominator**.

### The Fast Bootstrap Image

This is the same as the **Bootstrap Image** except that a burn-in test is not performed.

The Miracle of Birth
====================

Consider the first power on of a machine. The following sequence will ensue:

-   the machine will broadcast a PXE boot request

-   the **Birther** system will consult the **MDB** and see that the machine is *unknown*

-   the **Birther** will assign a temporary private IP address and create a new machine entry with state *birth* in the **MDB** recording the MAC address and the assigned IP address. It will then send a PXE response to instruct the machine to load and boot the **Bootstrap Image**

-   the machine will boot the kernel in the image

-   the kernel will probe the machine hardware and then start the configuration tool, which will:

    -   start a watchdog process that talks to a hardware watchdog device

    -   run a burn-in stress and performance test

    -   probe the network (using an LLDP query tool or similar) to determine its physical position in the rack, and use this information to compute its hostname and permanent IP address

    -   scan the machine hardware and compute a preferred image based on burn-in test results, storage capacity, memory and number of CPUs. Examples of the image types that may be selected are:

        -   compute node

        -   storage node

        -   debug image (if the burn-in test failed)

    -   generate random encryption keys for the storage media and store them in NVRAM (discarding any old keys stored there, effectively wiping the media of any old data)

    -   partition storage devices

    -   create file-systems

    -   set up a boot loader

    -   mount and populate the new root file-system with system configuration data (/etc/fstab, hostname, network configuration, etc.)

    -   copy **subd** and the Certificate Authority file to the root file-system

    -   issue a request to the **MDB** to update its entry with the hostname, IP address, system image and set its state to *healthy*

    -   if the **MDB** change is successful, change the network configuration to the permanent IP address, change to the new root directory and transfer control to **subd**. At this point the machine is fully **sub**jugated

-   the **Dominator** will see the new **sub** appear in the **MDB** and will install the system image. The **Dominator** will see that the **sub** is essentially empty and will direct the **sub** to fetch files at maximum speed

-   the **sub** will see that the kernel is being updated (since there are no kernel files currently on the system) and will reboot once the update is complete

-   the **Birther** will see a PXE boot request from the machine, will see that the machine is listed in the **MDB** and is *healthy*, and will instruct the machine to boot from local media

-   the **sub** will boot its image. Assuming the image is appropriately configured, it is now ready to perform work
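Two of the configuration tool's decisions in the sequence above lend themselves to a short sketch: choosing a preferred image and deriving a hostname and permanent IP address from the rack position. Everything specific below is an assumption made for illustration: the 100 TB storage threshold, the image names, the `rNNN-sNN` hostname pattern and the mapping of rack and slot onto a `10.0.0.0/16` network. A real tool would also weigh memory and CPU count, as the text says.

```python
import ipaddress
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Hardware:
    burn_in_passed: bool
    storage_tb: float


def select_image(hw: Hardware) -> str:
    """Pick a preferred image type from the scan results."""
    if not hw.burn_in_passed:
        return "debug"            # failed machines get the debug image
    if hw.storage_tb >= 100:      # hypothetical threshold
        return "storage-node"
    return "compute-node"


def identity(rack: int, slot: int, network: str = "10.0.0.0/16") -> Tuple[str, str]:
    """Derive (hostname, permanent IP) from the position reported by LLDP.

    Hypothetical scheme: each rack gets a block of 256 addresses and the
    slot number selects the address within that block.
    """
    if not 0 < slot < 255:
        raise ValueError("slot out of range")
    net = ipaddress.ip_network(network)
    address = net.network_address + rack * 256 + slot
    if address not in net:
        raise ValueError("rack does not fit in the network")
    return f"r{rack:03d}-s{slot:02d}", str(address)


hostname, ip = identity(rack=12, slot=7)
image = select_image(Hardware(burn_in_passed=True, storage_tb=4.0))
```

Deriving the identity purely from physical position means no per-machine data has to be entered beforehand, which fits the goal of requiring no preparatory software configuration.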

Repairing (rebirthing) Machines
-------------------------------

If a machine is found to be persistently failing (e.g. stuck in a reboot loop), a separate automated system may decide that a **rebirthing** is required. If so, that system will set the state of the machine in the **MDB** to *rebirth* and on the next reboot the **Birther** will send a PXE boot response to boot the **Bootstrap Image**. The flow is almost the same as above for **birthing** machines, with the following exception:

-   the permanent IP address is used in the PXE response

Detecting unhealthy machines, determining how sick they are and choosing the steps required to heal them are the topic of another paper about **Machine Lifecycle Management**. The **Birther** and the **Dominator** are foundational components in a larger system.

Cleaning Machines
-----------------

**Cleaning** a machine is almost identical to **rebirthing**, except that the burn-in test is not performed. This is useful if a machine is re-assigned to a different owner, so that any potentially sensitive data are removed before the machine is available to the new owner. The burn-in test is not needed (the machine is *healthy*), so it is best to avoid that step (which can take many minutes or even hours, depending on how exhaustive the test is). A fast re-assignment facilitates building a responsive Metal as a Service system, if so desired.

In the simplest case, data can be “cleaned” by re-making the file-systems. This limits the potential for data exfiltration to more advanced attackers. If the secure encryption features of the storage media are used, throwing away the old encryption keys is a fast and effective way to erase the storage media.

Calculating Performance Targets
===============================

One of the limitations on birthing machines is how quickly they can fetch the **Bootstrap Image** from the **Boot Server**. Consider the following environment:

-   1 GB/s (10 Gb/s) network

-   10 MB **Bootstrap Image**

-   1 GB system image

In this environment the **Boot Server** should be able to service 100 fetches per second (1 GB/s divided by 10 MB per fetch). This is much faster than the rate at which the **Dominator** can perform full system updates (its limit is 1 machine per second, since each machine must fetch the 1 GB system image, assuming it does not have any peer-to-peer enhancements). Clearly, optimising the **Birther** system would be premature, and will probably never be needed.
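The arithmetic behind these figures can be checked with a few lines. This is an idealised calculation using the numbers listed above; it assumes the network link is the only bottleneck and ignores protocol overhead. The variable and function names are chosen for this example.

```python
NETWORK_BYTES_PER_SECOND = 1_000_000_000  # 1 GB/s
BOOTSTRAP_IMAGE_BYTES = 10_000_000        # 10 MB
SYSTEM_IMAGE_BYTES = 1_000_000_000        # 1 GB


def transfers_per_second(link_bytes_per_second: int, image_bytes: int) -> float:
    """Upper bound on complete image transfers per second over one link."""
    return link_bytes_per_second / image_bytes


# Boot Server: Bootstrap Image fetches per second.
boot_rate = transfers_per_second(NETWORK_BYTES_PER_SECOND, BOOTSTRAP_IMAGE_BYTES)

# Dominator without peer-to-peer enhancements: full system updates per second.
dominator_rate = transfers_per_second(NETWORK_BYTES_PER_SECOND, SYSTEM_IMAGE_BYTES)

# Full system updates that fit into the one-hour design target.
updates_per_hour = dominator_rate * 3600
```

Under these figures the **Boot Server** is 100 times faster than the **Dominator**, and one machine per second corresponds to 3600 full system updates per hour, which is the number to compare against the 100-rack design target.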