SmallStack: Simple, Scalable VM Management
==========================================
Richard Gooch
-------------

Abstract
========

This paper describes a VM management system for a Private Cloud environment which is simple to configure and deploy, has few dependencies, scales to many thousands of physical machines (nodes) with hundreds of VMs per node, is highly reliable and has dynamic IP address allocation. VM create time is as low as one second, which is best in class and approaches container platform performance. This system can integrate closely with the [**Dominator**](https://github.com/Cloud-Foundations/Dominator) ecosystem which provides [manifest driven image generation](https://github.com/Cloud-Foundations/Dominator/blob/master/user-guide/image-manifest.md) with the [Imaginator](https://github.com/Cloud-Foundations/Dominator/blob/master/cmd/imaginator/README.md), high performance image distribution and image-based patching. While you can easily create pets, you also get the tools to farm cattle. By leveraging [Keymaster](https://github.com/Symantec/keymaster), existing organisation/corporate identities may be used for strong authentication (2FA, ephemeral credentials).

Background
==========

Multiple solutions exist for managing Virtual Machines, but each has its own drawbacks:

-   Expensive, no native metadata service, not Open Source (VMware)

-   Complex to configure, deploy and maintain, and lacking in performance and reliability ([OpenStack](https://www.openstack.org/))

-   Limited to small numbers of nodes, no metadata service ([proxmox](https://www.proxmox.com/en/), [Ganeti](http://www.ganeti.org/))

-   Medium complexity, no metadata service, VMs limited to TCP traffic and reliant on SDN/proxies ([virtlet](https://www.mirantis.com/blog/virtlet-vms-containers-opencontrail-network-kubernetes-nfv/))

While much computing workload is migrating to the Public Cloud, there remains a need for on-premise VM capacity. The goal is a cost effective, performant and reliable Private Cloud that lacks the bells and whistles of Public Cloud yet is simple and provides a reliable foundation for baseline computing workload and for building add-on services if needed, *without compromising the robustness of the foundational platform*.

The target audience for this system is the medium to large enterprise, yet it is designed to be so easy to configure, deploy and operate that a small enterprise (which often has between zero and one Operations staff) or a hobbyist can confidently configure and use it.

Currently out of Scope
----------------------

-   Software Defined Networking (SDN). This is needed in the Public Cloud, as each customer has to be completely isolated and hidden from other customers. In a Private Cloud, this has marginal value and imposes complexity, performance and reliability costs. Depending on your network topology, each VM can route to every other VM (open network) or is blocked (partitioned network)

-   Software Firewalls (aka. Security Groups). Similar to SDN, Software Firewalls provided by the platform (the hypervisors) increase complexity and reduce performance and reliability. Further complexity would be required to correctly attribute the resource costs for software firewalls. The responsibility for network protection is left to the VM users, such as by using iptables

-   Remote storage for VMs (i.e. remote volumes). Again, this would increase the complexity of the platform, reduce reliability and dramatically reduce performance of those VMs. VM users can deploy the remote storage solution that fits their needs. VM users who are satisfied with local storage can enjoy a more robust platform. If there is sufficient demand, support for GlusterFS volumes may be added (management of GlusterFS would remain out-of-scope)

-   Live Migration. This is tricky to get right, and has marginal value. Non-live migration is supported, however

-   Load Balancers. These introduce complexity and may *reduce* reliability, so the platform does not provide these. These should be provided by the user inside their VM(s). A well architected client-server system does not need a Load Balancer, as the client(s) should be smart and automatically fail over to a working server. Simple-minded architectures rely on Load Balancers to implement High Availability, thus the Load Balancer becomes a Single Point Of Failure (SPOF) and has to be provisioned/scaled in order to handle peak demand

-   Machine Learning. Speech recognition. Serverless. This project is not trying to (and cannot) compete with the leading Public Cloud offerings. We love to KISS (Keep It Simple, Stupid)

Finally, while SmallStack includes support for installing physical machines, it is not intended to provide a generic Metal as a Service platform such as [Digital Rebar](https://rebar.digital/). At the physical layer, the focus is on installing and managing the life-cycle of Hypervisors.

Design
======

A core invariant is that *every node (physical machine) and VM (virtual machine) has a unique IP address*. The IP address is the primary key by which machines are identified.

Core Components
---------------

This system has three main components:

-   The Hypervisor

-   The Fleet Manager

-   The vm-control utility (or API)

All components are simple-to-deploy Go binaries, built from Open Source software.

### The Hypervisor

This is an agent that runs on each physical node. It has the following responsibilities and components:

-   Uses QEMU+KVM to run VMs

-   Manages the node capacity (CPU, RAM and local storage)

-   Contains a built-in DHCP server to provide network configuration information to the VMs and for installing other Hypervisors via PXE boot

-   Contains a built-in TFTP server which may be used for [birthing](../MachineBirthing/README.md) physical machines (i.e. other Hypervisors) via PXE boot

-   Contains a metadata server (on the 169.254.169.254 link-local address) which can provide other configuration information and credentials to the VMs. [Appendix 1: Metadata Server](#appendix-1-metadata-server) contains more information

-   Contains an object cache which caches some commonly-used objects from [**Dominator**](../Dominator/README.md) ecosystem images. This optional cache improves the performance of creating and updating VMs using these images

A diagram is shown below:
![Hypervisor image](../pictures/Hypervisor.svg)

Configuration of the Hypervisor is minimal: the directory in which to store saved state and the location of an optional [imageserver](https://github.com/Cloud-Foundations/Dominator/blob/master/cmd/imageserver/README.md) from which images may be fetched. Requests to launch VMs are made directly to the Hypervisor by the vm-control utility (or API); the Fleet Manager is not involved in this transaction, although it may (optionally) be used to easily find a Hypervisor with available capacity.

The user requests the creation of a VM of a specified size and, if the Hypervisor has sufficient capacity, the VM is created and the user is given the IP address of the new VM in the response; otherwise an error is returned.

The Hypervisor starts the QEMU process, which detaches and runs in the background. The Hypervisor communicates with the QEMU monitor process over a Unix socket. If the Hypervisor process hangs or crashes, there is no effect on the running VMs: the Hypervisor simply reconnects to the monitor sockets at startup. Thus, the Hypervisor can be upgraded without any impact to running VMs.
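
The monitor protocol is QMP (the QEMU Machine Protocol). The sketch below is illustrative only, not the Hypervisor's actual code: it shows, with an assumed socket path, how an agent can (re)attach to a QEMU monitor Unix socket and negotiate capabilities before issuing commands.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"net"
	"time"
)

// qmpCommand is the standard QMP request envelope.
type qmpCommand struct {
	Execute string `json:"execute"`
}

// connectMonitor (re)connects to a QEMU QMP monitor socket, reads the
// greeting banner and negotiates capabilities, after which commands such as
// "query-status" may be issued. This is a sketch: buffered reads are not
// carried over when the raw connection is returned.
func connectMonitor(socketPath string) (net.Conn, error) {
	conn, err := net.DialTimeout("unix", socketPath, 5*time.Second)
	if err != nil {
		return nil, err
	}
	reader := bufio.NewReader(conn)
	if _, err := reader.ReadString('\n'); err != nil { // QMP greeting banner
		conn.Close()
		return nil, err
	}
	// Capability negotiation is required before any other command.
	if err := json.NewEncoder(conn).Encode(qmpCommand{Execute: "qmp_capabilities"}); err != nil {
		conn.Close()
		return nil, err
	}
	if _, err := reader.ReadString('\n'); err != nil { // expect {"return": {}}
		conn.Close()
		return nil, err
	}
	return conn, nil
}

func main() {
	// The socket path is a hypothetical example.
	conn, err := connectMonitor("/var/lib/hypervisor/vm0/monitor.sock")
	if err != nil {
		fmt.Println("monitor reconnect failed:", err)
		return
	}
	defer conn.Close()
	fmt.Println("reattached to QEMU monitor")
}
```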

The Hypervisor probes the node at startup to determine the total machine capacity. When the Hypervisor is first created, it is initially unable to create VMs because it does not know which IP and MAC addresses are available to allocate to VMs. The Fleet Manager will provide a list (pool) of IP and MAC addresses to the Hypervisor. Once the Hypervisor has this list, it can create VMs until this pool is depleted. The pool is continuously replenished by the Fleet Manager. When a VM is destroyed, its IP and MAC addresses are returned to the pool and are available for immediate reuse.

### The Fleet Manager

The Fleet Manager performs several functions:

-   Replenish Hypervisor address pools

-   Poll Hypervisors to find VMs and monitor utilisation metrics

-   Provide a directory service for the vm-control utility or API

-   Manage snapshots (backups) of VM volumes

The Fleet Manager reads the configuration of the fleet (typically from a URL pointing to a directory tree in a Git repository). This configuration includes:

-   Physical groupings of machines, such as:

    -   region

    -   building

    -   aisle

    -   rack

-   Routing, VLAN and subnet mappings

The topology is discussed in more detail below.

The scope of the Fleet Manager is deliberately limited so that it is reliable and performant, even when managing very large fleets.

#### Address Pool Replenishment

The Fleet Manager has the essential function of monitoring the spare pool of IP and MAC addresses that each Hypervisor has and replenishing those pools when they fall below a threshold. The thresholds at which to replenish or reclaim the Hypervisor address pools are configurable, with the following defaults:

-   Desired number of free addresses: 16

-   Threshold below which the pool is replenished with more addresses (low watermark): 8

-   Threshold above which free addresses are reclaimed (high watermark): 24

Using the fleet topology and this configuration, the Fleet Manager computes IP and MAC address blocks that may be assigned to different groups of Hypervisors and hands them out in small chunks.
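
A minimal sketch of the watermark logic with the default thresholds above follows; the type and function names are illustrative, not the Fleet Manager's actual code.

```go
package main

import "fmt"

// poolThresholds captures the configurable watermarks described above.
type poolThresholds struct {
	desired       uint // target number of free addresses (default: 16)
	lowWatermark  uint // replenish below this (default: 8)
	highWatermark uint // reclaim above this (default: 24)
}

// adjustment returns how many addresses to grant to (positive) or reclaim
// from (negative) a Hypervisor, given its current number of free addresses.
func (t poolThresholds) adjustment(numFree uint) int {
	switch {
	case numFree < t.lowWatermark:
		return int(t.desired) - int(numFree) // top up to the desired level
	case numFree > t.highWatermark:
		return int(t.desired) - int(numFree) // negative: reclaim the excess
	default:
		return 0 // within the hysteresis band: do nothing
	}
}

func main() {
	defaults := poolThresholds{desired: 16, lowWatermark: 8, highWatermark: 24}
	for _, free := range []uint{3, 12, 30} {
		fmt.Printf("free=%d adjustment=%+d\n", free, defaults.adjustment(free))
	}
}
```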

The Fleet Manager is not directly involved in VM creation and thus does not present a single point of failure (SPOF), provided Hypervisors have available IP and MAC addresses in their pools. Since the replenishment threshold is configurable, choosing a large value such as 256 would likely ensure that Hypervisors do not exhaust their pools, even if the Fleet Manager is unavailable indefinitely.

#### Polling Hypervisors

In addition to replenishing the address pools for Hypervisors, the Fleet Manager also receives VM create, change and destroy events from the Hypervisors. It additionally polls the Hypervisors for utilisation metrics. It maintains a global view of all the VMs in the fleet, their placement and their utilisation. This global view is made available as a dashboard and via an RPC protocol for other tools and systems.

#### Directory Service

The Fleet Manager provides a directory service for the vm-control utility or API, so that the utility knows where to find a Hypervisor with available capacity. This service is a (significant) convenience, but is not essential. If the DNS name or address of a Hypervisor with available capacity is known through some other means, the vm-control utility can be provided with that information.

#### Manage VM Snapshots

When a VM is created, an optional automated snapshot (backup) schedule may be specified. The Fleet Manager will instruct Hypervisors to perform snapshots of the local storage volumes for these VMs and will upload the snapshots to a remote object store such as GlusterFS or AWS S3. The data are encrypted by the Hypervisor prior to uploading. The orchestration of snapshotting is centrally managed so that global rate limits and load management may be enforced.

#### High Availability

As discussed above, the Fleet Manager is essential neither to the health of VMs nor to their management, but it is very convenient for the latter. A highly available service using round-robin DNS may be implemented by running multiple Fleet Manager instances, with only one configured to manage the Hypervisors (updating address pools and subnets) and the rest only providing directory services. For each Fleet Manager instance, the IP address is stored in a DNS A record for the Fleet Manager FQDN (i.e. fleet-manager.company.com). Clients such as the vm-control utility or a web browser will automatically connect to a working instance. No load balancer is required; instead, the tool or web browser will time out a connection attempt to an unresponsive Fleet Manager instance and try another instance listed in the DNS record.
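
The client-side failover this relies on can be sketched with only the Go standard library; the FQDN and port below are placeholders.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// dialFleetManager resolves the Fleet Manager FQDN and tries each address
// in turn, timing out quickly on unresponsive instances. No load balancer
// is needed: a dead instance simply causes fail-over to the next address.
func dialFleetManager(fqdn, port string) (net.Conn, error) {
	addrs, err := net.LookupHost(fqdn)
	if err != nil {
		return nil, err
	}
	var lastErr error
	for _, addr := range addrs {
		conn, err := net.DialTimeout("tcp", net.JoinHostPort(addr, port), 3*time.Second)
		if err == nil {
			return conn, nil
		}
		lastErr = err
	}
	return nil, fmt.Errorf("no reachable Fleet Manager instance: %v", lastErr)
}

func main() {
	// The FQDN and port are illustrative placeholders.
	conn, err := dialFleetManager("fleet-manager.company.com", "6977")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer conn.Close()
	fmt.Println("connected to", conn.RemoteAddr())
}
```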

A diagram is shown below:
![SmallStack Components image](../pictures/SmallStackComponents.svg)

### The vm-control utility/API

The [vm-control](https://github.com/Cloud-Foundations/Dominator/blob/master/cmd/vm-control/README.md) utility orchestrates the creation, starting, stopping and destruction of VMs. It typically consults the Fleet Manager to obtain the global view of Hypervisors, their physical location (i.e. failure zones), available capacity and the placement of VMs. This global view is used to find the required Hypervisor. If the address of the Hypervisor is provided, then the Fleet Manager is not consulted. Typical supported VM creation options are:

-   Create VM of specified size anywhere

-   Create one VM per rack or aisle

-   Create VM in the same Hypervisor as a specified VM

-   Create VM in the same rack as a specified VM

-   Create VM in a different rack/aisle from a specified VM

-   Create VM using the same configuration (size, image) as a specified VM

-   Create a VM from a snapshot of another VM

-   Migrate a VM to another Hypervisor

Since all the intelligence of VM orchestration is the responsibility of the vm-control utility, more advanced features such as VM migration and rolling migrations can be added without risking the health of the fleet; neither the Hypervisor nor the Fleet Manager requires new code or extra complexity. They only perform some basic services and implement simple primitives. Different users can experiment with new orchestration features, independently, *without compromising the reliability or integrity of the platform*. Integration with other systems (such as updating a Machine DataBase or DNS records) can be added by the user, with the potential for different systems to be integrated by different users.
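
This split of responsibility can be sketched as follows; the interfaces and method names are hypothetical stand-ins for the real RPC clients, not the actual vm-control code.

```go
package vmcontrolsketch

import (
	"errors"
	"fmt"
)

// directoryService is a hypothetical stand-in for the Fleet Manager's
// directory interface.
type directoryService interface {
	// FindHypervisor returns the address of a Hypervisor with enough spare
	// capacity, honouring a placement hint such as "same rack as VM X".
	FindHypervisor(milliCPUs, memoryMiB uint64, placementHint string) (string, error)
}

// hypervisorClient is a hypothetical stand-in for the Hypervisor RPC client.
type hypervisorClient interface {
	CreateVM(address string, milliCPUs, memoryMiB uint64, imageName string) (ipAddr string, err error)
}

// createVM shows where the orchestration intelligence lives: in the client.
// The Fleet Manager is only consulted as a directory; the create request
// itself goes straight to the chosen Hypervisor.
func createVM(dir directoryService, hyp hypervisorClient,
	hypervisorAddr string, milliCPUs, memoryMiB uint64, image, hint string) (string, error) {
	if hypervisorAddr == "" { // no explicit Hypervisor given: ask the directory
		if dir == nil {
			return "", errors.New("no Hypervisor specified and no Fleet Manager available")
		}
		addr, err := dir.FindHypervisor(milliCPUs, memoryMiB, hint)
		if err != nil {
			return "", fmt.Errorf("placement failed: %w", err)
		}
		hypervisorAddr = addr
	}
	return hyp.CreateVM(hypervisorAddr, milliCPUs, memoryMiB, image)
}
```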

Other VM management operations include:

-   Create an unscheduled snapshot of a VM

-   Replace the root image (volume) of a VM

-   Patch the root image (volume) of a VM

-   Restore the root image (volume) of a VM from the previous volume

-   Stop a VM, preserving the volume(s) on the Hypervisor

-   Stop a VM, snapshot the volume(s) and delete them from the Hypervisor

-   Destroy a VM

-   Destroy a VM and all its snapshots

-   Delete snapshot(s) for a VM

-   Start a VM (from the local preserved volume or a specified snapshot)

-   Connect to a serial port on a VM (remote serial console)

Finally, some low-level Hypervisor management operations are supported, which allow for bringing Hypervisors to a useful state even with no Fleet Manager configured:

-   Add an IP and MAC address to the address pool

-   Add a subnet

Please see the [online documentation](https://github.com/Cloud-Foundations/Dominator/blob/master/cmd/vm-control/README.md) for usage information.

Fleet Topology
--------------

As noted above, the placement and grouping of Hypervisors and subnets must be defined and provided to the Fleet Manager. While some grouping types (such as region, aisle and rack) may be nearly universally applicable, other grouping types (building, cabinet, chassis) may be superfluous and cumbersome if there is a requirement to define them. In addition, Hypervisors may be grouped at different levels of the topology than subnets in different environments. Further, it is difficult to anticipate other possible grouping types.

Rather than pre-defining (hopefully) all potential grouping types or re-writing the topology schema code for each new use-case, the topology is expressed as a file-system hierarchy (a directory tree) which is recursively processed by the Fleet Manager. This approach allows for arbitrary grouping types. Each grouping type is a (sub)directory tree and is called a *location*, which is a generic grouping concept. A location may refer to the entire world, or a specific region, or a specific rack in a specific region, and so on. Subnet definitions may be placed in any location (directory) in the topology. Subnets defined high in the topology cover large parts of the topology (e.g. an entire region or aisle), whereas subnets placed near the bottom of the topology tree cover only small groups (e.g. a single rack or even a single Hypervisor). It is valid to define a high-level (broad) subnet (e.g. for a “management” VLAN) while also defining low-level (narrow) subnets (e.g. for a “production” VLAN) in the same topology.
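
A sketch of such recursive processing is shown below; the file name `subnets.json` and the field names are simplifying assumptions rather than the Fleet Manager's actual schema.

```go
package topologysketch

import (
	"encoding/json"
	"os"
	"path/filepath"
)

// Subnet is an assumed, simplified subnet definition.
type Subnet struct {
	Id        string `json:"Id"`
	IpGateway string `json:"IpGateway"`
	IpMask    string `json:"IpMask"`
	VlanId    uint   `json:"VlanId"`
}

// Location is a node in the topology: every directory is a location, and a
// subnet defined here applies to everything beneath it.
type Location struct {
	Name     string
	Subnets  []Subnet
	Children []*Location
}

// loadTopology walks the directory tree recursively, attaching subnet
// definitions at whatever level they appear.
func loadTopology(dir string) (*Location, error) {
	loc := &Location{Name: filepath.Base(dir)}
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, err
	}
	for _, entry := range entries {
		path := filepath.Join(dir, entry.Name())
		if entry.IsDir() {
			child, err := loadTopology(path) // recurse into sub-locations
			if err != nil {
				return nil, err
			}
			loc.Children = append(loc.Children, child)
			continue
		}
		if entry.Name() == "subnets.json" { // illustrative file name
			data, err := os.ReadFile(path)
			if err != nil {
				return nil, err
			}
			if err := json.Unmarshal(data, &loc.Subnets); err != nil {
				return nil, err
			}
		}
	}
	return loc, nil
}
```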

### Example Topology

An example topology with two large regions (NYC and SJC) and one smaller region (SYD) is available at [https://github.com/Cloud-Foundations/Dominator/tree/master/cmd/fleet-manager/example-topology](https://github.com/Cloud-Foundations/Dominator/tree/master/cmd/fleet-manager/example-topology). Each region has three VLANs:

-   Production: for products serving customers

-   Infrastructure: for internal infrastructure services

-   Egress: for VMs which have Internet egress access via a NAT gateway

VM Migration
------------

As mentioned in the non-goals section earlier, live VM migration is not supported. However, migration with restart is supported. The vm-control utility will instruct the *target* Hypervisor to fetch the local storage of the VM from the *source* Hypervisor. This does not interfere with the running VM. Once fetched, the vm-control utility will instruct the *source* Hypervisor to stop the VM, and will then instruct the *target* Hypervisor to fetch any changes (diffs) since the first fetch. This second fetch should be quite fast, since only changes are fetched. Direct Hypervisor-to-Hypervisor transfer ensures the best performance. The VM is then started on the *target* Hypervisor and destroyed on the *source* Hypervisor. In most cases, the downtime for the VM is approximately the reboot time for the VM, even though the apparent *migration time* may be significantly longer if a large amount of data has to be moved.
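
The sequence can be sketched as follows, with hypothetical RPC methods standing in for the real Hypervisor interface (the real protocol also deals with volume lists, DHCP registration and rollback).

```go
package migrationsketch

// hypervisorAPI is a hypothetical stand-in for the Hypervisor RPC interface
// used by vm-control during a (non-live) migration.
type hypervisorAPI interface {
	FetchVolumes(vmIP, sourceHypervisor string) error // full or incremental copy
	StopVM(vmIP string) error
	StartVM(vmIP string) error
	DestroyVM(vmIP string) error
}

// migrate implements the sequence described above: copy while the VM is
// still running, stop it, copy only the changes, then start on the target
// and destroy on the source. Downtime is roughly the second (small) copy
// plus a reboot.
func migrate(source, target hypervisorAPI, vmIP, sourceAddr string) error {
	if err := target.FetchVolumes(vmIP, sourceAddr); err != nil { // bulk copy, VM still running
		return err
	}
	if err := source.StopVM(vmIP); err != nil {
		return err
	}
	if err := target.FetchVolumes(vmIP, sourceAddr); err != nil { // fetch the diffs only
		return err
	}
	if err := target.StartVM(vmIP); err != nil {
		return err
	}
	return source.DestroyVM(vmIP)
}
```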

A more disruptive migration or fleet rebuild may be performed by stopping and snapshotting groups of VMs and later starting the VMs (restoring from snapshot) after the rebuild operation has completed.

By default, VM migration is only possible between Hypervisors on the same subnet, so that the IP address can be preserved. The user can choose to migrate with IP address reassignment. If the vm-control utility is integrated with a DNS update system, a change of IP address may be safe for the service that the VM is running.

Boot Images
-----------

The first-class images supported are the [**Dominator**](../Dominator/README.md) ecosystem images. These images are preferred because the [**Dominator**](../Dominator/README.md) ecosystem provides services for easy, reproducible builds, fast distribution of image content and artefact generation for other platforms such as AWS, yielding a true Hybrid Cloud experience for users (developers). They also provide the fastest VM boot experience, typically 5 seconds from the VM create request to when your bootloader is running. These images are Linux only. By encouraging the use of [images built from manifests](https://github.com/Cloud-Foundations/Dominator/blob/master/user-guide/image-manifest.md), it becomes trivially easy to replace or clone VMs across failure domains, whether a different rack, aisle, building or geographic region. Furthermore, it is also a simple step to enable the [**Dominator**](../Dominator/README.md) for safe and reliable upgrades, patch management and self-healing.

If you have non-Linux images or do not want to use images from the [**Dominator**](../Dominator/README.md) ecosystem, there are two other supported options for specifying boot image content:

-   A local RAW image (a boot disc image). The vm-control utility will stream the image data to the Hypervisor. For good performance, this should be done close to the Hypervisor (another VM on the same Hypervisor is best)

-   An HTTP/HTTPS URL pointing to a RAW image. In effect, you are providing your own image server. Again, for good performance, the image server should have a fast network connection to the Hypervisor.

With these two options, you can quickly set up the whole system and use familiar tools, keeping the barrier to entry low. You can upgrade to using [**Dominator**](../Dominator/README.md) images at any time.
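
For the HTTP/HTTPS option, any static file server will do. A minimal sketch using the Go standard library follows; the directory and port are placeholders.

```go
package main

import (
	"log"
	"net/http"
)

func main() {
	// Serve RAW boot images from a local directory (path is a placeholder).
	// vm-control can then be pointed at a URL such as
	// http://<image-host>:8080/<image-name>.raw
	http.Handle("/", http.FileServer(http.Dir("/var/lib/images")))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```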

### cloud-init

The [cloud-init](https://cloud-init.io/) package allows VMs in a Cloud Platform to automatically configure themselves, using data provided by the Cloud Platform (through a metadata service or a virtual configuration drive). With some simple modifications, [cloud-init](https://cloud-init.io/) can support the SmallStack metadata service, allowing VMs to self-configure in the same way.

Upgrading VMs
-------------

Modern DevOps Best Practices for updating services and infrastructure urge the use of immutable infrastructure and strongly discourage logging into machines to perform updates. Rather than update running systems, the philosophy is to deploy new systems (with the latest software), verify and test the new systems and (gradually) replace the old systems with the new ones (such as by redirecting requests/workload from old to new systems). While SmallStack fully supports (and encourages) this model, it is recognised that this model may be more challenging to adopt, particularly in on-premise environments, for various reasons:

-   Systems may have a large amount of data which are costly or time-consuming to move

-   The IP addresses of systems may be configured into other dependent systems (e.g. network devices such as routers, firewalls and load balancers)

SmallStack provides easy to use, leading-edge options for updating systems that retain many of the benefits of the immutable infrastructure model (reproducible deployments, consistency across systems, no partial updates) without the above challenges. There are three update modes available:

-   **Live patching** VMs with the [**Dominator**](../Dominator/README.md). This requires the [subd](https://github.com/Cloud-Foundations/Dominator/blob/master/cmd/subd/README.md) agent to run in the VMs. It is the fastest way to update VMs with the least service disruption (services being updated are stopped for under a second while critical changes are made). This is limited to Linux VMs, as the [subd](https://github.com/Cloud-Foundations/Dominator/blob/master/cmd/subd/README.md) agent has not been ported to other operating systems

-   **Zombie patching** of VMs by using the same image-based patching used by the [**Dominator**](../Dominator/README.md) for the root volume. The VM must be stopped, then the Hypervisor will perform the update on the root volume, after which the VM may be started again. As with live patching, configuration changes and data are not modified. This approach is more disruptive than live patching as the VM needs to be shut down, upgraded and then started, but does not require running an agent in the VMs. This is limited to VMs which use the Linux ext4fs for the root volume

-   **Carousel (rebirth)** of VMs by *replacing* the root volume with a new boot image. A fresh root volume is created from an image source and the root volume for the VM is replaced (while the VM is stopped). The old root volume is preserved in case a rollback (restore) is required. Secondary volumes, which typically contain large data stores, are unaffected. This approach is the most disruptive as any configuration changes made on the previous root volume will be lost. The image replacement method has the advantage of working for all guest OS types and does not require running an instance of [subd](https://github.com/Cloud-Foundations/Dominator/blob/master/cmd/subd/README.md) in the VM or a specific file-system format for the root volume. To help mitigate the loss of configuration changes, configuration data may be stored in the user data for the VM, which are available from the metadata service. User data are persistent for the lifetime of the VM and are independent of the root volume.

Upgrading Hypervisors
---------------------

The principal challenge in maintaining a Private Cloud lies with managing the system software life-cycle of the infrastructure, particularly the Hypervisors. There is a need for safe and rapid patching capability (for security, bugs or features). Hypervisors cannot be redeployed, as they contain precious data and workloads (VMs) that are costly to move. Since SmallStack evolved out of the [**Dominator**](../Dominator/README.md) ecosystem, image-based live-patching of Hypervisors is not just supported but is the recommended method for life-cycle maintenance. The [hyper-control](https://github.com/Cloud-Foundations/Dominator/blob/master/cmd/hyper-control/README.md) utility allows rolling out a [**Dominator**](../Dominator/README.md) ecosystem image to a fleet of Hypervisors in minutes. This rollout uses an [arithmetic progression](https://en.wikipedia.org/wiki/Arithmetic_progression) in a sequence of steps:

-   First one Hypervisor is upgraded and a health check is performed (concurrency=1)

-   If healthy, two Hypervisors are upgraded (concurrency=2)

-   For every step that completes (Hypervisors upgraded and health checks pass), the concurrency level is incremented by one before starting the next step

-   If a health check fails, the rolling upgrade is halted

The rollout starts slowly, and gains speed as more Hypervisors are successfully upgraded. The number of rollout steps is approximately sqrt(N\*2), where N is the number of Hypervisors. Here are some example rollout times:

-   100 Hypervisors, no reboot needed (15 second upgrade+test): 3m32s

-   100 Hypervisors, fast reboot (1 minute upgrade+test): 14m9s

-   100 Hypervisors, slow reboot (5 minute upgrade+test): 1h11m

-   10000 Hypervisors, no reboot needed (15 second upgrade+test): 35m21s

-   10000 Hypervisors, fast reboot (1 minute upgrade+test): 2h21m

-   10000 Hypervisors, slow reboot (5 minute upgrade+test): 11h47m
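
The step count and total times can be estimated with the sketch below, where step i upgrades i Hypervisors concurrently. It prints 14 steps for 100 Hypervisors and 141 steps for 10000, giving totals slightly under the figures listed above (which presumably include some per-step overhead).

```go
package main

import (
	"fmt"
	"time"
)

// rolloutSteps returns the number of steps needed to upgrade numHypervisors
// when step i upgrades i Hypervisors concurrently (1, 2, 3, ...).
func rolloutSteps(numHypervisors uint64) uint64 {
	var steps, done uint64
	for done < numHypervisors {
		steps++
		done += steps
	}
	return steps
}

func main() {
	for _, n := range []uint64{100, 10000} {
		for _, stepTime := range []time.Duration{15 * time.Second, time.Minute, 5 * time.Minute} {
			steps := rolloutSteps(n)
			fmt.Printf("%5d Hypervisors, %v per step: %d steps, total %v\n",
				n, stepTime, steps, time.Duration(steps)*stepTime)
		}
	}
}
```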

The reboots are required if the kernel on the Hypervisor needs to be upgraded. Since this is less common than upgrading other system software, most fleet upgrades run at the higher speed.

Since the rollout is fully automated, the burden is low. Security patches can be applied promptly, safely and with confidence.

Security Model
--------------

All RPC methods require client-side X509 certificates and are secured with TLS 1.2 or higher. This is the same mechanism used in the rest of the [**Dominator**](../Dominator/README.md) ecosystem. The ephemeral certificates that [Keymaster](https://github.com/Symantec/keymaster) generates may be used directly to identify users and their group memberships, which are used to determine whether to grant or deny access to create and manipulate VMs. Access to VMs and subnets is granted based on the identity of the user and their group memberships (i.e. LDAP groups). This simple yet flexible approach leverages existing roles/groupings in an organisation, avoiding the need to maintain a mapping between one authentication and authorisation system and another.

Stated simply: SmallStack uses existing Corporate/Organisational identities/credentials.
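
A minimal sketch of a client making such a mutually authenticated connection with the Go standard library follows; the certificate paths and server address are placeholders (in practice, Keymaster issues the short-lived certificate and key).

```go
package main

import (
	"crypto/tls"
	"fmt"
)

func main() {
	// Paths are placeholders; in practice Keymaster issues short-lived
	// certificates and keys for the user.
	cert, err := tls.LoadX509KeyPair("/home/user/.ssl/cert.pem", "/home/user/.ssl/key.pem")
	if err != nil {
		fmt.Println("loading client certificate:", err)
		return
	}
	config := &tls.Config{
		Certificates: []tls.Certificate{cert}, // client-side X509 certificate
		MinVersion:   tls.VersionTLS12,        // TLS 1.2 or higher, as required
	}
	// The server address is a placeholder; 6976 is the typical Hypervisor port.
	conn, err := tls.Dial("tcp", "hypervisor.example.com:6976", config)
	if err != nil {
		fmt.Println("dial failed:", err)
		return
	}
	defer conn.Close()
	fmt.Println("mutually authenticated connection to", conn.RemoteAddr())
}
```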

Credential Management (Coming soon)
-----------------------------------

As stated above, SmallStack integrates with a simple, yet very secure system to authenticate users when accessing resources (VMs). In many environments, credentials are required not only for *users* but also for *services* (aka automation users). A service requires long-lived credentials in order to continue to function. Unfortunately, these long lived credentials are often poorly managed and are frequently stored in convenient but insecure locations (documents, source code repositories, web servers, S3 buckets, etc.). These unsecured credentials are routinely leaked, leading to system and data compromise.

SmallStack leverages the capabilities of [Keymaster](https://github.com/Symantec/keymaster) to generate short-term credentials for automation users (please see the section “Automation user support” in the [Keymaster](https://github.com/Symantec/keymaster) design document for details). A user may create a VM and request to assign a *role* to the VM. The role is simply an automation user. The vm-control tool will request a long-term credential for the specified automation user. If granted, this credential is passed to the Hypervisor. The Hypervisor will periodically use this credential to request a short-term credential for the automation user. This credential is provided to the VM via the metadata service. Service code running on the VM thus has access to updated credentials which can be used to authenticate the service to other services. The burden of credential management is removed from users and their deployment tools; instead the SmallStack platform manages their credentials securely and conveniently. This is similar to the assignment of roles to instances (VMs) in AWS.

### Associating AWS Roles

With the above capability to assign an internal ([Keymaster](https://github.com/Symantec/keymaster)) role to a VM, this can be further extended by using the [CloudGate](https://github.com/Symantec/Cloud-Gate) Federated Broker. The internal role is mapped to an AWS role. The Hypervisor can use the ephemeral [Keymaster](https://github.com/Symantec/keymaster) credentials to request AWS API STS access tokens from [CloudGate](https://github.com/Symantec/Cloud-Gate). These STS tokens are provided to the VM via the metadata service, at the same URL as the AWS metadata service. Code running on SmallStack VMs can thus make AWS API calls just as on an AWS instance, without any modification of the code. This allows infrastructure running on-premise in a Private Cloud to integrate more seamlessly with Public Cloud infrastructure. The vm-control tool will request an AWS role to associate with the VM, and the SmallStack Platform will transparently provide and manage the credentials.

Siloed Networks
---------------

In some environments, networks may be siloed from each other for security or compliance reasons. This design allows for central visibility and resource management even in the presence of separated (firewalled) networks, provided that the networks do not have overlapping IP addresses. If the Fleet Manager can connect to all the Hypervisors, then the single view of all resources can be preserved. This avoids duplication of resources. Since it is only the vm-control utility which can create or mutate VMs, a user can create or mutate VMs provided their connection to the relevant Hypervisor is not blocked by firewalls.

It is recommended that Hypervisors in siloed networks can be reached from a common Fleet Manager and that these Hypervisors can connect to a common [imageserver](https://github.com/Cloud-Foundations/Dominator/blob/master/cmd/imageserver/README.md) from the [**Dominator**](../Dominator/README.md) ecosystem. This avoids any duplication of resources yet supports the strong network isolation that some may desire. If snapshots are required, the Hypervisors must be able to connect to the remote storage system(s).

Containers
----------

These are all the rage now. A container cluster can be deployed onto the VMs. A more exciting possibility is integrating a container orchestrator such as Kubernetes with this system. The capacity could be dynamically shared between VMs and container pods, with the containers running directly on the nodes.

A more advanced integration would be to implement “Container VMs”. In this system, the container orchestrator would use the Hypervisor to launch a VM with a stripped-down hosting kernel and container/pod manager. The platform would provide the kernel, thus hiding the details from the user. These containers would enjoy stronger isolation and security properties than normal containers. A container with root access would be safe to run, being isolated in its dedicated VM. Prior work on [Clear Containers](https://lwn.net/Articles/644675/) suggests subsecond container (stripped-down VM) start times are feasible.

Appendix 1: Metadata Server
===========================

The metadata server provides a simple information/introspection service to all VMs. It is available on port 80 of the link-local address 169.254.169.254. This may be used by cloud-init to introspect and configure the VM. The following paths are available:

| Path                                       | Contents                            |
|--------------------------------------------|-------------------------------------|
| /datasource/SmallStack                     | true                                |
| /latest/dynamic/epoch-time                 | Seconds.nanoseconds since the Epoch |
| /latest/dynamic/instance-identity/document | VM information                      |
| /latest/user-data                          | Raw blob of user data               |

The Hypervisor control port (typically 6976) is also available at the link-local address 169.254.169.254. This allows VMs (with valid identity certificates) to create sibling VMs without needing to know their location in the network topology. An example application of this feature is a builder service orchestrator which creates a sibling VM to build an image with potentially untrusted code.
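
A VM can read the paths in the table above with any HTTP client; a minimal Go sketch:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// fetchMetadata reads one path from the link-local metadata service.
func fetchMetadata(path string) (string, error) {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get("http://169.254.169.254" + path)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

func main() {
	for _, path := range []string{
		"/datasource/SmallStack",
		"/latest/dynamic/instance-identity/document",
		"/latest/user-data",
	} {
		if value, err := fetchMetadata(path); err == nil {
			fmt.Printf("%s: %s\n", path, value)
		} else {
			fmt.Printf("%s: error: %v\n", path, err)
		}
	}
}
```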

Networking Implementation
-------------------------

The metadata server is part of the Hypervisor process on the host machine, which poses some technical challenges:

-   The host may not be on the same subnet/VLAN as the VMs

-   The host may have an existing service on port 80

To solve this, the Hypervisor does the following for each bridge device (see the sketch after this list):

-   Creates a new Linux Network Namespace (Linux Namespaces are the foundational technology for Containers). This is the metadata server namespace

-   Creates a veth (virtual Ethernet) device pair

    -   Moves one side into the metadata namespace and configures it with the link-local address (169.254.169.254)

    -   Attaches the remaining side to the bridge device (in the primary namespace)

-   Adds routing table entries for all the subnets in the metadata namespace. This allows packets from the metadata server to reach the VMs

-   Creates a listening socket on port 80 in the metadata namespace

-   Creates an ebtables PREROUTING chain on the nat table for the bridge device to direct packets for the link-local address to the MAC address of the veth device in the metadata namespace. This allows packets from the VMs (addressed to the metadata service) to reach the metadata server
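
The Hypervisor performs this plumbing itself through the kernel interfaces; the sketch below shows roughly equivalent iproute2/ebtables commands driven from Go, purely for illustration. The namespace, device and bridge names, the example subnet and the MAC address are placeholders, and the exact ebtables rule used by the Hypervisor may differ.

```go
package main

import (
	"fmt"
	"os/exec"
)

// run executes one command, stopping at the first failure.
func run(args ...string) error {
	out, err := exec.Command(args[0], args[1:]...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("%v: %v: %s", args, err, out)
	}
	return nil
}

func main() {
	// Names (md-ns, veth-md*, br0), the example subnet and the veth MAC are
	// placeholders; the Hypervisor does this for each bridge device.
	cmds := [][]string{
		{"ip", "netns", "add", "md-ns"}, // the metadata server namespace
		{"ip", "link", "add", "veth-md", "type", "veth", "peer", "name", "veth-md-ns"},
		{"ip", "link", "set", "veth-md-ns", "netns", "md-ns"}, // one end into the namespace
		{"ip", "-n", "md-ns", "addr", "add", "169.254.169.254/16", "dev", "veth-md-ns"},
		{"ip", "-n", "md-ns", "link", "set", "veth-md-ns", "up"},
		{"ip", "link", "set", "veth-md", "master", "br0"}, // other end onto the bridge
		{"ip", "link", "set", "veth-md", "up"},
		{"ip", "-n", "md-ns", "route", "add", "10.1.2.0/24", "dev", "veth-md-ns"}, // reach the VM subnet
		// Redirect metadata traffic from VMs to the MAC of the veth inside the namespace.
		{"ebtables", "-t", "nat", "-A", "PREROUTING", "-p", "IPv4",
			"--ip-dst", "169.254.169.254", "-j", "dnat", "--to-destination", "02:00:00:00:00:01"},
	}
	for _, cmd := range cmds {
		if err := run(cmd...); err != nil {
			fmt.Println(err)
			return
		}
	}
	fmt.Println("metadata namespace plumbed; listen on 169.254.169.254:80 inside md-ns")
}
```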

Appendix 2: Performance
=======================

Baseline VM
-----------

A typical Debian v9 (Stretch) VM with a 1 GiB root volume takes approximately 10 seconds to create, boot up and be ready to accept SSH connections. The Hypervisor CPU is an Intel Xeon E5-2650 v2 @ 2.60GHz with spinning HDD storage. The time is spent in these main activities:

-   5 seconds fetching the image from the [imageserver](https://github.com/Cloud-Foundations/Dominator/blob/master/cmd/imageserver/README.md) and unpacking into the root volume. The larger the image, the more time will be taken. A future version of the Hypervisor will employ a local object cache to improve this by at least a factor of 2

-   1 second installing the bootloader (GRUB). A future version of the Hypervisor will support direct booting for Linux kernels

-   3 seconds between the start of the VM and when the OS requests an IP address via DHCP

-   1 second for cloud-init and other boot code in the VM to complete and SSH to be ready

Below is an example log from vm-control creating such a VM with total time taken shown after:

```
2018/12/15 10:02:07 creating VM on hyper-567.sjc.prod.company.com:6976
2018/12/15 10:02:07 getting image
2018/12/15 10:02:08 unpacking image
2018/12/15 10:02:13 starting VM
10.2.3.4
2018/12/15 10:02:16 Received DHCP ACK
2018/12/15 10:02:17 /datasource/SmallStack
2018/12/15 10:02:17 /latest/user-data
2018/12/15 10:02:17 /latest/dynamic/instance-identity/document
2018/12/15 10:02:17 /latest/dynamic/instance-identity/document
2018/12/15 10:02:17 /latest/dynamic/instance-identity/document
2018/12/15 10:02:18 Port 22 is open

real 0m10.254s
user 0m0.016s
sys 0m0.004s
```

Optimised VM boot
-----------------

The above Debian v9 (Stretch) VM configuration takes approximately 7-8 seconds to create, boot up and be ready for SSH connections when using the following optimisations:

-   Local object cache

-   Direct kernel booting (skipping bootloader)

The boot time breakdown is:

-   2.5 seconds fetching the image from the [imageserver](https://github.com/Cloud-Foundations/Dominator/blob/master/cmd/imageserver/README.md) and unpacking into the root volume

-   3 seconds between the start of the VM and when the OS requests an IP address via DHCP

-   1 second for cloud-init and other boot code in the VM to complete and SSH to be ready

Below is an example log from vm-control creating such a VM with total time taken shown after:

```
2019/01/11 08:06:31 creating VM on hyper-567.sjc.prod.company.com:6976
2019/01/11 08:06:31 getting image
2019/01/11 08:06:31 unpacking image: minimal/Debian-9/2019-01-11:07:16:45
2019/01/11 08:06:35 starting VM
10.2.3.4
2019/01/11 08:06:38 Received DHCP ACK
2019/01/11 08:06:38 /datasource/SmallStack
2019/01/11 08:06:38 /latest/user-data
2019/01/11 08:06:39 Port 22 is open
real 0m7.447s
user 0m0.020s
sys 0m0.000s
```

Appliance (container) VM
------------------------

A simple image with a stripped-down kernel (all drivers built into the kernel, virtio driver), no bootloader, no initrd and an init script which only runs the udhcpc DHCP client takes 1.2 seconds to boot. A future version of the Hypervisor may support [Firecracker](https://github.com/firecracker-microvm/firecracker) ([announced](https://aws.amazon.com/blogs/aws/firecracker-lightweight-virtualization-for-serverless-computing/) by AWS) to further reduce this time.

Below is an example log from vm-control creating such a VM with total time taken shown after:

```
2019/01/12 00:24:43 creating VM on localhost:6976
2019/01/12 00:24:43 getting image
2019/01/12 00:24:43 unpacking image: test/2019-01-12:00:13:20
2019/01/12 00:24:43 starting VM
10.2.3.4
2019/01/12 00:24:44 Received DHCP ACK
real 0m1.216s
user 0m0.019s
sys 0m0.011s
```