SmallStack: Simple, Scalable VM Management
==========================================
Richard Gooch
-------------

Abstract
========

This paper describes a VM management system for a Private Cloud environment which is simple to configure and deploy, has few dependencies, scales to many thousands of physical machines (nodes) with hundreds of VMs per node, is highly reliable and has dynamic IP address allocation. VM create time is as low as one second, which is best in class and approaches container platform performance. This system can integrate closely with the [**Dominator**](https://github.com/Cloud-Foundations/Dominator) ecosystem which provides [manifest driven image generation](https://github.com/Cloud-Foundations/Dominator/blob/master/user-guide/image-manifest.md) with the [Imaginator](https://github.com/Cloud-Foundations/Dominator/blob/master/cmd/imaginator/README.md), high performance image distribution and image-based patching. While you can easily create pets, you also get the tools to farm cattle. By leveraging [Keymaster](https://github.com/Symantec/keymaster), existing organisation/corporate identities may be used for strong authentication (2FA, ephemeral credentials).

Background
==========

Multiple solutions exist for managing Virtual Machines, but each has its own drawbacks:

-   Expensive, no native metadata service, not Open Source (VMware)

-   Complex to configure, deploy and maintain, and lacking in performance and reliability ([OpenStack](https://www.openstack.org/))

-   Limited to small numbers of nodes, no metadata service ([proxmox](https://www.proxmox.com/en/), [Ganeti](http://www.ganeti.org/))

-   Medium complexity, no metadata service, VMs limited to TCP traffic and reliant on SDN/proxies ([virtlet](https://www.mirantis.com/blog/virtlet-vms-containers-opencontrail-network-kubernetes-nfv/))

While much computing workload is migrating to the Public Cloud, there remains a need for on-premise VM capacity. The goal is a cost effective, performant and reliable Private Cloud that lacks the bells and whistles of Public Cloud yet is simple and provides a reliable foundation for baseline computing workload and for building add-on services if needed, *without compromising the robustness of the foundational platform*.

The target audience for this system is the medium to large enterprise, yet it is designed to be so easy to configure, deploy and operate that a small enterprise (which often has between zero and one Operations staff) or a hobbyist can confidently configure and use it.

Currently out of Scope
----------------------

-   Software Defined Networking (SDN). This is needed in the Public Cloud, as each customer has to be completely isolated and hidden from other customers. In a Private Cloud, this has marginal value and imposes complexity, performance and reliability costs. Depending on your network topology, each VM can route to every other VM (open network) or is blocked (partitioned network)

-   Software Firewalls (aka. Security Groups). Similar to SDN, Software Firewalls provided by the platform (the hypervisors) increase complexity and reduce performance and reliability. Further complexity would be required to correctly attribute the resource costs for software firewalls. The responsibility for network protection is left to the VM users, such as by using iptables

-   Remote storage for VMs (i.e. remote volumes). Again, this would increase the complexity of the platform, reduce reliability and dramatically reduce performance of those VMs. VM users can deploy the remote storage solution that fits their needs. VM users who are satisfied with local storage can enjoy a more robust platform. If there is sufficient demand, support for GlusterFS volumes may be added (management of GlusterFS would remain out-of-scope)

-   Live Migration. This is tricky to get right, and has marginal value. Non-live migration is supported, however

-   Load Balancers. These introduce complexity and may *reduce* reliability, so the platform does not provide these. These should be provided by the user inside their VM(s). A well architected client-server system does not need a Load Balancer, as the client(s) should be smart and automatically fail over to a working server. Simple-minded architectures rely on Load Balancers to implement High Availability, thus the Load Balancer becomes a Single Point Of Failure (SPOF) and has to be provisioned/scaled in order to handle peak demand

-   Machine Learning. Speech recognition. Serverless. This project is not trying to (and cannot) compete with the leading Public Cloud offerings. We love to KISS (Keep It Simple, Stupid)

Finally, while SmallStack includes support for installing physical machines, it is not intended to provide a generic Metal as a Service platform such as [Digital Rebar](https://rebar.digital/). At the physical layer, the focus is on installing and managing the life-cycle of Hypervisors.

Design
======

A core invariant is that *every node (physical machine) and VM (virtual machine) has a unique IP address*. The IP address is the primary key by which machines are identified.

Core Components
---------------

This system has three main components:

-   The Hypervisor

-   The Fleet Manager

-   The vm-control utility (or API)

All components are simple-to-deploy Go binaries, built from Open Source software.

### The Hypervisor

This is an agent that runs on each physical node. It has the following responsibilities and components:

-   Uses QEMU+KVM to run VMs

-   Manages the node capacity (CPU, RAM and local storage)

-   Contains a built-in DHCP server to provide network configuration information to the VMs and for installing other Hypervisors via PXE boot

-   Contains a built-in TFTP server which may be used for [birthing](../MachineBirthing/README.md) physical machines (i.e. other Hypervisors) via PXE boot

-   Contains a metadata server (on the 169.254.169.254 link-local address) which can provide other configuration information and credentials to the VMs. [Appendix 1: Metadata Server](#appendix-1-metadata-server) contains more information

-   Contains an object cache which caches some commonly-used objects from [**Dominator**](../Dominator/README.md) ecosystem images. This optional cache improves the performance of creating and updating VMs using these images

A diagram is shown below:
![Hypervisor image](../pictures/Hypervisor.svg)

Configuration of the Hypervisor is minimal: the directory in which to store saved state and the location of an optional [imageserver](https://github.com/Cloud-Foundations/Dominator/blob/master/cmd/imageserver/README.md) from which images may be fetched. Requests to launch VMs are made directly to the Hypervisor by the vm-control utility (or API); the Fleet Manager is not involved in this transaction, although it may (optionally) be used to easily find a Hypervisor with available capacity.

The user requests the creation of a VM of a specified size and, if the Hypervisor has sufficient capacity, the VM is created and the user is given the IP address of the new VM in the response; otherwise an error is returned.

The Hypervisor starts the QEMU process, which detaches and runs in the background. The Hypervisor communicates with the QEMU monitor process over a Unix socket. If the Hypervisor process hangs or crashes, there is no effect on the running VMs: the Hypervisor simply reconnects to the monitor sockets at startup. Thus, the Hypervisor can be upgraded without any impact to running VMs.
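
The monitor protocol is QMP (the QEMU Machine Protocol). The sketch below is illustrative only, not the Hypervisor's actual code: it shows, with an assumed socket path, how an agent can (re)attach to a QEMU monitor Unix socket and negotiate capabilities before issuing commands.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"net"
	"time"
)

// qmpCommand is the standard QMP request envelope.
type qmpCommand struct {
	Execute string `json:"execute"`
}

// connectMonitor (re)connects to a QEMU QMP monitor socket, reads the
// greeting banner and negotiates capabilities, after which commands such as
// "query-status" may be issued. This is a sketch: buffered reads are not
// carried over when the raw connection is returned.
func connectMonitor(socketPath string) (net.Conn, error) {
	conn, err := net.DialTimeout("unix", socketPath, 5*time.Second)
	if err != nil {
		return nil, err
	}
	reader := bufio.NewReader(conn)
	if _, err := reader.ReadString('\n'); err != nil { // QMP greeting banner
		conn.Close()
		return nil, err
	}
	// Capability negotiation is required before any other command.
	if err := json.NewEncoder(conn).Encode(qmpCommand{Execute: "qmp_capabilities"}); err != nil {
		conn.Close()
		return nil, err
	}
	if _, err := reader.ReadString('\n'); err != nil { // expect {"return": {}}
		conn.Close()
		return nil, err
	}
	return conn, nil
}

func main() {
	// The socket path is a hypothetical example.
	conn, err := connectMonitor("/var/lib/hypervisor/vm0/monitor.sock")
	if err != nil {
		fmt.Println("monitor reconnect failed:", err)
		return
	}
	defer conn.Close()
	fmt.Println("reattached to QEMU monitor")
}
```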

The Hypervisor probes the node at startup to determine the total machine capacity. When the Hypervisor is first created, it is initially unable to create VMs because it does not know which IP and MAC addresses are available to allocate to VMs. The Fleet Manager will provide a list (pool) of IP and MAC addresses to the Hypervisor. Once the Hypervisor has this list, it can create VMs until this pool is depleted. The pool is continuously replenished by the Fleet Manager. When a VM is destroyed, its IP and MAC addresses are returned to the pool and are available for immediate reuse.

### The Fleet Manager

The Fleet Manager performs several functions:

-   Replenish Hypervisor address pools

-   Poll Hypervisors to find VMs and monitor utilisation metrics

-   Provide a directory service for the vm-control utility or API

-   Manage snapshots (backups) of VM volumes

The Fleet Manager reads the configuration of the fleet (typically from a URL pointing to a directory tree in a Git repository). This configuration includes:

-   Physical groupings of machines, such as:

    -   region

    -   building

    -   aisle

    -   rack

-   Routing, VLAN and subnet mappings

The topology is discussed in more detail below.

The scope of the Fleet Manager is deliberately limited so that it is reliable and performant, even when managing very large fleets.

#### Address Pool Replenishment

The Fleet Manager has the essential function of monitoring the spare pool of IP and MAC addresses that each Hypervisor has and replenishing those pools when they fall below a threshold. The thresholds at which to replenish or reclaim the Hypervisor address pools are configurable, with the following defaults:

-   Desired number of free addresses: 16

-   Threshold below which the pool is replenished with more addresses (low watermark): 8

-   Threshold above which free addresses are reclaimed (high watermark): 24

Using the fleet topology and this configuration, the Fleet Manager computes IP and MAC address blocks that may be assigned to different groups of Hypervisors and hands them out in small chunks.
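
A minimal sketch of the watermark logic with the default thresholds above follows; the type and function names are illustrative, not the Fleet Manager's actual code.

```go
package main

import "fmt"

// poolThresholds captures the configurable watermarks described above.
type poolThresholds struct {
	desired       uint // target number of free addresses (default: 16)
	lowWatermark  uint // replenish below this (default: 8)
	highWatermark uint // reclaim above this (default: 24)
}

// adjustment returns how many addresses to grant to (positive) or reclaim
// from (negative) a Hypervisor, given its current number of free addresses.
func (t poolThresholds) adjustment(numFree uint) int {
	switch {
	case numFree < t.lowWatermark:
		return int(t.desired) - int(numFree) // top up to the desired level
	case numFree > t.highWatermark:
		return int(t.desired) - int(numFree) // negative: reclaim the excess
	default:
		return 0 // within the hysteresis band: do nothing
	}
}

func main() {
	defaults := poolThresholds{desired: 16, lowWatermark: 8, highWatermark: 24}
	for _, free := range []uint{3, 12, 30} {
		fmt.Printf("free=%d adjustment=%+d\n", free, defaults.adjustment(free))
	}
}
```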

The Fleet Manager is not directly involved in VM creation and thus does not present a single point of failure (SPOF), provided Hypervisors have available IP and MAC addresses in their pools. Since the replenishment threshold is configurable, choosing a large value such as 256 would likely ensure that Hypervisors do not exhaust their pools, even if the Fleet Manager is unavailable indefinitely.

#### Polling Hypervisors

In addition to replenishing the address pools for Hypervisors, the Fleet Manager also receives VM create, change and destroy events from the Hypervisors. It additionally polls the Hypervisors for utilisation metrics. It maintains a global view of all the VMs in the fleet, their placement and their utilisation. This global view is made available as a dashboard and via an RPC protocol for other tools and systems.

#### Directory Service

The Fleet Manager provides a directory service for the vm-control utility or API, so that the utility knows where to find a Hypervisor with available capacity. This service is a (significant) convenience, but is not essential. If the DNS name or address of a Hypervisor with available capacity is known through some other means, the vm-control utility can be provided with that information.

#### Manage VM Snapshots

When a VM is created, an optional automated snapshot (backup) schedule may be specified. The Fleet Manager will instruct Hypervisors to perform snapshots of the local storage volumes for these VMs and will upload the snapshots to a remote object store such as GlusterFS or AWS S3. The data are encrypted by the Hypervisor prior to uploading. The orchestration of snapshotting is centrally managed so that global rate limits and load management may be enforced.

#### High Availability

As discussed above, the Fleet Manager is essential neither to the health of VMs nor to their management, but it is very convenient for the latter. A highly available service using round-robin DNS may be implemented by running multiple Fleet Manager instances, with only one configured to manage the Hypervisors (updating address pools and subnets) and the rest only providing directory services. For each Fleet Manager instance, the IP address is stored in a DNS A record for the Fleet Manager FQDN (i.e. fleet-manager.company.com). Clients such as the vm-control utility or a web browser will automatically connect to a working instance. No load balancer is required; instead, the tool or web browser will time out a connection attempt to an unresponsive Fleet Manager instance and try another instance listed in the DNS record.
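
The client-side failover this relies on can be sketched with only the Go standard library; the FQDN and port below are placeholders.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// dialFleetManager resolves the Fleet Manager FQDN and tries each address
// in turn, timing out quickly on unresponsive instances. No load balancer
// is needed: a dead instance simply causes fail-over to the next address.
func dialFleetManager(fqdn, port string) (net.Conn, error) {
	addrs, err := net.LookupHost(fqdn)
	if err != nil {
		return nil, err
	}
	var lastErr error
	for _, addr := range addrs {
		conn, err := net.DialTimeout("tcp", net.JoinHostPort(addr, port), 3*time.Second)
		if err == nil {
			return conn, nil
		}
		lastErr = err
	}
	return nil, fmt.Errorf("no reachable Fleet Manager instance: %v", lastErr)
}

func main() {
	// The FQDN and port are illustrative placeholders.
	conn, err := dialFleetManager("fleet-manager.company.com", "6977")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer conn.Close()
	fmt.Println("connected to", conn.RemoteAddr())
}
```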

A diagram is shown below:
![SmallStack Components image](../pictures/SmallStackComponents.svg)

### The vm-control utility/API

The [vm-control](https://github.com/Cloud-Foundations/Dominator/blob/master/cmd/vm-control/README.md) utility orchestrates the creation, starting, stopping and destruction of VMs. It typically consults the Fleet Manager to obtain the global view of Hypervisors, their physical location (i.e. failure zones), available capacity and the placement of VMs. This global view is used to find the required Hypervisor. If the address of the Hypervisor is provided, then the Fleet Manager is not consulted. Typical supported VM creation options are:

-   Create VM of specified size anywhere

-   Create one VM per rack or aisle

-   Create VM in the same Hypervisor as a specified VM

-   Create VM in the same rack as a specified VM

-   Create VM in a different rack/aisle from a specified VM

-   Create VM using the same configuration (size, image) as a specified VM

-   Create a VM from a snapshot of another VM

-   Migrate a VM to another Hypervisor

Since all the intelligence of VM orchestration is the responsibility of the vm-control utility, more advanced features such as VM migration and rolling migrations can be added without risking the health of the fleet; neither the Hypervisor nor the Fleet Manager requires new code or extra complexity. They only perform some basic services and implement simple primitives. Different users can experiment with new orchestration features, independently, *without compromising the reliability or integrity of the platform*. Integration with other systems (such as updating a Machine DataBase or DNS records) can be added by the user, with the potential for different systems to be integrated by different users.
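
This split of responsibility can be sketched as follows; the interfaces and method names are hypothetical stand-ins for the real RPC clients, not the actual vm-control code.

```go
package vmcontrolsketch

import (
	"errors"
	"fmt"
)

// directoryService is a hypothetical stand-in for the Fleet Manager's
// directory interface.
type directoryService interface {
	// FindHypervisor returns the address of a Hypervisor with enough spare
	// capacity, honouring a placement hint such as "same rack as VM X".
	FindHypervisor(milliCPUs, memoryMiB uint64, placementHint string) (string, error)
}

// hypervisorClient is a hypothetical stand-in for the Hypervisor RPC client.
type hypervisorClient interface {
	CreateVM(address string, milliCPUs, memoryMiB uint64, imageName string) (ipAddr string, err error)
}

// createVM shows where the orchestration intelligence lives: in the client.
// The Fleet Manager is only consulted as a directory; the create request
// itself goes straight to the chosen Hypervisor.
func createVM(dir directoryService, hyp hypervisorClient,
	hypervisorAddr string, milliCPUs, memoryMiB uint64, image, hint string) (string, error) {
	if hypervisorAddr == "" { // no explicit Hypervisor given: ask the directory
		if dir == nil {
			return "", errors.New("no Hypervisor specified and no Fleet Manager available")
		}
		addr, err := dir.FindHypervisor(milliCPUs, memoryMiB, hint)
		if err != nil {
			return "", fmt.Errorf("placement failed: %w", err)
		}
		hypervisorAddr = addr
	}
	return hyp.CreateVM(hypervisorAddr, milliCPUs, memoryMiB, image)
}
```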

Other VM management operations include:

-   Create an unscheduled snapshot of a VM

-   Replace the root image (volume) of a VM

-   Patch the root image (volume) of a VM

-   Restore the root image (volume) of a VM from the previous volume

-   Stop a VM, preserving the volume(s) on the Hypervisor

-   Stop a VM, snapshot the volume(s) and delete them from the Hypervisor

-   Destroy a VM

-   Destroy a VM and all its snapshots

-   Delete snapshot(s) for a VM

-   Start a VM (from the local preserved volume or a specified snapshot)

-   Connect to a serial port on a VM (remote serial console)

Finally, some low-level Hypervisor management operations are supported, which allow for bringing Hypervisors to a useful state even with no Fleet Manager configured:

-   Add an IP and MAC address to the address pool

-   Add a subnet

Please see the [online documentation](https://github.com/Cloud-Foundations/Dominator/blob/master/cmd/vm-control/README.md) for usage information.

Fleet Topology
--------------

As noted above, the placement and grouping of Hypervisors and subnets must be defined and provided to the Fleet Manager. While some grouping types (such as region, aisle and rack) may be nearly universally applicable, other grouping types (building, cabinet, chassis) may be superfluous and cumbersome if there is a requirement to define them. In addition, Hypervisors may be grouped at different levels of the topology than subnets in different environments. Further, it is difficult to anticipate other possible grouping types.

Rather than pre-defining (hopefully) all potential grouping types or re-writing the topology schema code for each new use-case, the topology is expressed as a file-system hierarchy (a directory tree) which is recursively processed by the Fleet Manager. This approach allows for arbitrary grouping types. Each grouping type is a (sub)directory tree and is called a *location*, which is a generic grouping concept. A location may refer to the entire world, or a specific region, or a specific rack in a specific region, and so on. Subnet definitions may be placed in any location (directory) in the topology. Subnets defined high in the topology cover large parts of the topology (e.g. an entire region or aisle), whereas subnets placed near the bottom of the topology tree cover only small groups (e.g. a single rack or even a single Hypervisor). It is valid to define a high-level (broad) subnet (e.g. for a “management” VLAN) while also defining low-level (narrow) subnets (e.g. for a “production” VLAN) in the same topology.
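
A sketch of such recursive processing is shown below; the file name `subnets.json` and the field names are simplifying assumptions rather than the Fleet Manager's actual schema.

```go
package topologysketch

import (
	"encoding/json"
	"os"
	"path/filepath"
)

// Subnet is an assumed, simplified subnet definition.
type Subnet struct {
	Id        string `json:"Id"`
	IpGateway string `json:"IpGateway"`
	IpMask    string `json:"IpMask"`
	VlanId    uint   `json:"VlanId"`
}

// Location is a node in the topology: every directory is a location, and a
// subnet defined here applies to everything beneath it.
type Location struct {
	Name     string
	Subnets  []Subnet
	Children []*Location
}

// loadTopology walks the directory tree recursively, attaching subnet
// definitions at whatever level they appear.
func loadTopology(dir string) (*Location, error) {
	loc := &Location{Name: filepath.Base(dir)}
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, err
	}
	for _, entry := range entries {
		path := filepath.Join(dir, entry.Name())
		if entry.IsDir() {
			child, err := loadTopology(path) // recurse into sub-locations
			if err != nil {
				return nil, err
			}
			loc.Children = append(loc.Children, child)
			continue
		}
		if entry.Name() == "subnets.json" { // illustrative file name
			data, err := os.ReadFile(path)
			if err != nil {
				return nil, err
			}
			if err := json.Unmarshal(data, &loc.Subnets); err != nil {
				return nil, err
			}
		}
	}
	return loc, nil
}
```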

### Example Topology

An example topology with two large regions (NYC and SJC) and one smaller region (SYD) is available at [https://github.com/Cloud-Foundations/Dominator/tree/master/cmd/fleet-manager/example-topology](https://github.com/Cloud-Foundations/Dominator/tree/master/cmd/fleet-manager/example-topology). Each region has three VLANs:

-   Production: for products serving customers

-   Infrastructure: for internal infrastructure services

-   Egress: for VMs which have Internet egress access via a NAT gateway

VM Migration
------------

As mentioned in the non-goals section earlier, live VM migration is not supported. However, migration with restart is supported. The vm-control utility will instruct the *target* Hypervisor to fetch the local storage of the VM from the *source* Hypervisor. This does not interfere with the running VM. Once fetched, the vm-control utility will instruct the *source* Hypervisor to stop the VM, and will then instruct the *target* Hypervisor to fetch any changes (diffs) since the first fetch. This second fetch should be quite fast, since only changes are fetched. Direct Hypervisor-to-Hypervisor transfer ensures the best performance. The VM is then started on the *target* Hypervisor and destroyed on the *source* Hypervisor. In most cases, the downtime for the VM is approximately the reboot time for the VM, even though the apparent *migration time* may be significantly longer if a large amount of data has to be moved.
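
The sequence can be sketched as follows, with hypothetical RPC methods standing in for the real Hypervisor interface (the real protocol also deals with volume lists, DHCP registration and rollback).

```go
package migrationsketch

// hypervisorAPI is a hypothetical stand-in for the Hypervisor RPC interface
// used by vm-control during a (non-live) migration.
type hypervisorAPI interface {
	FetchVolumes(vmIP, sourceHypervisor string) error // full or incremental copy
	StopVM(vmIP string) error
	StartVM(vmIP string) error
	DestroyVM(vmIP string) error
}

// migrate implements the sequence described above: copy while the VM is
// still running, stop it, copy only the changes, then start on the target
// and destroy on the source. Downtime is roughly the second (small) copy
// plus a reboot.
func migrate(source, target hypervisorAPI, vmIP, sourceAddr string) error {
	if err := target.FetchVolumes(vmIP, sourceAddr); err != nil { // bulk copy, VM still running
		return err
	}
	if err := source.StopVM(vmIP); err != nil {
		return err
	}
	if err := target.FetchVolumes(vmIP, sourceAddr); err != nil { // fetch the diffs only
		return err
	}
	if err := target.StartVM(vmIP); err != nil {
		return err
	}
	return source.DestroyVM(vmIP)
}
```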

A more disruptive migration or fleet rebuild may be performed by stopping and snapshotting groups of VMs and later starting the VMs (restoring from snapshot) after the rebuild operation has completed.

By default, VM migration is only possible between Hypervisors on the same subnet, so that the IP address can be preserved. The user can choose to migrate with IP address reassignment. If the vm-control utility is integrated with a DNS update system, a change of IP address may be safe for the service that the VM is running.

Boot Images
-----------

The first-class images supported are the [**Dominator**](../Dominator/README.md) ecosystem images. These images are preferred because the [**Dominator**](../Dominator/README.md) ecosystem provides services for easy, reproducible builds, fast distribution of image content and artefact generation for other platforms such as AWS, yielding a true Hybrid Cloud experience for users (developers). They also provide the fastest VM boot experience, typically 5 seconds from the VM create request to when your bootloader is running. These images are Linux only. By encouraging the use of [images built from manifests](https://github.com/Cloud-Foundations/Dominator/blob/master/user-guide/image-manifest.md), it becomes trivially easy to replace or clone VMs across failure domains, whether a different rack, aisle, building or geographic region. Furthermore, it is also a simple step to enable the [**Dominator**](../Dominator/README.md) for safe and reliable upgrades, patch management and self-healing.

If you have non-Linux images or do not want to use images from the [**Dominator**](../Dominator/README.md) ecosystem, there are two other supported options for specifying boot image content:

-   A local RAW image (a boot disc image). The vm-control utility will stream the image data to the Hypervisor. For good performance, this should be done close to the Hypervisor (another VM on the same Hypervisor is best)

-   An HTTP/HTTPS URL pointing to a RAW image. In effect, you are providing your own image server. Again, for good performance, the image server should have a fast network connection to the Hypervisor.

With these two options, you can quickly set up the whole system and use familiar tools, keeping the barrier to entry low. You can upgrade to using [**Dominator**](../Dominator/README.md) images at any time.
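
For the HTTP/HTTPS option, any static file server will do. A minimal sketch using the Go standard library follows; the directory and port are placeholders.

```go
package main

import (
	"log"
	"net/http"
)

func main() {
	// Serve RAW boot images from a local directory (path is a placeholder).
	// vm-control can then be pointed at a URL such as
	// http://<image-host>:8080/<image-name>.raw
	http.Handle("/", http.FileServer(http.Dir("/var/lib/images")))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```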

### cloud-init

The [cloud-init](https://cloud-init.io/) package allows VMs in a Cloud Platform to automatically configure themselves, using data provided by the Cloud Platform (through a metadata service or a virtual configuration drive). With some simple modifications, [cloud-init](https://cloud-init.io/) can support the SmallStack metadata service, allowing VMs to self-configure in the same way.

Upgrading VMs
-------------

Modern DevOps Best Practices for updating services and infrastructure urge the use of immutable infrastructure and strongly discourage logging into machines to perform updates. Rather than update running systems, the philosophy is to deploy new systems (with the latest software), verify and test the new systems and (gradually) replace the old systems with the new ones (such as by redirecting requests/workload from old to new systems). While SmallStack fully supports (and encourages) this model, it is recognised that this model may be more challenging to adopt, particularly in on-premise environments, for various reasons:

-   Systems may have a large amount of data which are costly or time-consuming to move

-   The IP addresses of systems may be configured into other dependent systems (e.g. network devices such as routers, firewalls and load balancers)

SmallStack provides easy to use, leading-edge options for updating systems that retain many of the benefits of the immutable infrastructure model (reproducible deployments, consistency across systems, no partial updates) without the above challenges. There are three update modes available:

-   **Live patching** VMs with the [**Dominator**](../Dominator/README.md). This requires the [subd](https://github.com/Cloud-Foundations/Dominator/blob/master/cmd/subd/README.md) agent to run in the VMs. It is the fastest way to update VMs with the least service disruption (services being updated are stopped for under a second while critical changes are made). This is limited to Linux VMs, as the [subd](https://github.com/Cloud-Foundations/Dominator/blob/master/cmd/subd/README.md) agent has not been ported to other operating systems

-   **Zombie patching** of VMs by using the same image-based patching used by the [**Dominator**](../Dominator/README.md) for the root volume. The VM must be stopped, then the Hypervisor will perform the update on the root volume, after which the VM may be started again. As with live patching, configuration changes and data are not modified. This approach is more disruptive than live patching as the VM needs to be shut down, upgraded and then started, but does not require running an agent in the VMs. This is limited to VMs which use the Linux ext4fs for the root volume

-   **Carousel (rebirth)** of VMs by *replacing* the root volume with a new boot image. A fresh root volume is created from an image source and the root volume for the VM is replaced (while the VM is stopped). The old root volume is preserved in case a rollback (restore) is required. Secondary volumes, which typically contain large data stores, are unaffected. This approach is the most disruptive as any configuration changes made on the previous root volume will be lost. The image replacement method has the advantage of working for all guest OS types and does not require running an instance of [subd](https://github.com/Cloud-Foundations/Dominator/blob/master/cmd/subd/README.md) in the VM or a specific file-system format for the root volume. To help mitigate the loss of configuration changes, configuration data may be stored in the user data for the VM, which are available from the metadata service. User data are persistent for the lifetime of the VM and are independent of the root volume.

Upgrading Hypervisors
---------------------

The principal challenge in maintaining a Private Cloud lies with managing the system software life-cycle of the infrastructure, particularly the Hypervisors. There is a need for safe and rapid patching capability (for security, bugs or features). Hypervisors cannot be redeployed, as they contain precious data and workloads (VMs) that are costly to move. Since SmallStack evolved out of the [**Dominator**](../Dominator/README.md) ecosystem, image-based live-patching of Hypervisors is not just supported but is the recommended method for life-cycle maintenance. The [hyper-control](https://github.com/Cloud-Foundations/Dominator/blob/master/cmd/hyper-control/README.md) utility allows rolling out a [**Dominator**](../Dominator/README.md) ecosystem image to a fleet of Hypervisors in minutes. This rollout uses an [arithmetic progression](https://en.wikipedia.org/wiki/Arithmetic_progression) in a sequence of steps:

-   First one Hypervisor is upgraded and a health check is performed (concurrency=1)

-   If healthy, two Hypervisors are upgraded (concurrency=2)

-   For every step that completes (Hypervisors upgraded and health checks pass), the concurrency level is incremented by one before starting the next step

-   If a health check fails, the rolling upgrade is halted

The rollout starts slowly, and gains speed as more Hypervisors are successfully upgraded. The number of rollout steps is approximately sqrt(N\*2), where N is the number of Hypervisors. Here are some example rollout times:

-   100 Hypervisors, no reboot needed (15 second upgrade+test): 3m32s

-   100 Hypervisors, fast reboot (1 minute upgrade+test): 14m9s

-   100 Hypervisors, slow reboot (5 minute upgrade+test): 1h11m

-   10000 Hypervisors, no reboot needed (15 second upgrade+test): 35m21s

-   10000 Hypervisors, fast reboot (1 minute upgrade+test): 2h21m

-   10000 Hypervisors, slow reboot (5 minute upgrade+test): 11h47m
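
The step count and total times can be estimated with the sketch below, where step i upgrades i Hypervisors concurrently. It prints 14 steps for 100 Hypervisors and 141 steps for 10000, giving totals slightly under the figures listed above (which presumably include some per-step overhead).

```go
package main

import (
	"fmt"
	"time"
)

// rolloutSteps returns the number of steps needed to upgrade numHypervisors
// when step i upgrades i Hypervisors concurrently (1, 2, 3, ...).
func rolloutSteps(numHypervisors uint64) uint64 {
	var steps, done uint64
	for done < numHypervisors {
		steps++
		done += steps
	}
	return steps
}

func main() {
	for _, n := range []uint64{100, 10000} {
		for _, stepTime := range []time.Duration{15 * time.Second, time.Minute, 5 * time.Minute} {
			steps := rolloutSteps(n)
			fmt.Printf("%5d Hypervisors, %v per step: %d steps, total %v\n",
				n, stepTime, steps, time.Duration(steps)*stepTime)
		}
	}
}
```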

The reboots are required if the kernel on the Hypervisor needs to be upgraded. Since this is less common than upgrading other system software, most fleet upgrades run at the higher speed.

Since the rollout is fully automated, the burden is low. Security patches can be applied promptly, safely and with confidence.

Security Model
--------------

All RPC methods require client-side X509 certificates and are secured with TLS 1.2 or higher. This is the same mechanism used in the rest of the [**Dominator**](../Dominator/README.md) ecosystem. The ephemeral certificates that [Keymaster](https://github.com/Symantec/keymaster) generates may be used directly to identify users and their group memberships, which are used to determine whether to grant or deny access to create and manipulate VMs. Access to VMs and subnets is granted based on the identity of the user and their group memberships (i.e. LDAP groups). This simple yet flexible approach leverages existing roles/groupings in an organisation, avoiding the need to maintain a mapping between one authentication and authorisation system and another.

Stated simply: SmallStack uses existing Corporate/Organisational identities/credentials.
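
A minimal sketch of a client making such a mutually authenticated connection with the Go standard library follows; the certificate paths and server address are placeholders (in practice, Keymaster issues the short-lived certificate and key).

```go
package main

import (
	"crypto/tls"
	"fmt"
)

func main() {
	// Paths are placeholders; in practice Keymaster issues short-lived
	// certificates and keys for the user.
	cert, err := tls.LoadX509KeyPair("/home/user/.ssl/cert.pem", "/home/user/.ssl/key.pem")
	if err != nil {
		fmt.Println("loading client certificate:", err)
		return
	}
	config := &tls.Config{
		Certificates: []tls.Certificate{cert}, // client-side X509 certificate
		MinVersion:   tls.VersionTLS12,        // TLS 1.2 or higher, as required
	}
	// The server address is a placeholder; 6976 is the typical Hypervisor port.
	conn, err := tls.Dial("tcp", "hypervisor.example.com:6976", config)
	if err != nil {
		fmt.Println("dial failed:", err)
		return
	}
	defer conn.Close()
	fmt.Println("mutually authenticated connection to", conn.RemoteAddr())
}
```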

Credential Management (Coming soon)
-----------------------------------

As stated above, SmallStack integrates with a simple, yet very secure system to authenticate users when accessing resources (VMs). In many environments, credentials are required not only for *users* but also for *services* (aka automation users). A service requires long-lived credentials in order to continue to function. Unfortunately, these long lived credentials are often poorly managed and are frequently stored in convenient but insecure locations (documents, source code repositories, web servers, S3 buckets, etc.). These unsecured credentials are routinely leaked, leading to system and data compromise.

SmallStack leverages the capabilities of [Keymaster](https://github.com/Symantec/keymaster) to generate short-term credentials for automation users (please see the section “Automation user support” in the [Keymaster](https://github.com/Symantec/keymaster) design document for details). A user may create a VM and request to assign a *role* to the VM. The role is simply an automation user. The vm-control tool will request a long-term credential for the specified automation user. If granted, this credential is passed to the Hypervisor. The Hypervisor will periodically use this credential to request a short-term credential for the automation user. This credential is provided to the VM via the metadata service. Service code running on the VM thus has access to updated credentials which can be used to authenticate the service to other services. The burden of credential management is removed from users and their deployment tools; instead the SmallStack platform manages their credentials securely and conveniently. This is similar to the assignment of roles to instances (VMs) in AWS.

### Associating AWS Roles

With the above capability to assign an internal ([Keymaster](https://github.com/Symantec/keymaster)) role to a VM, this can be further extended by using the [CloudGate](https://github.com/Symantec/Cloud-Gate) Federated Broker. The internal role is mapped to an AWS role. The Hypervisor can use the ephemeral [Keymaster](https://github.com/Symantec/keymaster) credentials to request AWS API STS access tokens from [CloudGate](https://github.com/Symantec/Cloud-Gate). These STS tokens are provided to the VM via the metadata service, at the same URL as the AWS metadata service. Code running on SmallStack VMs can thus make AWS API calls just as on an AWS instance, without any modification of the code. This allows infrastructure running on-premise in a Private Cloud to integrate more seamlessly with Public Cloud infrastructure. The vm-control tool will request an AWS role to associate with the VM, and the SmallStack Platform will transparently provide and manage the credentials.

Siloed Networks
---------------

In some environments, networks may be siloed from each other for security or compliance reasons. This design allows for central visibility and resource management even in the presence of separated (firewalled) networks, provided that the networks do not have overlapping IP addresses. If the Fleet Manager can connect to all the Hypervisors, then the single view of all resources can be preserved. This avoids duplication of resources. Since it is only the vm-control utility which can create or mutate VMs, a user can create or mutate VMs provided their connection to the relevant Hypervisor is not blocked by firewalls.

It is recommended that Hypervisors in siloed networks can be reached from a common Fleet Manager and that these Hypervisors can connect to a common [imageserver](https://github.com/Cloud-Foundations/Dominator/blob/master/cmd/imageserver/README.md) from the [**Dominator**](../Dominator/README.md) ecosystem. This avoids any duplication of resources yet supports the strong network isolation that some may desire. If snapshots are required, the Hypervisors must be able to connect to the remote storage system(s).

Containers
----------

These are all the rage now. A container cluster can be deployed onto the VMs. A more exciting possibility is integrating a container orchestrator such as Kubernetes with this system. The capacity could be dynamically shared between VMs and container pods, with the containers running directly on the nodes.

A more advanced integration would be to implement “Container VMs”. In this system, the container orchestrator would use the Hypervisor to launch a VM with a stripped-down hosting kernel and container/pod manager. The platform would provide the kernel, thus hiding the details from the user. These containers would enjoy stronger isolation and security properties than normal containers. A container with root access would be safe to run, being isolated in its dedicated VM. Prior work on [Clear Containers](https://lwn.net/Articles/644675/) suggests subsecond container (stripped-down VM) start times are feasible.

Appendix 1: Metadata Server
===========================

The metadata server provides a simple information/introspection service to all VMs. It is available on port 80 of the link-local address 169.254.169.254. This may be used by cloud-init to introspect and configure the VM. The following paths are available:

| Path                                       | Contents                            |
|--------------------------------------------|-------------------------------------|
| /datasource/SmallStack                     | true                                |
| /latest/dynamic/epoch-time                 | Seconds.nanoseconds since the Epoch |
| /latest/dynamic/instance-identity/document | VM information                      |
| /latest/user-data                          | Raw blob of user data               |

The Hypervisor control port (typically 6976) is also available at the link-local address 169.254.169.254. This allows VMs (with valid identity certificates) to create sibling VMs without needing to know their location in the network topology. An example application of this feature is a builder service orchestrator which creates a sibling VM to build an image with potentially untrusted code.
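
A VM can read the paths in the table above with any HTTP client; a minimal Go sketch:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// fetchMetadata reads one path from the link-local metadata service.
func fetchMetadata(path string) (string, error) {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get("http://169.254.169.254" + path)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

func main() {
	for _, path := range []string{
		"/datasource/SmallStack",
		"/latest/dynamic/instance-identity/document",
		"/latest/user-data",
	} {
		if value, err := fetchMetadata(path); err == nil {
			fmt.Printf("%s: %s\n", path, value)
		} else {
			fmt.Printf("%s: error: %v\n", path, err)
		}
	}
}
```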

Networking Implementation
-------------------------

The metadata server is part of the Hypervisor process on the host machine, which poses some technical challenges:

-   The host may not be on the same subnet/VLAN as the VMs

-   The host may have an existing service on port 80

To solve this, the Hypervisor does the following for each bridge device (see the sketch after this list):

-   Creates a new Linux Network Namespace (Linux Namespaces are the foundational technology for Containers). This is the metadata server namespace

-   Creates a veth (virtual Ethernet) device pair

    -   Moves one side into the metadata namespace and configures it with the link-local address (169.254.169.254)

    -   Attaches the remaining side to the bridge device (in the primary namespace)

-   Adds routing table entries for all the subnets in the metadata namespace. This allows packets from the metadata server to reach the VMs

-   Creates a listening socket on port 80 in the metadata namespace

-   Creates an ebtables PREROUTING chain on the nat table for the bridge device to direct packets for the link-local address to the MAC address of the veth device in the metadata namespace. This allows packets from the VMs (addressed to the metadata service) to reach the metadata server
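
The Hypervisor performs this plumbing itself through the kernel interfaces; the sketch below shows roughly equivalent iproute2/ebtables commands driven from Go, purely for illustration. The namespace, device and bridge names, the example subnet and the MAC address are placeholders, and the exact ebtables rule used by the Hypervisor may differ.

```go
package main

import (
	"fmt"
	"os/exec"
)

// run executes one command, stopping at the first failure.
func run(args ...string) error {
	out, err := exec.Command(args[0], args[1:]...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("%v: %v: %s", args, err, out)
	}
	return nil
}

func main() {
	// Names (md-ns, veth-md*, br0), the example subnet and the veth MAC are
	// placeholders; the Hypervisor does this for each bridge device.
	cmds := [][]string{
		{"ip", "netns", "add", "md-ns"}, // the metadata server namespace
		{"ip", "link", "add", "veth-md", "type", "veth", "peer", "name", "veth-md-ns"},
		{"ip", "link", "set", "veth-md-ns", "netns", "md-ns"}, // one end into the namespace
		{"ip", "-n", "md-ns", "addr", "add", "169.254.169.254/16", "dev", "veth-md-ns"},
		{"ip", "-n", "md-ns", "link", "set", "veth-md-ns", "up"},
		{"ip", "link", "set", "veth-md", "master", "br0"}, // other end onto the bridge
		{"ip", "link", "set", "veth-md", "up"},
		{"ip", "-n", "md-ns", "route", "add", "10.1.2.0/24", "dev", "veth-md-ns"}, // reach the VM subnet
		// Redirect metadata traffic from VMs to the MAC of the veth inside the namespace.
		{"ebtables", "-t", "nat", "-A", "PREROUTING", "-p", "IPv4",
			"--ip-dst", "169.254.169.254", "-j", "dnat", "--to-destination", "02:00:00:00:00:01"},
	}
	for _, cmd := range cmds {
		if err := run(cmd...); err != nil {
			fmt.Println(err)
			return
		}
	}
	fmt.Println("metadata namespace plumbed; listen on 169.254.169.254:80 inside md-ns")
}
```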

Appendix 2: Performance
=======================

Baseline VM
-----------

A typical Debian v9 (Stretch) VM with a 1 GiB root volume takes approximately 10 seconds to create, boot up and be ready to accept SSH connections. The Hypervisor CPU is an Intel Xeon E5-2650 v2 @ 2.60GHz with spinning HDD storage. The time is spent in these main activities:

-   5 seconds fetching the image from the [imageserver](https://github.com/Cloud-Foundations/Dominator/blob/master/cmd/imageserver/README.md) and unpacking into the root volume. The larger the image, the more time will be taken. A future version of the Hypervisor will employ a local object cache to improve this by at least a factor of 2

-   1 second installing the bootloader (GRUB). A future version of the Hypervisor will support direct booting for Linux kernels

-   3 seconds between the start of the VM and when the OS requests an IP address via DHCP

-   1 second for cloud-init and other boot code in the VM to complete and SSH to be ready

Below is an example log from vm-control creating such a VM with total time taken shown after:

```
2018/12/15 10:02:07 creating VM on hyper-567.sjc.prod.company.com:6976
2018/12/15 10:02:07 getting image
2018/12/15 10:02:08 unpacking image
2018/12/15 10:02:13 starting VM
10.2.3.4
2018/12/15 10:02:16 Received DHCP ACK
2018/12/15 10:02:17 /datasource/SmallStack
2018/12/15 10:02:17 /latest/user-data
2018/12/15 10:02:17 /latest/dynamic/instance-identity/document
2018/12/15 10:02:17 /latest/dynamic/instance-identity/document
2018/12/15 10:02:17 /latest/dynamic/instance-identity/document
2018/12/15 10:02:18 Port 22 is open

real 0m10.254s
user 0m0.016s
sys 0m0.004s
```

Optimised VM boot
-----------------

The above Debian v9 (Stretch) VM configuration takes approximately 7-8 seconds to create, boot up and be ready for SSH connections when using the following optimisations:

-   Local object cache

-   Direct kernel booting (skipping bootloader)

The boot time breakdown is:

-   2.5 seconds fetching the image from the [imageserver](https://github.com/Cloud-Foundations/Dominator/blob/master/cmd/imageserver/README.md) and unpacking into the root volume

-   3 seconds between the start of the VM and when the OS requests an IP address via DHCP

-   1 second for cloud-init and other boot code in the VM to complete and SSH to be ready

Below is an example log from vm-control creating such a VM with total time taken shown after:

```
2019/01/11 08:06:31 creating VM on hyper-567.sjc.prod.company.com:6976
2019/01/11 08:06:31 getting image
2019/01/11 08:06:31 unpacking image: minimal/Debian-9/2019-01-11:07:16:45
2019/01/11 08:06:35 starting VM
10.2.3.4
2019/01/11 08:06:38 Received DHCP ACK
2019/01/11 08:06:38 /datasource/SmallStack
2019/01/11 08:06:38 /latest/user-data
2019/01/11 08:06:39 Port 22 is open
real 0m7.447s
user 0m0.020s
sys 0m0.000s
```

Appliance (container) VM
------------------------

A simple image with a stripped-down kernel (all drivers built into the kernel, virtio driver), no bootloader, no initrd and an init script which only runs the udhcpc DHCP client takes 1.2 seconds to boot. A future version of the Hypervisor may support [Firecracker](https://github.com/firecracker-microvm/firecracker) ([announced](https://aws.amazon.com/blogs/aws/firecracker-lightweight-virtualization-for-serverless-computing/) by AWS) to further reduce this time.

Below is an example log from vm-control creating such a VM with total time taken shown after:

```
2019/01/12 00:24:43 creating VM on localhost:6976
2019/01/12 00:24:43 getting image
2019/01/12 00:24:43 unpacking image: test/2019-01-12:00:13:20
2019/01/12 00:24:43 starting VM
10.2.3.4
2019/01/12 00:24:44 Received DHCP ACK
real 0m1.216s
user 0m0.019s
sys 0m0.011s
```