System Domination: Image Management
===================================
Richard Gooch
-------------

Overview
========

This document describes the design of a robust, reliable and efficient architecture which can scale to the very largest fleet of machines (physical or virtual). The design target is that a single system and administrator can manage the content of *at least* 10,000 systems with negligible performance impact on the managed systems, fast response to global changes and nearly atomic changes on those systems. The software system that implements this architecture is called the **Dominator**. This software is Open Source and available on [GitHub](https://github.com/Cloud-Foundations/Dominator).

Background
==========

Managing a large fleet of computer systems presents a different set of challenges than managing a single system or a few systems. A variety of package management and configuration tools have been developed to more easily manage single systems, and they work well enough for that purpose, although they are human-intensive. When configuring and updating a fleet of systems, different tools are required. The case studies in the section below compare operational efficiencies between two organisations, one using conventional fleet management tools and another using a system like **Dominator**.

Several configuration management tools have been developed which leverage the package management paradigm to configure and maintain large numbers of machines. The **Dominator** takes a radically different approach to fleet management based on three principles:

- immutable infrastructure

- golden “baked” images

- fast, robust transitions.

Rather than package management, the **Dominator** uses an image management and deployment approach.

The first two principles of immutable infrastructure and golden “baked” images have gained significant mindshare in recent years with the birth of the container development and deployment model. These two principles drive better development and operational behaviours and support comprehensive integration testing and reproducibility. This paradigm forces testing and configuration work to be 100% “front-loaded”, rather than the more familiar “push something first, then patch and configure later” model. The pre-config model yields both better quality testing and more confidence in deployments.

The third principle supports rapid and safe deployments. A fast transition reduces the chance of failure due to potential inconsistencies during the transition. A robust transition ensures that a transition does not fail part way through. Partial transitions can lead to failures if the system is left in an inconsistent state. Most configuration management systems fail to satisfy this principle, in large part due to their package-driven approach to updates.

One of the happy benefits of these three principles and the way **Dominator** implements them is that once an image has been baked and tested for “new” deployments, it is easy and safe to push that image to existing deployments, and to do so frequently and automatically. This leads to further benefits when images are scanned for vulnerabilities and compliance, as it is no longer necessary to scan machines: machines only need to be checked to ensure they have an approved image.

Operational Case Study Comparisons
----------------------------------

In one organisation with approximately 2,000 machines, a team of 30 engineers struggled (and mostly failed) to keep systems in sync and up-to-date. This organisation did not have a tool like **Dominator** available. When an urgent security patch needed to be distributed, a team of 5 engineers performed some heroic work to deploy the fix to the fleet. Much of their time was spent selecting batches of machines to upgrade, pushing the upgrade and verifying that the upgrade completed and that the machines were not broken by it. Due to drift between the machines, there was no confidence that the fix would work everywhere. In addition, updates were not certain to have completed. Each engineer was struggling to upgrade 400 machines.

In a well-known large Internet company, a system based on similar principles to **Dominator** is used to deploy significant changes to the entire fleet on a routine basis. The ratio of engineers (involved in pushing changes) to machines is 1:200,000. This dramatically better operational efficiency is due in part to a homogeneous operating environment, but also due to confidence in the testing and the reliability of the deployment system.

Non-goals
---------

The **Dominator** system is not intended to be used to maintain images used inside containers. Tools such as *Docker*, *Kubernetes* and *Spinnaker* provide effective life-cycle management of containers. **Dominator** is targeted at the infrastructure that hosts containers and at workloads which are not well suited to running inside containers. Best practice is to have a single binary inside a container, not an OS image. Using **Dominator** for updating small container images is overkill.

The **Dominator** system is one component of machine life-cycle management. Other components of life-cycle management are out of scope in this document, and the **Dominator** is not prescriptive in the choice of those other components. In addition, the **Dominator** does not set policy on machine management (i.e. it does not prescribe image content or rollout policies). Instead, it is a policy *enforcement* tool. To give context, below is a high-level view of machine life-cycle stages:

1. onboarding (discovery, bootstrapping)

2. allocation (which pool of machines, intended use)

3. policy choices (image selection, rollout speeds)

4. image deployment, updating and drift management (**Dominator**)

5. health monitoring and repair workflow

6. reallocation (go to 2)

7. off-boarding (deallocation, secure erasure, disposal)

Image life-cycle management is likewise a wider scope; **Dominator** may be a component of such a system (responsible for deployment), but it does not prescribe the image life-cycle management system.

High-level Design
=================

A key differentiator between the **Dominator** approach to configuration management and other systems is that the configuration and application are baked into an image along with the OS, and the image is pushed to the subject machines. Essentially, the configuration is pre-computed into an image and the image is pushed to a set of “dumb” nodes which only need to move files around and restart services. Since the nodes are dumb, the system is more reliable and repeatable compared to systems where the nodes have to perform a series of complex changes with arbitrary dependencies.

The system comprises the following components:

- an **Image Server** which stores images (file-system trees)

- a **M**achine **D**ata**B**ase (**MDB**) which lists all the machines in the fleet and the name of the *required* image that should be on each machine (an enhancement is a secondary *planned* image for each machine)

- a controller (master) system called the **Dominator**

- a slave agent on each machine in the fleet called the **sub**ject daemon (**subd**)

The following diagram shows how these components are connected:

The Image Server
----------------

This is a thin front-end to an object storage system back-end. It provides authentication, encryption and metadata services with a simple RPC over HTTP interface, which is available on port 6971, along with an HTML status page, built-in dashboards and internal metrics. All the RPC interfaces in the **Dominator** system transmit data in GOB (Go Binary) format. This is a self-describing data format that supports backwards compatibility in a manner similar to Google protobufs, providing a degree of decoupling between client and server.
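As an aside, the backwards-compatibility property of GOB can be demonstrated in a few lines of Go. The sketch below is illustrative only (the `pollRequestV1`/`pollRequestV2` types are invented for this example and are not taken from the Dominator source): a message encoded with a newer struct that has gained a field still decodes cleanly with an older struct, because unknown fields are simply ignored.
```
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
	"log"
)

// Hypothetical "new" request type: a client that has gained an extra field.
type pollRequestV2 struct {
	HaveGeneration uint64
	ShortPollOnly  bool // new field, unknown to older servers
}

// Hypothetical "old" request type, as an older server might still declare it.
type pollRequestV1 struct {
	HaveGeneration uint64
}

func main() {
	var buf bytes.Buffer
	// Encode with the newer type...
	err := gob.NewEncoder(&buf).Encode(pollRequestV2{HaveGeneration: 42, ShortPollOnly: true})
	if err != nil {
		log.Fatal(err)
	}
	// ...and decode with the older one: the unknown field is simply ignored.
	var request pollRequestV1
	if err := gob.NewDecoder(&buf).Decode(&request); err != nil {
		log.Fatal(err)
	}
	fmt.Println(request.HaveGeneration) // prints 42
}
```
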
Images are written to the **Image Server** (in compressed tar format), which decomposes each image into its unique files, which are stored as objects. If multiple images are stored where most of the files have the same contents, they are automatically de-duplicated. Thus, adding images which are mostly the same as existing images consumes little extra space.

Image names may be any valid POSIX pathname (the leading / is ignored). Images may be deleted, but a particular image name can never be re-used.

The MDB
-------

The **MDB** is the sole source of truth regarding which image (including configuration) is *required* to be on each machine. This makes it easy to gain a global view of the desired state of the fleet. This design also decouples *deployment and activation* from the *rollout policy and schedule*, which more easily allows for different policies for different groups or classes of systems and facilitates separation of powers. *Deployment and activation* is the scope of this document.

The **Dominator**
-----------------

The **Dominator** continuously polls all the **sub**s, the **Image Server** and the **MDB**, computes the difference between the desired and actual file-system state of each **sub**, and instructs each deviant **sub** to make corrections (add files, update file contents or metadata, remove files, etc.), telling it where to fetch object (file) data from. In addition, all configuration data (such as data transfer limits) are sent by the **Dominator**. The **Dominator** thus has global knowledge of which images are *currently* on which machines, and may be queried to determine the true state of the fleet.
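This convergence behaviour can be sketched in a few dozen lines of Go. The code below is a simplified illustration of the poll/compare/fetch/update cycle, not the actual Dominator code: the `MDBClient`, `ImageServerClient` and `SubClient` interfaces and their methods are hypothetical stand-ins for the real RPC clients.
```
package reconcile

import (
	"log"
	"time"
)

// Hypothetical interfaces standing in for the real RPC clients.
type MDBClient interface {
	RequiredImage(hostname string) (string, error)
}

type ImageServerClient interface {
	FileSystem(imageName string) (map[string]string, error) // path -> SHA-512
}

type SubClient interface {
	Hostname() string
	Poll() (map[string]string, error)                  // path -> SHA-512 of current state
	Fetch(objectServer string, hashes []string) error  // fetch missing objects into the cache
	Update(desired map[string]string) error            // nearly atomic activation
}

// ConvergeOnce drives a single sub one step closer to its required image.
func ConvergeOnce(mdb MDBClient, imgSrv ImageServerClient, objectServer string, sub SubClient) error {
	imageName, err := mdb.RequiredImage(sub.Hostname())
	if err != nil {
		return err
	}
	desired, err := imgSrv.FileSystem(imageName)
	if err != nil {
		return err
	}
	actual, err := sub.Poll()
	if err != nil {
		return err
	}
	// Compute which objects the sub is missing or has wrong.
	var missing []string
	for path, hash := range desired {
		if actual[path] != hash {
			missing = append(missing, hash)
		}
	}
	if len(missing) == 0 {
		return nil // already compliant
	}
	// Direct the sub to fetch into its private cache, then activate.
	if err := sub.Fetch(objectServer, missing); err != nil {
		return err
	}
	return sub.Update(desired)
}

// ConvergeLoop keeps every sub in compliance, forever.
func ConvergeLoop(mdb MDBClient, imgSrv ImageServerClient, objectServer string, subs []SubClient) {
	for {
		for _, sub := range subs {
			if err := ConvergeOnce(mdb, imgSrv, objectServer, sub); err != nil {
				log.Printf("%s: %v", sub.Hostname(), err)
			}
		}
		time.Sleep(time.Second)
	}
}
```
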

The **Dominator** has configurable global rate limits such as:

- the percentage of machines (in a cluster) that may be in the rebooting state at any time

- the percentage of network bandwidth available for file transfers to/from **sub**s

- the percentage of local I/O bandwidth available for scanning files on the **sub**

- the percentage of local I/O bandwidth available for writing fetched files on the **sub**

- changes in health check failure rates and their correlation with updates

The **Dominator** continuously drives **sub**s into compliance, and thus corrects and updates running machines, new (*birthed*) machines and machines which have returned from the dead, all using the same mechanism. The **Dominator** presents an HTML status page on port 6970. This page contains links to various built-in dashboards. The same port is used to publish internal metrics.

The subs
--------

The **sub**s continuously scan their local file-systems and construct a representation of the file-system. This representation may be queried using a **poll** RPC request. The **sub**s present an HTML status page on port 6969, publish internal metrics and present an RPC over HTTP interface on the same port. They obey a small set of RPC commands: **respond to poll**, **fetch files**, **update**, **cleanup**, **get files**, **get configuration**, **set configuration** and **check health**. The **sub**s have no knowledge of packaging systems.

A **fetch from peer** RPC could support peer-to-peer file transfers. See the section below on performance targets, which shows that this optimisation is not needed for mid-sized clusters (say 10,000 machines). Larger clusters (say 100,000 machines) may benefit from this optimisation.
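For illustration, the command set listed above can be pictured as a client-side interface along the following lines. The names, signatures and types below are invented for this sketch and are not the project's actual RPC definitions.
```
package subapi

// Sub sketches the small command set a sub obeys, as seen from a client such
// as the Dominator. The method names and signatures are illustrative only;
// the real RPC definitions live in the Dominator source tree.
type Sub interface {
	Poll(haveGeneration uint64) (PollResponse, error)
	Fetch(serverAddress string, hashes []string) error // into the private object cache
	Update(request UpdateRequest) error                // nearly atomic activation
	Cleanup(hashes []string) error                     // discard unneeded cached objects
	GetFiles(paths []string) ([][]byte, error)
	GetConfiguration() (Configuration, error)
	SetConfiguration(config Configuration) error
	CheckHealth() error
}

// PollResponse, UpdateRequest and Configuration are placeholder types.
type PollResponse struct {
	GenerationCount uint64
	FileSystem      map[string]string // path -> SHA-512, only sent if the generation changed
}

type UpdateRequest struct {
	PathsToDelete []string
	FilesToCopy   map[string]string // destination path -> object hash
	Triggers      []string          // services to (re)start
}

type Configuration struct {
	NetworkSpeedPercent uint64 // limit on network bandwidth for fetches
	ScanSpeedPercent    uint64 // limit on local I/O bandwidth for scanning
}
```
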

Configuration Management
========================

Unlike other configuration management and deployment systems, which group configuration file changes and package installation/upgrade/removal into bundles that are pushed out to sets of machines, the **Dominator** system separates configuration and package bundling from deployment. Instead, an image is created (separately from the **Dominator** system) which includes all the desired packages and configuration files for a class of machines, and that image is pushed to all machines of that class. This approach is based on the observation that large, efficiently managed fleets have only a few classes of machines. If you have many different classes of physical machines in your fleet, *you’re doing it wrong*, and *you probably should be using Containers*.

The **Dominator** system supports an arbitrary number of machine classes (since each machine may have a different image), but pushes the pain and complexity to where it belongs: a dedicated configuration system and the people who choose how many different machine classes they want.

The **Dominator** should be able to push updates at the same speed regardless of how many different machine classes (images) need to be pushed, since each machine has just one *required* image. Thus, one could build images for OpenStack compute nodes, Kubelet minions, Ceph storage nodes and Analytics nodes and decide the role of each machine by assigning the appropriate image to it in the **MDB**.

A consequence of the image-based deployment system is that integration testing is much simpler to manage than in systems where a “base” image is deployed and then different machines are configured to receive different collections of extra packages. Those systems can lead to an impractically large configuration matrix, in which many different combinations must be tested and it is difficult to grasp the disposition of the fleet.

With the image-based system, testing is clear: the image is tested. Once that is done, the image can be deployed to a large number of machines with confidence, knowing that there will not be a difficult-to-understand collection of packages added on top, which may have cross-dependency problems or incompatibility issues with the base image.

The Life of an Update
=====================

Consider updating the image on a single machine. The following steps are taken:

- the new *required* image is written to the **MDB** record for the machine

- the **Dominator** computes the *required* file-system state based on the image name recorded in the **MDB** and the results of the last poll of the **sub**

- the **Dominator** directs the **sub** to fetch any files it needs from the **Image Server** and store them in a private cache, provided the **sub** has sufficient space available

- the **Dominator** re-polls the **sub** and, if all the required files are available, directs it to perform the update

- the **sub** performs a nearly atomic update of the system, ensuring that the time the system spends in an inconsistent state is minimal (typically a small fraction of a second):

  - move the fetched files to their desired locations

  - delete unwanted files

  - restart any affected daemons, perform a health check and return the completion status and result to the **Dominator** in the next **poll** response (if the kernel was not changed)

  - reboot the machine (if the kernel was changed)

- the **Dominator** continues to poll the **sub** for its file-system state and health check results

From the perspective of the **Dominator** system, performing an update is the same as keeping machines in compliance. The only difference is that the number of files to fetch and change is usually larger with an update.

Note that a key property of this design is the nearly atomic update of the **sub**. This is fundamentally different from many other configuration change systems, which typically push higher-level commands to update packages and configuration files, requiring the target machines to perform more work and resolve package dependencies during the update. That approach leaves the machine in an inconsistent state for longer, even in the best case where all the changes can be performed and the dependencies met in a single update run. In practice, subtle dependency problems, network interruptions or problems with the package repository can cause an update to fail, leaving the system partially updated until the next update run comes along.

Kernels, Images and Firmware
============================

The **Dominator** system can handle not just system images but also kernels and firmware. The data component of kernels and firmware is the same as for system images: files in the file-system.
The only difference is their *activation*:

- system images are activated by moving the files into place and restarting any daemons that depend on the changed files

- kernels are activated by rebooting the machine

- firmware (BIOS, network controller, etc.) is activated by writing to special device files and restarting the device and device driver, or by rebooting the machine

These differences are just configuration details to the **Dominator** system.

Component Details
=================

Image Server
------------

The **Image Server** processes *add image* and *add objects* RPCs and saves the image and object data to the local file-system. An image is a representation of a file-system tree, with the file data encoded into the image structure as SHA-512 checksums. The file checksums are used as object IDs. Thus, storing multiple copies of the same or similar image only requires storing the (small) image representation each time; the unique file data are stored only once, as objects.

Image names may be any valid POSIX-like pathname (the leading ‘/’ is ignored). Directories may be created and owner groups assigned to delegate image management of subtrees in the namespace to different teams.

The **Image Server** responds to *get image*, *list images* and *get objects* RPCs from the **Dominator** and **subd**.

An **Image Server** can be “slaved” to another **Image Server**. This is used for image replication between servers. Any (non-cyclic) topology can be constructed. Image replication is secured with TLS and access controls. An example Content Distribution Network is shown in Appendix 2.

An operations guide for the **Image Server** is available [here](https://github.com/Cloud-Foundations/Dominator/blob/master/cmd/imageserver/README.md).

Subd
----

The **subd** component bears some similarity to rsync, scp, BitTorrent and other file distribution technologies, but it also performs carefully orchestrated *activation* functions, and thus a special-purpose daemon is best suited. Key properties of **subd** are:

- rate-limited scanning of the local file-system

- rate-limited network transfers

- authenticated RPCs

- directed file transfers (i.e. it’s told where to fetch files from)

- authenticated file transfers

- separate *fetching* and *activation* phases

- nearly atomic *activation* (to avoid inconsistencies during transitions)

- health checking (to limit the damage done by the push of a bad change)

- the ability to detect file corruption through constant checksum scanning of the file-system (SHA-512 is employed). Some examples of file corruption that evades detection by mechanisms such as inotify/fanotify but is detected with checksum scanning:

  - hardware errors

  - firmware bugs

  - kernel bugs

  - malicious/exploit code (either directly writing to the block device or memory, or leveraging kernel bugs)

In summary, **subd** provides safe, secure (slow or fast) *fetching* and fast *activation*. Some of the properties mentioned are covered in more detail below. An operations guide for **subd** is available [here](https://github.com/Cloud-Foundations/Dominator/blob/master/cmd/subd/README.md).

### Rate-limited scanning

The file-system scanning is rate limited to reduce the I/O impact of **subd**. The performance of the storage device is benchmarked the first time **subd** is started, and further reads from the storage device are limited by default to 2% of the benchmarked speed. By limiting to 2%, **subd** has negligible impact on system workload. Note that if the entire file-system being scanned fits inside the page cache of the machine, then few or no transfers from the storage device are required, and thus **subd** will scan the file-system at maximum speed. If there is insufficient memory to cache the entire file-system, then **subd** will detect that it is reading from the storage device and will automatically reduce speed. Furthermore, in environments where there is competition for memory, the rate limiting performed by **subd** will in turn reduce the memory pressure it places on the system. This process is completely automatic. In summary, **subd** will consume more system resources on an idle machine and a small trickle of resources on a busy machine.
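A minimal sketch of this kind of rate-limited checksum scanning is shown below. It is not the code subd uses; it simply assumes a previously benchmarked `maxBytesPerSecond` figure and sleeps whenever reading gets ahead of the configured percentage of that rate.
```
package scanner

import (
	"crypto/sha512"
	"io"
	"os"
	"time"
)

// checksumRateLimited hashes the file at path with SHA-512 while keeping the
// sustained read rate at roughly (percent/100) * maxBytesPerSecond, where
// maxBytesPerSecond is assumed to come from an earlier benchmark of the
// storage device.
func checksumRateLimited(path string, maxBytesPerSecond, percent uint64) ([]byte, error) {
	limit := maxBytesPerSecond * percent / 100
	if limit < 1 {
		limit = 1
	}
	file, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer file.Close()
	hasher := sha512.New()
	buffer := make([]byte, 64<<10) // read in 64 KiB chunks
	start := time.Now()
	var total uint64
	for {
		n, err := file.Read(buffer)
		hasher.Write(buffer[:n])
		total += uint64(n)
		if err == io.EOF {
			return hasher.Sum(nil), nil
		}
		if err != nil {
			return nil, err
		}
		// If reading is ahead of budget, sleep until the average rate drops
		// back to the configured limit.
		allowed := time.Duration(float64(total) / float64(limit) * float64(time.Second))
		if elapsed := time.Since(start); elapsed < allowed {
			time.Sleep(allowed - elapsed)
		}
	}
}
```
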

### Scanning versus file notification

File scanning and checksumming is more robust than file change notification (i.e. *dnotify*, *inotify* and *fanotify*), as it can detect file changes due to storage media corruption, file-system corruption, memory errors, kernel bugs, direct writes to the underlying block device, intrusions and so on. File notification systems are not able to detect these types of changes. For these reasons, file scanning and checksumming is *essential*, whereas file notification can at best serve as an *optional optimisation* which may be used to reduce the time to detection of file changes.

### Rate-limited network transfers

Similarly to the file-system scan rate limiting, **subd** will benchmark the network capacity and will limit network file fetches by default to 10% of the capacity. This limit is not dynamically adjusted, although it is configurable.

Security: Authentication and Authorisation
==========================================

Connections between the components may be secured using SSL/TLS 1.2 or later. The components will switch to and require secure mode if provided with SSL certificates and keys. In secure mode, clients must authenticate to the servers in order to access secured RPC endpoints. Servers must present a valid and trusted certificate to the clients.

A client is permitted to call an RPC endpoint if it has the `Service.Method` name present in a comma-separated list in the Common Name section of the certificate subject. The certificate must be signed by the Certificate Authority (or a trusted intermediate) that the server trusts. This approach allows creating certificates for service accounts with the required powers (e.g. the **Dominator** would have `Subd.*` power) and certificates for different users who may have different powers (e.g. some users may only have `ImageServer.AddImage,ObjectServer.AddObjects` powers).

Configuration
=============

MDB Data
--------

The **Dominator** requires a list of machines and their *required* and *planned* images. It obtains this information from an **MDB**. The **Dominator** does not directly contact the **MDB**; instead it reads a file with a simple JSON format that contains the information it requires. The **Dominator** periodically checks the metadata (size, mtime, inode number, etc.) for this file and, if it has changed, reads the new file. A separate daemon or cron job reads from the **MDB** and writes out a new file. The writer of this file should write the data to a temporary file on the same file-system and then atomically move the file to the desired location (`/var/lib/Dominator/mdb` by default). This ensures the **Dominator** never reads partial files.
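A minimal sketch of such a writer is shown below (the `fetchMachinesFromMDB` helper is a hypothetical stand-in for querying whatever MDB is in use). The important detail is writing to a temporary file on the same file-system and renaming it into place, since the rename is atomic.
```
package main

import (
	"encoding/json"
	"log"
	"os"
	"path/filepath"
)

// Machine mirrors the JSON format the Dominator reads (see the example below).
type Machine struct {
	Hostname      string
	RequiredImage string
	PlannedImage  string `json:",omitempty"`
}

// fetchMachinesFromMDB is a stand-in for querying the real MDB; replace it
// with whatever source of truth your site uses.
func fetchMachinesFromMDB() ([]Machine, error) {
	return []Machine{
		{Hostname: "mailhost", RequiredImage: "mail.0", PlannedImage: "mail.1"},
	}, nil
}

func writeMdbFile(destination string, machines []Machine) error {
	data, err := json.MarshalIndent(machines, "", "    ")
	if err != nil {
		return err
	}
	// Write to a temporary file on the same file-system, then rename: the
	// rename is atomic, so the Dominator never sees a partially written file.
	tmpFile, err := os.CreateTemp(filepath.Dir(destination), ".mdb-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmpFile.Name()) // harmless no-op once the rename succeeds
	if _, err := tmpFile.Write(data); err != nil {
		tmpFile.Close()
		return err
	}
	if err := tmpFile.Close(); err != nil {
		return err
	}
	return os.Rename(tmpFile.Name(), destination)
}

func main() {
	machines, err := fetchMachinesFromMDB()
	if err != nil {
		log.Fatal(err)
	}
	if err := writeMdbFile("/var/lib/Dominator/mdb", machines); err != nil {
		log.Fatal(err)
	}
}
```
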
Below is an example JSON file containing three machines:
```
[
    {
        "Hostname": "mailhost",
        "RequiredImage": "mail.0",
        "PlannedImage": "mail.1"
    },
    {
        "Hostname": "compute0",
        "RequiredImage": "ubuntu-14.04",
        "PlannedImage": "ubuntu-14.05"
    },
    {
        "Hostname": "compute1",
        "RequiredImage": "centos-6",
        "PlannedImage": "centos-7"
    }
]
```
It is safe to not specify `PlannedImage`. It is also safe to include extra fields (they will be ignored).

Images
------

Images are uploaded to the **Image Server** using the `imagetool` utility. This utility performs many image-related commands, including *add image*. The syntax is:
```
imagetool add name imagefile filterfile triggerfile
```
The `name` parameter is the desired name of the image.

The `imagefile` parameter is the path to a (possibly compressed) tar file containing the file-system image.

The `filterfile` parameter refers to a file which contains lines. Each line contains a regular expression which specifies files that should be excluded from the image and which will *not* be updated on the **sub**. Below is an example filter file:
```
/tmp/.*
/var/log/.*
/var/mail/.*
/var/spool/.*
/var/tmp/.*
```
The `triggerfile` parameter refers to a JSON-encoded file containing a list of *triggers*. A *trigger* is a rule that will (re)start a specified service if any of a set of files (defined by regular expression pattern matches) are changed during an *update*. Below is an example trigger file:
```
[
    {
        "MatchLines": [
            "/etc/ssh/.*",
            "/usr/sbin/sshd"
        ],
        "Service": "ssh",
        "HighImpact": false
    },
    {
        "MatchLines": [
            "/etc/cron[.]*",
            "/usr/sbin/cron"
        ],
        "Service": "cron",
        "HighImpact": false
    },
    {
        "MatchLines": [
            "/lib/modules/.*",
            "/boot/vmlinuz-.*"
        ],
        "Service": "reboot",
        "HighImpact": true
    }
]
```
Services are (re)started by passing the name of the service and either `start` or `stop` to the service utility.

The `HighImpact` field is used to tell the **Dominator** that (re)starting the service will have a high impact on the machine (such as a reboot). The **Dominator** can use this to limit the number of high-impact changes performed at a time (e.g. to enforce a policy that no more than 100 machines at a time will be rebooted, it will wait for machines to come back up before rebooting more).
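To make the trigger semantics concrete, below is a minimal sketch of how a trigger file of this form could be loaded and applied. It is not the actual subd implementation; the matching and service handling are reduced to the essentials described above.
```
package triggers

import (
	"encoding/json"
	"os"
	"os/exec"
	"regexp"
)

// Trigger mirrors one entry in the trigger file shown above.
type Trigger struct {
	MatchLines []string
	Service    string
	HighImpact bool
}

// Load reads and parses a JSON trigger file.
func Load(path string) ([]Trigger, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var triggers []Trigger
	if err := json.Unmarshal(data, &triggers); err != nil {
		return nil, err
	}
	return triggers, nil
}

// Match returns the triggers whose patterns match any of the changed paths.
func Match(triggers []Trigger, changedPaths []string) ([]Trigger, error) {
	var matched []Trigger
	for _, trigger := range triggers {
		hit := false
		for _, pattern := range trigger.MatchLines {
			re, err := regexp.Compile("^" + pattern + "$")
			if err != nil {
				return nil, err
			}
			for _, path := range changedPaths {
				if re.MatchString(path) {
					hit = true
					break
				}
			}
			if hit {
				break
			}
		}
		if hit {
			matched = append(matched, trigger)
		}
	}
	return matched, nil
}

// Run restarts each matched service by passing stop and then start to the
// service utility (a real implementation would treat the special "reboot"
// service differently).
func Run(matched []Trigger) error {
	for _, trigger := range matched {
		if err := exec.Command("service", trigger.Service, "stop").Run(); err != nil {
			return err
		}
		if err := exec.Command("service", trigger.Service, "start").Run(); err != nil {
			return err
		}
	}
	return nil
}
```
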

Advanced Features
=================

Computed Files
--------------

Each system has three classes of files:

- those which should be the same on every machine and thus should be managed by the **Dominator**

- those which are unique per machine and do not need to be managed/updated centrally (system logs in `/var/log` and hardware-specific data such as in `/etc/fstab`) and are thus excluded from **Domination**

- those which are common to groups of machines but may not be common across the entire fleet, yet follow a pattern and thus may be computed. These are *computed files* and are discussed here.

Consider the `/etc/resolv.conf` file, which contains the IP address of the DNS server. In typical global fleets there are multiple datacentres, each with its own DNS server (a single global DNS server would have poor performance characteristics). In environments where it is not feasible to set up a Virtual IP address for the DNS server (where the network directs DNS query traffic to the nearest DNS server), every machine in a datacentre shares the same file, but the file differs between datacentres, so it cannot be baked into a common image. This is a good use-case for computed files, where the **Dominator** is configured to compute the contents of this file based on the datacentre.

Another example is the `/etc/ssl/CA.pem` file, which **subd** uses to determine which **Dominator** to trust for change RPCs. It may be necessary to establish different “zones of trust” within an organisation. Consider a Hybrid Cloud environment where an organisation has private (internal) cloud infrastructure as well as using Public Cloud providers. Some shared trust is likely to be desirable, such as allowing an **Image Server** in the Public Cloud to be configured as a replication slave of an **Image Server** in the internal cloud. This arrangement is useful for distributing images across the full Hybrid Cloud environment. However, it may not be desirable to use the same trust zone for updates, since an organisation may not want a **Dominator** in a Public Cloud to be able to control the contents of machines in its internal cloud. This is where computed files are useful. The `/etc/ssl/CA.pem` file can be configured to be a computed file, so different contents can be pushed to **sub**s in different trust zones, and the **Dominator**s will have different keys which are trusted in different trust zones. The trust zones for updates can be completely separated or (more usefully) a certificate signed by the “internal cloud” CA can be trusted everywhere, whereas a certificate signed by the CA for a particular Public Cloud provider is trusted only in that zone.

A special case of computed files is files which are unique per machine yet can (or should) be centrally managed and distributed. An example of this is machine certificates. One approach is to exclude these files from **Domination** and have the machine **Birther** generate and place these files. This approach makes it difficult to revoke/replace certificates. Alternatively, these certificates could be computed by the **Dominator** (perhaps by calling a certificate generation service) and then distributed by the **Dominator**. If necessary, a fleet-wide certificate update could be performed in seconds, with the limiting factor likely being the speed at which certificates can be generated.

Computed files are generated by instances of the **filegen-server**. When an image is constructed, some files in the image may be marked as computed files, with the source of the file data being the address (hostname:port) of a **filegen-server**. A single image may contain multiple computed files sourced from different **filegen-server**s. When the **Dominator** pushes an image to a **sub**, it will query the appropriate **filegen-server** to obtain file data for any computed files and push that data to the **sub**.

In the above example of using the **Dominator** to distribute machine certificates, a **filegen-server** could be deployed which generates these certificates on demand.
The **filegen-server** API supports the concept of a time limit on the validity of file data and will automatically regenerate file data that have “expired”, which will in turn lead to the **Dominator** pushing updates to the **sub**s. An example configuration would be to generate certificates with 24-hour lifetimes and mark the file data as valid for 12 hours. The certificate for each **sub** would be re-generated and distributed every 12 hours. Provided the **Dominator** system was not down for over 12 hours, every **sub** would always have a valid certificate.

The diagram below shows multiple **filegen-server**s and their communication paths.

Planned Images
--------------

This mechanism supports a controllable “preload” of image data, which is useful for automated image build systems. Consider the following sequence:

- the image build system produces a new image

- the image build system updates some or all of the *planned* image entries in the **MDB**. It has access rights to do this, since this is a safe operation. It may not have access rights to update the *required* image entry for all machines, as that is a potentially unsafe operation or may violate rollout policies

- a regression test is started on a pool of test machines. The image build system has update access rights to the *required* image field for these machines

- once the regression test passes, the image build system updates the *required* image field for a larger set of canary machines (which it has access rights to), such as a single datacentre

- the new image will probably be preloaded on all the canary machines by the time the regression test completes, so it can be deployed as quickly as policy allows

- once the defined canary time has passed, the new image can be pushed globally, again as fast as policy allows. A different system (or person) may have the access rights to update the *required* image field for all machines

The *planned* image is only pushed if the *active* image is the same as the *required* image.

Fast, Secure re-Imaging
-----------------------

This section posits a [**Birther**](../MachineBirthing/README.md) system which is designed to leverage the **Dominator** to deploy images. This system is not implemented, so this section is currently a guide to how it would work.

When a machine is re-provisioned for a different purpose, it may be wise to *re-image* it (wipe the file-system and re-install). This is typically done by sending a machine back to the **Birther**, which already takes care of creating file-systems and installing the OS image. This is an expensive operation, as it requires fetching the full OS image across the network.

A **Birther** boot image can take advantage of the **Dominator** by copying the file-system contents to a tmpfs prior to repartitioning/reformatting file-systems and then moving the saved files to the newly created file-system inside the object cache directory maintained by **subd**. In general the new OS image will have many files (objects) in common with the old OS image, and thus the **Dominator** will generate only a small amount of network traffic to perform a full re-image of the machine.

This system is secure (i.e. the old incarnation cannot leave a trojan behind for the next incarnation) because old files will be fully checksummed as they are moved into the object cache, ensuring that only intended files are moved to the new file-system. In addition, the object cache is purged of unused objects. Since the process of converting the file-system into an object cache is performed while booted into the trusted **Birther** boot image, there is no running code from the old file-system which could subvert this security.

The **fs2objectcache** utility, which performs the conversion of a file-system into an object cache, has been implemented. The **Dominator** is ready to support this feature.

The new OS image can even be “pre-loaded” prior to re-provisioning by setting the *planned* image appropriately. This would facilitate even faster re-imaging, which is useful in environments where limiting downtime is critical.

Domination as a Service (DaaS)
==============================

The **Dominator** system is intended to be one component of a robust and scalable foundation for Cloud infrastructure (the “Undercloud”): the management of a fleet of physical machines. This system can also be used within a fleet of virtual machines, where a customer/tenant sets up one VM to run the **Dominator**, **Image Server** and **MDB**. An alternative is to use the same system used for the Undercloud for tenant VMs. This is termed **Domination as a Service** (**DaaS**), and has the following extra requirements:

- VMs are recorded in the **MDB**

- tenant networks are configured to allow traffic to/from the **Dominator** and **Image Server**

Calculating Performance Targets
===============================

File Fetching
-------------

As stated earlier, a single **Dominator** system and a single **Image Server** system should be able to manage at least 10,000 **sub**s. Let’s consider the following environment:

- a cluster of 10,000 machines

- 1 GB/s (10 Gb/s) network

- 1 GB system image

- SSD storage with 500 MB/s write throughput

In this environment, it would be possible to perform a complete system upgrade (such as when [**birthing**](../MachineBirthing/README.md) a machine) for a single machine in 2 seconds. When birthing many machines, the limiting factor is downloading the system image, as this is the largest component of network traffic. Thus, 3,600 machines per hour can be birthed *without any peer-to-peer enhancements*, with the limiting factor being bandwidth out of the **Image Server**.

A typical “large” system image upgrade changes less than 10% of the files on the system, which would require less than 100 MB of network traffic to each **sub**; this can be transferred in 0.1 seconds at maximum network speed. For such a change, 10 machines per second could be upgraded, which would be 1,000 seconds (under 17 minutes) for an upgrade of all 10,000 machines in the cluster. Again, this is *without any peer-to-peer enhancements*.

During normal operations, it is probably undesirable to consume the entire network bandwidth of a **sub** for system upgrades, even for 100 milliseconds, as this could affect customer jobs running on the machine. Policy will probably dictate that no more than 10% of the network bandwidth can be used for non-emergency pushes. This would increase the time taken by each **sub** to *fetch* its files to 1 second and thus increase the time to upgrade each **sub**, but since multiple **sub**s can fetch from the same **Image Server**, the time to upgrade the 10,000-machine cluster remains the same: under 17 minutes. This probably exceeds the maximum speed at which operations staff are comfortable upgrading an entire cluster (especially if the cluster contains more than a few percent of the global capacity).

Once again, the limiting factor is bandwidth out of the **Image Server**, and the speed at which the system can push updates probably greatly exceeds what operations staff would accept under normal circumstances.

A typical “small” system image update is around 1 MB (configuration files and a security fix for a package). The **Image Server** can distribute such updates to 1,000 machines per second, so a 10,000-machine cluster could be upgraded in 10 seconds. Once again, this is *without any peer-to-peer enhancements*. This is probably too fast except for emergency pushes.

The above examples show that it is more important to implement configurable rate limits for the **Dominator** system than to optimise pushing data around.
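The arithmetic behind these figures can be reproduced in a few lines of Go (the constants are the assumptions listed above, not measurements):
```
package main

import "fmt"

func main() {
	const (
		networkBytesPerSec = 1e9   // 1 GB/s (10 Gb/s) out of the Image Server
		imageBytes         = 1e9   // 1 GB system image
		largeUpdateBytes   = 100e6 // roughly 10% of the image changes
		smallUpdateBytes   = 1e6   // configuration files plus one package fix
		clusterSize        = 10000
	)
	// Full image (birthing): 1 machine/s, i.e. 3,600 machines/hour.
	fmt.Println("births per hour:", networkBytesPerSec/imageBytes*3600)
	// Large update: 10 machines/s, i.e. 1,000 s (~17 min) for the cluster.
	fmt.Println("large update, seconds for cluster:", clusterSize/(networkBytesPerSec/largeUpdateBytes))
	// Small update: 1,000 machines/s, i.e. 10 s for the cluster.
	fmt.Println("small update, seconds for cluster:", clusterSize/(networkBytesPerSec/smallUpdateBytes))
}
```
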

File Scanning
-------------

The file-system is continuously scanned by **subd**. The time it takes to complete a full scan determines how “fresh” its state information is, which in turn is a limiting factor in how quickly the **Dominator** system can correct files on deviant machines. A typical HDD can sustain a read speed of 50 MB/s, assuming mostly contiguous reads. At this speed a 1 GB file-system can be scanned in 20 seconds.

Continuously scanning the file-system at the maximum rate would interfere with I/O for jobs running on the system (unless the system image is stored on a dedicated device containing the root file-system), so the scanning rate is likely to be configured to be 2% of the maximum sustainable rate (1 MB/s for a typical HDD). Thus, a full file-system scan would take 1,000 seconds (under 17 minutes). A 17-minute delay from deviation to detection is probably sufficient for most environments.

SSDs are typically at least 10 times faster than HDDs (500 MB/s or more), so a 1 GB file-system scan at 2% of the maximum rate would take 100 seconds. If a 17-minute deviation-to-detection delay is too large, the root file-system should be placed on SSD.

Polling
-------

The **Dominator** continuously polls each **sub** in the fleet to determine its current file-system state. A 1 GB system image typically contains 40,000 files with an average filename length of 50 bytes and a checksum length of 64 bytes (SHA-512). The file-system state would thus consume 4.56 MB. Here the limiting factor is bandwidth into the **Dominator**, which can poll 219 **sub**s per second. In a 10,000 **sub** cluster, the **Dominator** would poll each **sub** every 46 seconds, which is well under the file-system scanning time. A single **Dominator** could scale to 219,000 machines with HDDs before polling became a limiting factor.

Polling speed is optimised: each **sub** stores a generation count of the file-system state, which increments when a scan yields a different file-system state from the previous scan. This generation count is reported in the **poll** response. The **Dominator** records the generation counter and includes it in the **poll** request. The **sub** will only provide the file-system state information if the generation counts differ. In most cases, the OS file-system state does not change between polls, so this is a very effective optimisation. The limiting factor would probably be the time taken to set up TCP connections and perform the TLS handshake. A single **Dominator** machine could likely handle polling of 1,000,000 or more **sub** machines.
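A minimal sketch of this generation-count optimisation is shown below. The types are illustrative only and are not the project's actual GOB-encoded RPC messages; the point is that a sub returns the full file-system state only when the caller's generation count is stale.
```
package polling

import "sync"

// PollRequest and PollResponse are illustrative stand-ins for the real
// RPC messages.
type PollRequest struct {
	HaveGeneration uint64
}

type PollResponse struct {
	GenerationCount uint64
	FileSystem      map[string]string // path -> SHA-512; nil when unchanged
}

// Sub holds the state a sub needs to answer polls cheaply.
type Sub struct {
	mutex      sync.Mutex
	generation uint64
	fileSystem map[string]string
}

// RecordScan is called after each file-system scan; the generation count is
// bumped only if the scan found a different state from the previous scan.
func (s *Sub) RecordScan(fileSystem map[string]string, changed bool) {
	s.mutex.Lock()
	defer s.mutex.Unlock()
	s.fileSystem = fileSystem
	if changed {
		s.generation++
	}
}

// Poll returns the full state only when the caller's generation is stale,
// keeping routine polls to a few bytes.
func (s *Sub) Poll(request PollRequest) PollResponse {
	s.mutex.Lock()
	defer s.mutex.Unlock()
	response := PollResponse{GenerationCount: s.generation}
	if request.HaveGeneration != s.generation {
		response.FileSystem = s.fileSystem
	}
	return response
}
```
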

Birthing Machines
=================

The **Dominator** system may be used to optimise the birthing of machines. The [**Birther**](../MachineBirthing/README.md) system would install a minimal payload on a machine (**subd** and an appropriate certificate authority file), start up **subd**, add the machine to the **MDB** and wait for the **Dominator** to install the system image, which completes the birthing process. [**Birthing**](../MachineBirthing/README.md) machines is the subject of a separate document.

Auditing, Compliance Enforcement and Intrusion Detection
========================================================

The **Dominator** system is clearly a compliance enforcement system, as it continuously forces systems into the *required* state. As seen above, with SSD storage, a **sub** with minor deviations can be forced back into compliance in under 2 minutes, with negligible performance impact on system workload.

Since the **Dominator** has knowledge of intended changes and the state of all **sub**s in the fleet, it can also be queried for auditing purposes and intrusion detection. For example, unexpected changes to the `/etc/passwd` file may indicate unapproved changes (users making changes outside the proper channels) or a possible intrusion attempt. By logging all changes made by the system (and the reason for the changes), global auditing and intrusion detection tools can easily be developed.

Operational Guidelines
======================

Below are some guidelines for reliable operations:

- size the root file-system to be twice the image size, as this will avoid updates being blocked due to lack of space

- keep logs and spool data on a separate file-system, so that out-of-control data cannot block updates (a working update system can be critical during emergency repairs)

Below are some best-practice guidelines:

- set the default network bandwidth limit for **sub**s to 10% of their capacity

- set the default local I/O bandwidth limit on **sub**s for scanning the root file-system to:

  - 2% of capacity if stored on a shared device

  - 50% of capacity if stored on a dedicated device

- restrict write access to the **MDB** to trusted services which can enforce policies such as:

  - a minimum time to upgrade a cluster (by limiting the rate at which the *required* image name fields can be changed), to give operations staff time to notice a disaster and hit the emergency brake

  - do not update an image name field if the image has been tagged as deprecated

Implementation
==============

The **Dominator** is written in the [Go](https://www.golang.org/) programming language. It is an Open Source project hosted on the [Cloud-Foundations/Dominator](https://github.com/Cloud-Foundations/Dominator) page at [GitHub](https://www.github.com/). Contributions are welcome. A short [fact sheet](FactSheet.md) and [architectural overview](ArchitecturalOverview.md) are available.

Release Milestones
==================

Below are the anticipated releases:

Version 0: Minimum Viable Product (Released 8-Nov-2015)
-------------------------------------------------------

This is the minimum needed to be able to push images, update machines and keep them in compliance. The following features are not expected to be available:

- **authentication and authorisation of RPCs**

- integration with a real MDB

- scaling and performance optimisations

- nice dashboards

- deletion of images

- storing object data in an object storage system such as Ceph

- Glance integration

The most serious limitation is the lack of RPC security (authentication). It was omitted from **v0** in order to speed development and provide an MVP to people who wish to experiment with and evaluate the technology.

Without RPC authentication, anyone with network access to your **sub**s can set up their own **Dominator** instance and use it to control your **sub**s.

Version 1: Security (Released 10-Dec-2015)
------------------------------------------

This will add authentication and authorisation for the sensitive RPCs (in particular, the Update() RPC). The **sub**s will have a Certificate Authority file which they can use to validate that Update() RPCs come from the authorised **Dominator**. This addition will make the system safe to deploy and use.

Version 2: Integration with a real MDB (Released 30-Jan-2016)
-------------------------------------------------------------

This will add integration with an MDB implementation (most likely the implementation adopted at Symantec, where most of the development is being done). This represents an important milestone for Symantec internal use, but other users may prefer to write a simple script to interface to their particular MDB implementation. The **Dominator** reads a simple JSON file from the local file-system, so generating **Dominator**-compatible MDB data is trivial.

Version 3: Computed Files (Released 19-Mar-2016)
------------------------------------------------

This will add support for *computed files* as described above.

Version 4: Scaling and Performance Optimisations
------------------------------------------------

The MVP may not be as lean and efficient as desired, which could be an issue for herds with many thousands of machines.

Version 5: Nice Dashboards
--------------------------

The MVP will come with some basic dashboards. After some operational experience, it is anticipated that an improved set of dashboards will be designed and implemented.

Version 6: Storage Improvements
-------------------------------

Over time an installation will begin to fill up the local storage capacity of the system hosting the **Image Server**. Safely deleting unused images and performing garbage collection will be simple improvements that should help a lot. If needed, further improvements such as using Ceph for object storage may be implemented.

For an environment where deep OpenStack integration is needed, it may be useful to integrate with Glance.
For example, if there is an image release pipeline where images are moved from one system to the next as they pass the various qualification, testing and deployment stages, images may flow directly between Glance and the **Dominator** system (although the direction in which images flow would depend on the design of the whole release pipeline).

Appendix 1: subd Performance Data
=================================

The following table presents data on performance impact measurements for subd. The sysbench benchmarking tool was used to measure CPU, I/O and memory performance without subd running, and then again with subd running. All the results show the time taken to perform a benchmark (in seconds). In all cases, the file-system is backed by EBS. Lower numbers are better.

| **Machine Type** | **CPU without subd** | **CPU with subd** | **File I/O without subd** | **File I/O with subd** | **Memory without subd** | **Memory with subd** |
|------------------|----------------------|-------------------|---------------------------|------------------------|-------------------------|----------------------|
| AWS t2.micro     | 11.4586              | 11.6055           | 32.6241                   | 32.6238                | 70.8254                 | 70.3938              |
| AWS t2.medium    | 11.5638              | 11.5633           | 42.998                    | 32.8201                | 73.9858                 | 74.1603              |
| AWS m3.medium    | 22.8409              | 24.9580           | 55.2227                   | 55.3405                | 135.5237                | 149.6894             |
| AWS c4.xlarge    | 9.7052               | 9.7700            | 22.3376                   | 22.3665                | 61.6101                 | 62.1307              |
| AWS d2.xlarge    | 11.8251              | 11.5455           | 23.4539                   | 23.3757                | 75.0767                 | 75.0598              |

As can be seen from these results, in general the performance impact of subd on the simulated workload (sysbench) was negligible.

The exception was the m3.medium machine type, where there was a ~9% impact on CPU-intensive and memory operations (memory and CPU have similar contention behaviour). The reason only this instance type showed a measurable impact is that the machine has a single CPU, so there is some contention for the CPU. Since subd runs at a lower priority (nice +15) than the simulated workload, it receives ~8% of the CPU resources.

This contention did not occur on the t2.micro machine, which also has a single CPU, because the t2.micro machine has insufficient RAM to hold the root file-system in the page cache. Scanning the file-system therefore requires accessing the underlying media, which triggers the I/O rate limiting in subd, so it spends most of its time sleeping in order not to exceed 2% of the I/O capacity.

There was one anomalous measurement, where file I/O on the t2.medium machine was faster with subd running than without. This can be explained by natural variation in the performance of EBS.

Auto-Scaling
------------

When subd is CPU-bound (this can occur when the root file-system fits into unused RAM and there is little memory pressure on the system), it has the potential to affect the auto scaling behaviour of AWS instances. Specifically, if the auto scaling group is configured to launch new instances (“scale out”) when the CPU utilisation exceeds a defined threshold *and* the CPU-bound subd causes the CPU utilisation to exceed this threshold, then new instances will continue to be launched until the maximum instance limit is reached. This can lead to unnecessary resources being consumed. A series of tests was performed with a variety of instance types, using an auto scaling group configured to launch new instances if CPU utilisation exceeded 80%.
The following results were observed (and are expected given an understanding of the instance types):

- t2.micro (burstable) instances. In this case, the root file-system (~1.5 GB) did not fit into RAM (1 GiB), so subd used < 1% CPU on average. No extra instances were launched by the auto scaler

- t2.small (burstable) instances. In this case, the CPU consumption of subd caused other instances to be launched and later terminated. This is because each instance burst to 100% CPU utilisation for ~20 minutes, which caused another instance to be launched. After ~20 minutes, the instances consumed their initial CPU credits and were throttled, which in turn lowered their CPU utilisation and the auto scaler terminated (“scaled in”) instances

- m3.medium instances. In this case, the auto scaler continued to launch new instances until the maximum instance limit was reached. This instance type has a single vCPU and thus the CPU utilisation was 100%, which triggered the auto scaler

- m3.large and larger instances. In this case, the auto scaler did not create new instances. This is because these instances have multiple vCPUs and subd consumed 100% of a single vCPU. Thus, for the m3.large instance, the CPU utilisation was 50%, which was under the scale-out threshold

The first lesson from these experiments is that burstable instances should not be used with auto scaling, as the throttling behaviour conflicts with the auto scaling. The second lesson is that auto scaling should be used with care: the behaviour should be measured with small limits before opening the floodgates. The third lesson is that CPU utilisation is an inadequate metric for deciding whether to scale out, as it does not reflect the performance of the workload. A better measure would be application latency or throughput.

In summary, for most instance types, subd should be safe with auto scaling, as it will use at most 100% of a single vCPU. With the exception of the m3.medium instance type, which has only one vCPU, this utilisation will be at most 50% of total CPU. Even on an m3.medium instance, the application will likely consume sufficient RAM that the root file-system cannot fit into the remaining RAM, and thus subd will rate limit its I/O and use a tiny amount of CPU.

Appendix 2: Image Content Distribution Network
==============================================

The diagram below is an example of a real-life network of **Image Server**s which distribute content globally. In this example, the content builder is in an on-premises facility in US East. It uploads new images to the nearby **Image Server**, which sends any new objects to its master (upstream **Image Server**), which in turn sends them to its master (in this example the root master). The (root) master **Image Server** is responsible for checking the uniqueness of objects (i.e. hash collision detection) and images.

The thick arrow indicates that the entire uploaded image is transmitted to the nearby **Image Server** (required for detecting hash collisions). The other arrows are thin, representing that only new objects are transmitted (it is typical for an image upload to contain fewer than 1% new objects).

This architecture ensures that, even if a content builder has a slow link to the rest of the network, injecting images can still be very fast provided there is a nearby **Image Server** replica with a high-speed connection to the content builder.