# GCE Backend

A backend package for the GCE GAE app. It consists of independent, idempotent
cron jobs that trigger independent, idempotent task queues, which attempt to
move the real-world state of GCE instances closer to their configured state.
The cron jobs and task queues are fault tolerant: failures do not generally
cause inconsistent state, so the cron jobs can trigger the task queues again
later. This means transient failures, such as datastore or network outages and
insufficient permissions or quota, only cause failures in the backend package
as long as they remain unresolved. Once the issues are resolved, the backend
package should recover without intervention.

## Terminology

### Config

A Config is a datastore entity representing a configured type of VM. Creation of
Configs is outside the scope of the backend package. Configs are mutable and may
be created, updated, or even deleted at any time, and the backend package will
react accordingly.

### VM

A VM is a datastore entity representing a single configured VM, derived from a
Config. [expandConfig](#expandConfig) is responsible for the derivation. VMs
are mutable, but should only be modified by the backend package. To make changes
to a VM, modify its corresponding Config and the backend package will propagate
the changes. The Config:VM mapping is 1:n.

### GCE Instance

A GCE instance is a live virtual machine running in Google Compute Engine. An
instance is created from a VM by [createInstance](#createInstance). Instances
are immutable. Changes made to a VM will only be reflected when creating a new
instance. The VM:instance mapping is 1:1.

### Swarming Bot

A Swarming bot is the Swarming server's view of a connected instance. Instances
automatically register themselves as bots of a particular Swarming server
outside the scope of the backend package. Bots may freely be terminated or
deleted from the Swarming server and the backend package will react accordingly.
The instance:bot mapping is 1:1.

### Deadline

The deadline is how long an instance may live. An instance's deadline is
derived from the lifetime in the Config and the instance creation time. Once the
deadline is up, the backend package will attempt to replace the instance after
it finishes its current Swarming workload. Replacing the instance is how changes
to VMs are picked up, since instances are immutable.

### Drained

A drained VM is one scheduled for deletion because the Config has been altered
to have its number of VMs decreased. A drained VM will be deleted once its
corresponding instance has been deleted. A drained Config is one scheduled for
deletion by some external factor. All VMs of a drained Config will be drained. A
drained Config will be deleted once its corresponding VMs have been deleted.

## Cron Jobs

All cron jobs operate on multiple entities, triggering task queues which operate
on a particular entity. All cron jobs are idempotent.

### expandConfigsAsync

expandConfigsAsync iterates over all Configs and triggers
[expandConfig](#expandConfig) for each.

### createInstancesAsync

createInstancesAsync iterates over all VMs which have no corresponding instance
and triggers [createInstance](#createInstance) for each.

### manageBotsAsync

manageBotsAsync iterates over all VMs which do have a corresponding instance and
triggers [manageBot](#manageBot) for each.

## Task Queues

All task queues are triggered with a particular entity to process. All task
queues are idempotent.

### expandConfig

expandConfig receives a single Config to expand. It checks how many VMs the
Config declares and triggers [createVM](#createVM) for each.

### createVM

createVM receives a single VM to create. It creates the VM if it doesn't exist.

### createInstance

createInstance receives a single VM to create an instance for and attempts to
idempotently create it. Instance creation in GCE is asynchronous, so the backend
package calls createInstance repeatedly until the instance is detected as
created, and then records it. If creation has already started for a
[drained](#drained) VM, it is allowed to complete, but no new creation is
started in GCE for a drained VM.

### manageBot

manageBot receives a single VM to manage a bot for. It first checks whether the
Config referenced by the VM no longer exists or no longer references the given
VM, and [drains](#drained) the VM if it isn't already drained. Next, it watches
the Swarming server for changes in the bot's state and reacts accordingly. If
Swarming reports that the bot has died or been deleted or terminated, it
triggers [destroyInstance](#destroyInstance). If the VM's deadline has been
exceeded or the VM is [drained](#drained), it triggers
[terminateBot](#terminateBot).

### destroyInstance

destroyInstance receives a single VM to destroy the created instance for and
attempts to idempotently destroy it. Instance deletion in GCE is asynchronous,
so the backend package calls destroyInstance repeatedly until the instance is
detected as destroyed, and then triggers [deleteBot](#deleteBot).

### deleteBot

deleteBot receives a single VM entity to delete the bot for. Bot deletion in
Swarming is synchronous, so this action is recorded immediately, which deletes
the VM.

### terminateBot

terminateBot receives a single VM to terminate the bot for and attempts to
terminate it. Termination in Swarming is asynchronous, so the backend package
calls [manageBot](#manageBot) repeatedly until the bot is detected as
terminated.