# GCE Backend

A backend package for the GCE GAE app. It consists of independent, idempotent
cron jobs which trigger independent, idempotent task queues. The task queues
attempt to move the real-world state of GCE instances closer to the configured
state of GCE instances. The cron jobs and task queues are fault tolerant:
failures do not generally cause inconsistent state, so the cron jobs can simply
trigger the task queues again later. This means transient failures such as
datastore or network outages and insufficient permissions or quota only cause
failures in the backend package as long as they remain unresolved. Once the
issues are resolved, the backend package should recover without intervention.
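
The recovery behavior described above can be illustrated with a small
simulation. This is only a sketch: `world`, `create`, and `tick` are
hypothetical names invented for the example, not part of the backend package,
and the in-memory map merely stands in for real GCE state.

```go
package main

import (
	"errors"
	"fmt"
)

// errTransient stands in for a datastore outage, missing permission, or
// exhausted quota.
var errTransient = errors.New("transient failure")

// world is a hypothetical stand-in for real GCE state.
type world struct {
	instances map[string]bool // instances that currently exist
	failures  int             // upcoming create calls that will fail
}

// create is idempotent: creating an existing instance is a no-op, and a
// failed call leaves the set of instances unchanged.
func (w *world) create(name string) error {
	if w.instances[name] {
		return nil
	}
	if w.failures > 0 {
		w.failures--
		return errTransient
	}
	w.instances[name] = true
	return nil
}

// tick plays the role of one cron run: it triggers one task per desired
// instance and returns how many of those tasks failed.
func tick(w *world, desired []string) int {
	failed := 0
	for _, name := range desired {
		if err := w.create(name); err != nil {
			failed++
		}
	}
	return failed
}

func main() {
	w := &world{instances: map[string]bool{}, failures: 2}
	desired := []string{"vm-0", "vm-1", "vm-2"}
	for i := 1; i <= 3; i++ {
		failed := tick(w, desired)
		fmt.Printf("tick %d: %d failed, %d instances exist\n", i, failed, len(w.instances))
	}
}
```

The first tick loses two tasks to the simulated outage. Because nothing is left
half-done, the second tick simply retries them and the third tick has nothing
left to do.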

## Terminology

### Config

A Config is a datastore entity representing a configured type of VM. Creation of
Configs is outside the scope of the backend package. Configs are mutable and may
be created, updated, or even deleted at any time and the backend package will
react accordingly.

### VM

A VM is a datastore entity representing a single configured VM, derived from a
Config. [expandConfig](#expandConfig) is responsible for the derivation. VMs
are mutable, but should only be modified by the backend package. To make changes
to a VM, modify its corresponding Config and the backend package will propagate
the changes. The Config:VM mapping is 1:n.

### GCE Instance

A GCE instance is a live virtual machine running in Google Compute Engine. An
instance is created from a VM by [createInstance](#createInstance). Instances
are immutable. Changes made to a VM will only be reflected when creating a new
instance. The VM:instance mapping is 1:1.

### Swarming Bot

A Swarming bot is the Swarming server's view of a connected instance. Instances
automatically register themselves as bots of a particular Swarming server
outside the scope of the backend package. Bots may freely be terminated or
deleted from the Swarming server and the backend package will react accordingly.
The instance:bot mapping is 1:1.

### Deadline

The deadline is the time at which an instance becomes due for replacement. An
instance's deadline is derived from the lifetime in the Config and the instance
creation time. Once the deadline has passed, the backend package will attempt
to replace the instance after it finishes its current Swarming workload.
Replacing the instance is how changes to VMs are picked up, since instances are
immutable.
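
As a rough illustration, the deadline can be computed as the creation time plus
the lifetime. The function names and the choice of Unix seconds for both values
are assumptions made for this sketch, not details taken from the backend
package.

```go
package main

import "fmt"

// deadline returns the Unix time, in seconds, after which the instance is due
// for replacement. Storing both inputs as seconds is an assumption of this
// sketch.
func deadline(createdUnix, lifetimeSecs int64) int64 {
	return createdUnix + lifetimeSecs
}

// expired reports whether the deadline has passed at nowUnix.
func expired(createdUnix, lifetimeSecs, nowUnix int64) bool {
	return nowUnix > deadline(createdUnix, lifetimeSecs)
}

func main() {
	const created, lifetime = 1_700_000_000, 86_400 // lifetime of one day

	fmt.Println(deadline(created, lifetime))                // 1700086400
	fmt.Println(expired(created, lifetime, created+3_600))  // false
	fmt.Println(expired(created, lifetime, created+90_000)) // true
}
```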

### Drained

A drained VM is one scheduled for deletion because its Config has been altered
to declare fewer VMs. A drained VM will be deleted once its corresponding
instance has been deleted. A drained Config is one scheduled for deletion by
some external factor. All VMs of a drained Config will be drained. A drained
Config will be deleted once its corresponding VMs have been deleted.

## Cron Jobs

All cron jobs operate on multiple entities, triggering task queues which operate
on a particular entity. All cron jobs are idempotent.

### expandConfigsAsync

expandConfigsAsync iterates over all Configs and triggers
[expandConfig](#expandConfig) for each.

### createInstancesAsync

createInstancesAsync iterates over all VMs which have no corresponding instance
and triggers [createInstance](#createInstance) for each.

### manageBotsAsync

manageBotsAsync iterates over all VMs which do have a corresponding instance and
triggers [manageBot](#manageBot) for each.
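
The split between createInstancesAsync and manageBotsAsync can be sketched as
two filters over the same list of VMs. The `vm` and `task` types below are
hypothetical, and using an empty `URL` field to mean "no instance yet" is an
assumption made for the example.

```go
package main

import "fmt"

// vm is a hypothetical, trimmed-down VM entity. URL is assumed to be empty
// until an instance has been recorded for the VM.
type vm struct {
	ID  string
	URL string
}

// task is a hypothetical queued unit of work: a queue name and the ID of the
// one entity that task should process.
type task struct {
	Queue string
	ID    string
}

// createInstancesAsync emits a createInstance task for every VM that has no
// corresponding instance.
func createInstancesAsync(vms []vm) []task {
	var tasks []task
	for _, v := range vms {
		if v.URL == "" {
			tasks = append(tasks, task{Queue: "createInstance", ID: v.ID})
		}
	}
	return tasks
}

// manageBotsAsync emits a manageBot task for every VM that does have a
// corresponding instance.
func manageBotsAsync(vms []vm) []task {
	var tasks []task
	for _, v := range vms {
		if v.URL != "" {
			tasks = append(tasks, task{Queue: "manageBot", ID: v.ID})
		}
	}
	return tasks
}

func main() {
	vms := []vm{
		{ID: "vm-0", URL: "instances/vm-0"},
		{ID: "vm-1"},
	}
	fmt.Println(createInstancesAsync(vms)) // [{createInstance vm-1}]
	fmt.Println(manageBotsAsync(vms))      // [{manageBot vm-0}]
}
```

Each cron job only decides which entities need work. The work itself happens
in the per-entity task, so running a cron job twice at worst queues a task that
turns out to be a no-op.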

## Task Queues

All task queues are triggered with a particular entity to process. All task
queues are idempotent.

### expandConfig

expandConfig receives a single Config to expand. It checks how many VMs the
Config declares and triggers [createVM](#createVM) for each.
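
A minimal sketch of the expansion step follows. The `config` type, its `Amount`
field, and the `<config ID>-<index>` naming scheme are all assumptions made for
illustration. A deterministic naming scheme of some kind is what lets repeated
expansions of the same Config target the same VMs.

```go
package main

import "fmt"

// config is a hypothetical, trimmed-down Config: an ID and the number of VMs
// it declares.
type config struct {
	ID     string
	Amount int
}

// expandConfig returns the IDs of the VMs the Config declares, one per
// createVM task. Deriving each ID from the Config ID and an index is an
// assumption of this sketch.
func expandConfig(c config) []string {
	if c.Amount <= 0 {
		return nil
	}
	ids := make([]string, 0, c.Amount)
	for i := 0; i < c.Amount; i++ {
		ids = append(ids, fmt.Sprintf("%s-%d", c.ID, i))
	}
	return ids
}

func main() {
	fmt.Println(expandConfig(config{ID: "builder", Amount: 3})) // [builder-0 builder-1 builder-2]
}
```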

### createVM

createVM receives a single VM to create. It creates the VM if it doesn't exist.
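
The create-if-missing behavior might look like the sketch below. The `store`
map is a hypothetical in-memory stand-in for the datastore. Against a real
datastore, the existence check and the write would need to happen in a single
transaction to stay safe under concurrent tasks.

```go
package main

import "fmt"

// vm is a hypothetical, trimmed-down VM entity.
type vm struct {
	ID     string
	Config string
	URL    string
}

// store is a hypothetical in-memory stand-in for the datastore.
type store map[string]*vm

// createVM stores a new VM under id unless one already exists, and reports
// whether it created anything. An existing VM is left untouched, so running
// the task twice is harmless.
func createVM(s store, id, cfg string) bool {
	if _, ok := s[id]; ok {
		return false
	}
	s[id] = &vm{ID: id, Config: cfg}
	return true
}

func main() {
	s := store{}
	fmt.Println(createVM(s, "builder-0", "builder")) // true

	// State recorded later by other tasks must survive a repeated createVM.
	s["builder-0"].URL = "instances/builder-0"
	fmt.Println(createVM(s, "builder-0", "builder")) // false
	fmt.Println(s["builder-0"].URL)                  // instances/builder-0
}
```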

### createInstance

createInstance receives a single VM and attempts to idempotently create an
instance for it. Instance creation in GCE is asynchronous, so the backend
package calls createInstance repeatedly until the instance is detected as
created and then records it. If creation has already started for a
[drained](#drained) VM, it is allowed to complete, but new creation tasks in GCE
are not started for drained VMs.
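
The repeated-call pattern and the drained-VM rule can be sketched as follows.
`fakeGCE` is a hypothetical stand-in for the GCE API in which creation finishes
by the second call. The real API, its operation tracking, and how the backend
records the result are not shown here.

```go
package main

import "fmt"

// vm is a hypothetical, trimmed-down VM entity.
type vm struct {
	ID      string
	Drained bool
	URL     string // empty until the instance is recorded as created
}

// fakeGCE is a hypothetical stand-in for the GCE API. started holds the names
// of instances whose creation has been requested.
type fakeGCE struct {
	started map[string]bool
}

// insert requests creation of the named instance and reports whether the
// instance exists yet. In this fake, the first call starts creation and every
// later call finds it finished.
func (g *fakeGCE) insert(name string) (done bool) {
	if g.started[name] {
		return true
	}
	g.started[name] = true
	return false
}

// createInstance is safe to call repeatedly for the same VM.
func createInstance(g *fakeGCE, v *vm) {
	if v.URL != "" {
		return // the instance was already recorded
	}
	if v.Drained && !g.started[v.ID] {
		return // never start a new creation for a drained VM
	}
	if g.insert(v.ID) {
		v.URL = "instances/" + v.ID // record the created instance
	}
}

func main() {
	g := &fakeGCE{started: map[string]bool{}}
	v := &vm{ID: "builder-0"}
	for i := 1; i <= 3; i++ {
		createInstance(g, v)
		fmt.Printf("call %d: URL=%q\n", i, v.URL)
	}
}
```

The first call only starts creation, the second records the instance, and the
third is a no-op.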

### manageBot

manageBot receives a single VM to manage a bot for. It first checks whether the
Config referenced by the VM no longer exists or no longer references the given
VM, and if so [drains](#drained) the VM if it isn't already drained. Next, it
watches the Swarming server for changes in the bot's state and reacts
accordingly. If Swarming reports that the bot has died or been deleted or
terminated, it triggers [destroyInstance](#destroyInstance). If the VM's
deadline has been exceeded or the VM is [drained](#drained), it triggers
[terminateBot](#terminateBot).
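
The decision logic above can be condensed into a single function. This is a
sketch under stated assumptions: `botStatus`, the `vm` fields, the
`configHasVM` flag, and the string return value are hypothetical simplifications
rather than the backend package's actual types.

```go
package main

import "fmt"

// botStatus is a hypothetical summary of what Swarming reports for a bot.
type botStatus int

const (
	botAlive botStatus = iota
	botDead
	botDeleted
	botTerminated
)

// vm is a hypothetical, trimmed-down VM entity.
type vm struct {
	ID       string
	Drained  bool
	Deadline int64 // Unix seconds; an assumed representation
}

// manageBot returns the name of the task queue to trigger next, or "" when
// there is nothing to do. configHasVM is false when the Config is gone or no
// longer references this VM.
func manageBot(v *vm, configHasVM bool, bot botStatus, nowUnix int64) string {
	if !configHasVM {
		v.Drained = true // idempotent: draining a drained VM changes nothing
	}
	switch {
	case bot == botDead || bot == botDeleted || bot == botTerminated:
		return "destroyInstance"
	case v.Drained || nowUnix > v.Deadline:
		return "terminateBot"
	default:
		return ""
	}
}

func main() {
	v := &vm{ID: "builder-0", Deadline: 200}
	fmt.Printf("%q\n", manageBot(v, true, botAlive, 100))      // ""
	fmt.Printf("%q\n", manageBot(v, true, botAlive, 300))      // "terminateBot"
	fmt.Printf("%q\n", manageBot(v, true, botTerminated, 300)) // "destroyInstance"

	next := manageBot(v, false, botAlive, 100)
	fmt.Printf("%q drained=%v\n", next, v.Drained) // "terminateBot" drained=true
}
```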

### destroyInstance

destroyInstance receives a single VM and attempts to idempotently destroy the
instance created for it. Instance deletion in GCE is asynchronous, so the
backend package calls destroyInstance repeatedly until the instance is detected
as destroyed and then triggers [deleteBot](#deleteBot).

### deleteBot

deleteBot receives a single VM entity to delete the bot for. Bot deletion in
Swarming is synchronous, so the deletion is recorded immediately, and recording
it deletes the VM.
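
The hand-off from destroyInstance to deleteBot can be sketched with in-memory
stand-ins. Everything in the `backend` struct below is hypothetical, including
the simplification that an asynchronous GCE deletion finishes by the second
call.

```go
package main

import "fmt"

// backend bundles hypothetical in-memory stand-ins for GCE, Swarming, and the
// datastore.
type backend struct {
	instances map[string]bool // GCE instances that still exist
	deleting  map[string]bool // GCE deletions already requested
	bots      map[string]bool // bots known to Swarming
	vms       map[string]bool // VM entities in the datastore
}

// destroyInstance requests deletion on the first call and finds the instance
// gone on the next one. It returns the next task to trigger, or "" to signal
// that it should be called again later.
func (b *backend) destroyInstance(id string) string {
	if !b.instances[id] {
		return "deleteBot" // already destroyed, safe to repeat
	}
	if b.deleting[id] {
		delete(b.instances, id) // the asynchronous deletion has finished
		return "deleteBot"
	}
	b.deleting[id] = true
	return ""
}

// deleteBot removes the bot synchronously and records that by deleting the VM.
func (b *backend) deleteBot(id string) {
	delete(b.bots, id)
	delete(b.vms, id)
}

func main() {
	b := &backend{
		instances: map[string]bool{"builder-0": true},
		deleting:  map[string]bool{},
		bots:      map[string]bool{"builder-0": true},
		vms:       map[string]bool{"builder-0": true},
	}
	for b.destroyInstance("builder-0") != "deleteBot" {
		fmt.Println("instance still being destroyed, will retry")
	}
	b.deleteBot("builder-0")
	fmt.Println(len(b.instances), len(b.bots), len(b.vms)) // 0 0 0
}
```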

### terminateBot

terminateBot receives a single VM and attempts to terminate its bot.
Termination in Swarming is asynchronous, so the backend package calls
[manageBot](#manageBot) repeatedly until the bot is detected as terminated.