# Orchestrators

When we talk about an *orchestrator* in SwarmKit, we're not talking about
SwarmKit as a whole, but a specific component that creates and shuts down tasks.
In SwarmKit's [task model](task_model.md), a *service* gets translated into some
number of *tasks*. The service is an abstract description of the workload, and
the tasks are individual units that can be dispatched to specific nodes. An
orchestrator manages these tasks.

The scope of an orchestrator is fairly limited. It creates the corresponding
tasks when a service is created, adds or removes tasks when a service is scaled,
and deletes the linked tasks when a service is deleted. In general, it does not
make scheduling decisions, which are left to the [scheduler](scheduler.md).
However, the *global orchestrator* does create tasks that are bound to specific
nodes, because tasks from global services can't be scheduled freely.

## Event handling

There are two general types of events an orchestrator handles: service-level
events and task-level events.

Some examples of service-level events are a new service being created, or an
existing service being updated. In these cases, the orchestrator will create
and shut down tasks as necessary to satisfy the service definition.

An example of a task-level event is a failure being reported for a particular
task instance. In this case, the orchestrator will restart this task, if
appropriate. (Note that *restart* in this context means starting a new task to
replace the old one.) Node events are similar: if a node fails, the orchestrator
can restart tasks which ran on that node.

Handling these two kinds of events separately makes the orchestrator more
efficient. A simple, naive design would reconcile the service every time a
relevant event is received. Scaling a service and replacing a failed task could
be handled through the same code, which would compare the set of running tasks
with the set of tasks that are supposed to be running, and create or shut down
tasks as necessary. This would be quite inefficient, though: every time
something needed to trigger a task restart, we'd have to look at every task in
the service. By handling task events separately, an orchestrator can avoid
looking at the whole service except when the service itself changes.
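
As a rough illustration of this split, an orchestrator's main loop can be
thought of as dispatching on the kind of event it receives. The event types and
handler callbacks below are simplified stand-ins, not SwarmKit's actual event
API:

```go
package orchestrator

// A minimal sketch of the event split described above. The event types
// and handler callbacks are illustrative stand-ins.
type serviceEvent struct{ serviceID string }
type taskEvent struct{ taskID string }
type nodeEvent struct{ nodeID string }

func run(events <-chan interface{}, stop <-chan struct{},
	reconcileService func(serviceID string),
	handleTaskEvent func(taskID string),
	handleNodeEvent func(nodeID string)) {
	for {
		select {
		case ev := <-events:
			switch e := ev.(type) {
			case serviceEvent:
				// Service created or updated: reconcile the whole
				// service against its existing tasks.
				reconcileService(e.serviceID)
			case taskEvent:
				// Task failed or changed: look only at this task and,
				// if appropriate, have the restart supervisor replace it.
				handleTaskEvent(e.taskID)
			case nodeEvent:
				// Node failed, drained, or removed: restart or shut
				// down the tasks that were on it.
				handleNodeEvent(e.nodeID)
			}
		case <-stop:
			return
		}
	}
}
```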

## Initialization

When an orchestrator starts up, it needs to do an initial reconciliation pass to
make sure tasks are consistent with the service definitions. In steady-state
operation, actions like restarting failed tasks and deleting tasks when a
service is deleted happen in response to events. However, if there is a
leadership change or cluster restart, some events may have gone unhandled by the
orchestrator. At startup, `CheckTasks` iterates over all the tasks in the store
and takes care of anything that should normally have been handled by an event
handler.
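
A rough sketch of the kind of pass `CheckTasks` makes, with simplified stand-in
types and callbacks in place of the real store and task objects:

```go
package orchestrator

// An illustrative startup pass: walk every task and catch up on work an
// event handler would normally have done. Types are stand-ins.
type TaskState int

const (
	StateReady TaskState = iota
	StateRunning
	StateShutdown
	StateFailed
)

type Task struct {
	ID           string
	ServiceID    string
	DesiredState TaskState
	ActualState  TaskState
}

func checkTasks(tasks []Task, serviceExists func(serviceID string) bool,
	deleteTask func(Task), restart func(Task)) {
	for _, t := range tasks {
		switch {
		case !serviceExists(t.ServiceID):
			// The service was deleted while this orchestrator wasn't
			// running; its leftover tasks should be deleted.
			deleteTask(t)
		case t.ActualState == StateFailed && t.DesiredState <= StateRunning:
			// The task failed while no event handler was watching;
			// hand it to the restart supervisor now.
			restart(t)
		}
	}
}
```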

## Replicated orchestrator

The replicated orchestrator only acts on replicated services, and tasks
associated with replicated services. It ignores other services and tasks.

There's not much magic to speak of. The replicated orchestrator responds to some
task events by triggering restarts through the restart supervisor, which is also
used by the global orchestrator. The restart supervisor is explained in more
detail below. The replicated orchestrator responds to service creations and
updates by reconciling the service, a process that relies on the update
supervisor, also shared by the global orchestrator. When a replicated service is
deleted, the replicated orchestrator deletes all of its tasks.

The service reconciliation process starts by grouping a service's tasks by slot
number (see the explanation of slots in the [task model](task_model.md)
document). These slots are marked either runnable or dead - runnable if at least
one task in the slot has a desired state of `Running` or below, and dead
otherwise.

If there are fewer runnable slots than the number of replicas specified in the
service spec, the orchestrator creates the right number of tasks to make up the
difference, assigning them slot numbers that don't conflict with any runnable
slots.

If there are more runnable slots than the number of replicas specified in the
service spec, the orchestrator deletes the extra tasks. It prefers to remove
tasks from the nodes that have the most instances of this service running, to
keep tasks balanced across nodes. When nodes are tied on the number of tasks
they run, it prefers to remove tasks that aren't running (in terms of observed
state) over tasks that are currently running. Note that scale-down decisions are
made by the orchestrator, and don't quite match the state the scheduler would
arrive at when scaling up. This is an area for future improvement; see
https://github.com/docker/swarmkit/issues/2320 for more details.

In both of these cases, and also in the case where the number of replicas is
already correct, the orchestrator calls the update supervisor to ensure that the
existing tasks (or the tasks being kept, in the case of a scale-down) are
up-to-date. The update supervisor does the heavy lifting involved in rolling
updates and automatic rollbacks, but this is all abstracted from the
orchestrator.
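
Schematically, the reconciliation decision might look like the following sketch,
which uses simplified stand-in types rather than SwarmKit's actual store and
task objects (a slot is reduced to the desired states of its tasks):

```go
package orchestrator

// An illustrative sketch of replicated-service reconciliation.
type desiredState int

const (
	ready desiredState = iota
	running
	shutdown
)

// reconcile counts runnable slots and reports how many new slots should
// be created or how many should be removed to match the replica count.
func reconcile(desiredReplicas int, slots map[uint64][]desiredState) (create, remove int) {
	runnable := 0
	for _, states := range slots {
		for _, ds := range states {
			// A slot is runnable if any of its tasks has a desired
			// state of running or below; otherwise it is dead.
			if ds <= running {
				runnable++
				break
			}
		}
	}
	if runnable < desiredReplicas {
		create = desiredReplicas - runnable // use unused slot numbers
	} else if runnable > desiredReplicas {
		remove = runnable - desiredReplicas // prefer the most crowded nodes
	}
	// Either way, the kept slots are then handed to the update
	// supervisor to be brought up to date.
	return create, remove
}
```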

## Global orchestrator

The global orchestrator works similarly to the replicated orchestrator, but
tries to maintain one task per active node meeting the constraints, instead of a
specific total number of replicas. It ignores services that aren't global
services and tasks that aren't associated with global services.

The global orchestrator responds to task events in much the same way that the
replicated orchestrator does. If a task fails, the global orchestrator will
indicate to the restart supervisor that a restart may be needed.

When a service is created, updated, or deleted, this triggers a reconciliation.
The orchestrator has to check whether each node meets the constraints for the
service, and create or update tasks on that node if it does. The tasks are
created with a specific node ID pre-filled. They pass through the scheduler so
that the scheduler can wait for the nodes to have sufficient resources before
moving the desired state to `Assigned`, but the scheduler does not make the
actual scheduling decision.
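
A simplified sketch of this per-node reconciliation, with stand-in types and
callbacks in place of the real store, constraint expressions, and task objects:

```go
package orchestrator

// An illustrative sketch of global-service reconciliation: one task per
// node that meets the constraints, with the node ID bound up front.
type node struct {
	ID     string
	Labels map[string]string
}

type boundTask struct {
	ServiceID string
	NodeID    string // pre-filled for global services
}

func reconcileGlobal(serviceID string, nodes []node,
	meetsConstraints func(node) bool, hasTask func(nodeID string) bool,
	createTask func(boundTask)) {
	for _, n := range nodes {
		if !meetsConstraints(n) {
			continue // this node shouldn't run the service
		}
		if hasTask(n.ID) {
			continue // the node already has a task for this service
		}
		// The task is bound to the node here; the scheduler only waits
		// for resources before advancing it to Assigned.
		createTask(boundTask{ServiceID: serviceID, NodeID: n.ID})
	}
}
```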

The global orchestrator also responds to node events. These trigger
reconciliations much like service events do. A new node might mean creating a
task from each service on that node, and a deleted node would mean deleting any
global service tasks from that node. When a node gets drained, the global
orchestrator shuts down any global service tasks running on that node. It also
does this when a node goes down, which avoids stuck rolling updates that would
otherwise want to update the task on the unavailable node before proceeding.

Like the replicated orchestrator, the global orchestrator uses the update
supervisor to implement rolling updates and automatic rollbacks. Instead of
passing tasks to the update supervisor by slot, it groups them by node. This
means rolling updates will go node-by-node instead of slot-by-slot.

## Restart supervisor

The restart supervisor manages the process of shutting down a task, and
possibly starting a replacement task. Its entry point is a `Restart` method
which is called inside a store write transaction in one of the orchestrators.
It atomically changes the desired state of the old task to `Shutdown`, and, if
it's appropriate to start a replacement task based on the service's restart
policy, creates a new task in the same slot (replicated service) or on the same
node (global service).

If the service is set up with a restart delay, the restart supervisor handles
this delay too. It initially creates the new task with the desired state
`Ready`, and only changes the desired state to `Running` after the delay has
elapsed. One of the things the orchestrators do when they start up is check for
tasks that were in this delay phase of being restarted, and make sure they get
advanced to `Running`.
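
A simplified sketch of this flow, with stand-in callbacks in place of the real
store transaction and task objects:

```go
package orchestrator

import "time"

// An illustrative sketch of Restart: shut the old task down, create its
// replacement in Ready, and promote it to Running after the delay.
func restartTask(oldTaskID string, delay time.Duration,
	setDesiredState func(taskID, state string),
	createReplacement func(oldTaskID, desiredState string) (newTaskID string)) {
	// Both of these changes happen in the caller's store write
	// transaction, so they appear atomically.
	setDesiredState(oldTaskID, "SHUTDOWN")
	newTaskID := createReplacement(oldTaskID, "READY")

	// Once the restart delay has elapsed, promote the replacement so
	// the scheduler and agent actually start it.
	time.AfterFunc(delay, func() {
		setDesiredState(newTaskID, "RUNNING")
	})
}
```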

In some cases, a task can fail or be rejected before its desired state reaches
`Running`. One example is a failure to pull an image from a registry. The
restart supervisor tries to make sure this doesn't result in fast restart loops
that effectively ignore the restart delay. If `Restart` is called on a task that
the restart supervisor is still in the process of starting up - i.e. it hasn't
moved the task to `Running` yet - it will wait for the restart delay to elapse
before triggering this second restart.

The restart supervisor implements the logic to decide whether a task should be
restarted, and since this can depend on restart history (when `MaxAttempts` is
set), the restart supervisor keeps track of this history. The history isn't
persisted, so some restart behavior may be slightly off after a restart or
leader election.
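
Schematically, the decision might look like this sketch, with a simplified
stand-in for the restart policy and the in-memory history:

```go
package orchestrator

import "time"

// An illustrative restart-policy check; field names are stand-ins.
type restartPolicy struct {
	MaxAttempts uint64        // 0 is treated as unlimited here
	Window      time.Duration // attempts are only counted within this window
}

// shouldRestart counts recent restart attempts against MaxAttempts. The
// history lives only in memory, which is why behavior can be slightly
// off after a restart or leader change.
func shouldRestart(history []time.Time, policy restartPolicy, now time.Time) bool {
	if policy.MaxAttempts == 0 {
		return true
	}
	recent := 0
	for _, attempt := range history {
		if policy.Window == 0 || now.Sub(attempt) <= policy.Window {
			recent++
		}
	}
	return uint64(recent) < policy.MaxAttempts
}
```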

Note that a call to `Restart` doesn't always end up with the task being
restarted - this depends on the service's configuration. `Restart` can be
understood as "make sure this task gets shut down, and maybe start a replacement
if the service configuration says to".

## Update supervisor

The update supervisor is the component that updates existing tasks to match the
latest version of the service. This means shutting down the old task and
starting a new one to replace it. The update supervisor implements rolling
updates and automatic rollback.

The update supervisor operates on an abstract notion of slots, which are either
slot numbers for replicated services, or node IDs for global services. You can
think of it as reconciling the contents of each slot with the service. If a slot
has more than one task or fewer than one task, it corrects that. If the task (or
tasks) in a slot are out of date, they are replaced with a single task that's up
to date.

Every time the update supervisor is called to start an update of a service, it
spawns an `Updater` set up to work toward this goal. Each service can only have
one `Updater` at a time, so if the service already had a different update in
progress, that update is interrupted and replaced by the new one. The `Updater`
runs in its own goroutine, going through the slots and reconciling them with the
current service. It starts by checking which of the slots are dirty. If they
are all up to date and have a single task, it can finish immediately.
Otherwise, it starts as many worker goroutines as the update parallelism
setting allows, and lets them consume dirty slots from a channel.
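
A simplified sketch of this worker pool, with a callback standing in for the
per-slot reconciliation work:

```go
package orchestrator

import "sync"

// An illustrative sketch of the updater's worker pool: dirty slots are
// fed through a channel and consumed by up to `parallelism` workers.
func runUpdate(dirtySlots []uint64, parallelism int, updateSlot func(slot uint64)) {
	if parallelism <= 0 {
		parallelism = len(dirtySlots) // treat 0 as "no limit"
	}

	slotQueue := make(chan uint64)
	var wg sync.WaitGroup
	for i := 0; i < parallelism; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for slot := range slotQueue {
				// Reconcile one slot with the current service spec:
				// shut down stale tasks and start an up-to-date one.
				updateSlot(slot)
			}
		}()
	}

	for _, slot := range dirtySlots {
		slotQueue <- slot
	}
	close(slotQueue)
	wg.Wait()
}
```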

The workers do the work of reconciling an individual slot with the service. If
there is a runnable task in the slot which is already up to date, this may only
involve making sure that task is started and shutting down the other tasks in
the slot. Otherwise, the worker will shut down all tasks in the slot and create
a new one that's up-to-date. It can either do this atomically, or start the new
task before the old one shuts down, depending on the update settings.

The updater watches task events to see if any of the new tasks it created fail
while the update is still running. If enough fail, and the update is set up to
pause or roll back after a certain threshold of failures, the updater will pause
or roll back the update. Pausing involves setting `UpdateStatus.State` on the
service to "paused". The updater recognizes this as a paused update, and it
won't try to update the service again until the flag gets cleared by
`controlapi` the next time a client updates the service. Rolling back involves
setting `UpdateStatus.State` to "rollback started", then copying `PreviousSpec`
into `Spec`, updating `SpecVersion` accordingly, and clearing `PreviousSpec`.
This triggers a reconciliation in the replicated or global orchestrator, which
ends up calling the update supervisor again to "update" the tasks to the
previous version of the service. Effectively, the updater just gets called again
in reverse. The updater knows when it's being used in a rollback scenario, based
on `UpdateStatus.State`, so it can choose the appropriate update parameters and
avoid rolling back a rollback, but other than that, the logic is the same
whether an update is moving forward or in reverse.
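
A simplified sketch of the rollback trigger, with a stand-in service type in
place of SwarmKit's `api.Service` (how `SpecVersion` is updated is glossed over
here):

```go
package orchestrator

// An illustrative sketch of triggering a rollback: flip the update
// status, then swap the previous spec back in so the orchestrators
// reconcile toward it.
type serviceSpec struct{ Image string }

type service struct {
	Spec         serviceSpec
	PreviousSpec *serviceSpec
	SpecVersion  uint64 // stand-in for the real SpecVersion
	UpdateState  string // stand-in for UpdateStatus.State
}

func startRollback(s *service) bool {
	if s.PreviousSpec == nil {
		return false // nothing to roll back to
	}
	s.UpdateState = "ROLLBACK_STARTED"
	// Copy the previous spec into place and clear it, so a rollback
	// can't itself be rolled back.
	s.Spec = *s.PreviousSpec
	s.PreviousSpec = nil
	s.SpecVersion++ // simplified stand-in for updating SpecVersion
	return true
}
```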

The updater waits the time interval given by `Monitor` after the update
completes. This allows it to notice problems after it's done updating tasks, and
take the actions that were requested for failure cases. For example, if a
service only has one task, has `Monitor` set to 5 seconds, and `FailureAction`
set to "rollback", the updater will wait 5 seconds after updating the task.
Then, if the new task fails within those 5 seconds, the updater will be able to
trigger a rollback. Without waiting, the updater would end up finishing
immediately after creating and starting the new task, and probably wouldn't be
around to respond by the time the task failed.

## Task reaper

As discussed above, restarting a task involves shutting down the old task and
starting a new one. If restarts happen frequently, a lot of old tasks that
aren't actually running might accumulate.

The task reaper implements configurable garbage collection of these
no-longer-running tasks. The number of old tasks to keep per slot or node is
controlled by `Orchestration.TaskHistoryRetentionLimit` in the cluster's
`ClusterSpec`.

The task reaper watches for task creation events, and adds the slots or nodes
from these events to a watchlist. It periodically iterates over the watchlist
and, for any slot or node whose old tasks exceed the retention limit, deletes
the excess tasks, preferring those with the oldest `Status` timestamps.
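
A simplified sketch of the pruning step for a single slot or node on the
watchlist, with a stand-in task type (treating a negative retention limit as
"keep everything" is an assumption of this sketch):

```go
package orchestrator

import (
	"sort"
	"time"
)

// An illustrative sketch of history pruning: keep the newest old tasks
// up to the retention limit and delete the rest, oldest first.
type historicTask struct {
	ID              string
	StatusTimestamp time.Time // stand-in for the task's Status timestamp
}

func pruneHistory(history []historicTask, retentionLimit int, deleteTask func(id string)) {
	if retentionLimit < 0 || len(history) <= retentionLimit {
		return // nothing to prune (negative limit = unlimited here)
	}
	// Sort so the oldest status timestamps come first and get deleted.
	sort.Slice(history, func(i, j int) bool {
		return history[i].StatusTimestamp.Before(history[j].StatusTimestamp)
	})
	for _, t := range history[:len(history)-retentionLimit] {
		deleteTask(t.ID)
	}
}
```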