# Orchestrators

When we talk about an *orchestrator* in SwarmKit, we're not talking about
SwarmKit as a whole, but a specific component that creates and shuts down tasks.
In SwarmKit's [task model](task_model.md), a *service* gets translated into some
number of *tasks*. The service is an abstract description of the workload, and
the tasks are individual units that can be dispatched to specific nodes. An
orchestrator manages these tasks.

The scope of an orchestrator is fairly limited. It creates the corresponding
tasks when a service is created, adds or removes tasks when a service is scaled,
and deletes the linked tasks when a service is deleted. In general, it does not
make scheduling decisions, which are left to the [scheduler](scheduler.md).
However, the *global orchestrator* does create tasks that are bound to specific
nodes, because tasks from global services can't be scheduled freely.

## Event handling

There are two general types of events an orchestrator handles: service-level
events and task-level events.

Some examples of service-level events are a new service being created, or an
existing service being updated. In these cases, the orchestrator will create
and shut down tasks as necessary to satisfy the service definition.

An example of a task-level event is a failure being reported for a particular
task instance. In this case, the orchestrator will restart this task, if
appropriate. (Note that *restart* in this context means starting a new task to
replace the old one.) Node events are similar: if a node fails, the orchestrator
can restart tasks which ran on that node.

This combination of events makes the orchestrator more efficient. A simple,
naive design would involve reconciling the service every time a relevant event
is received. Scaling a service and replacing a failed task could be handled
through the same code, which would compare the set of running tasks with the set
of tasks that are supposed to be running, and create or shut down tasks as
necessary. This would be quite inefficient, though: every time something needed
to trigger a task restart, we'd have to look at every task in the service. By
handling task events separately, an orchestrator can avoid looking at the whole
service except when the service itself changes.

## Initialization

When an orchestrator starts up, it needs to do an initial reconciliation pass to
make sure tasks are consistent with the service definitions. In steady-state
operation, actions like restarting failed tasks and deleting tasks when a
service is deleted happen in response to events. However, if there is a
leadership change or cluster restart, some events may have gone unhandled by the
orchestrator. At startup, `CheckTasks` iterates over all the tasks in the store
and takes care of anything that would normally have been handled by an event
handler.

## Replicated orchestrator

The replicated orchestrator only acts on replicated services, and tasks
associated with replicated services. It ignores other services and tasks.

There's not much magic to speak of. The replicated orchestrator responds to some
task events by triggering restarts through the restart supervisor, which is also
used by the global orchestrator. The restart supervisor is explained in more
detail below. The replicated orchestrator responds to service creations and
updates by reconciling the service, a process that relies on the update
supervisor, also shared by the global orchestrator. When a replicated service is
deleted, the replicated orchestrator deletes all of its tasks.

The service reconciliation process starts by grouping a service's tasks by slot
number (see the explanation of slots in the [task model](task_model.md)
document). These slots are marked either runnable or dead - runnable if at least
one task has a desired state of `Running` or below, and dead otherwise.

If there are fewer runnable slots than the number of replicas specified in the
service spec, the orchestrator creates enough new tasks to make up the
difference, assigning them slot numbers that don't conflict with any runnable
slots.

If there are more runnable slots than the number of replicas specified in the
service spec, the orchestrator deletes the extra tasks. It attempts to remove
tasks on the nodes that have the most instances of this service running, to
maintain balance in the way tasks are assigned to nodes. When multiple nodes are
tied for the number of tasks running, it prefers to remove tasks that aren't
running (in terms of observed state) over tasks that are currently running. Note
that scale-down decisions are made by the orchestrator, and don't quite match
the state the scheduler would arrive at when scaling up. This is an area for
future improvement; see https://github.com/docker/swarmkit/issues/2320 for more
details.

In both of these cases, and also in the case where the number of replicas is
already correct, the orchestrator calls the update supervisor to ensure that the
existing tasks (or the tasks being kept, in the case of a scale-down) are up to
date. The update supervisor does the heavy lifting involved in rolling updates
and automatic rollbacks, but this is all abstracted from the orchestrator.

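To make the slot bookkeeping concrete, here is a minimal sketch of the scale-up
side in Go. The `Task` type and the `reconcileReplicas` helper are simplified
stand-ins invented for illustration, not SwarmKit's actual types or API; the
real orchestrator does all of this against the store inside a transaction, and
its scale-down path additionally picks victims as described above.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for SwarmKit's task types.
type TaskState int

const (
	StateNew TaskState = iota
	StateReady
	StateRunning
	StateShutdown
)

type Task struct {
	ID           string
	Slot         uint64
	DesiredState TaskState
}

// reconcileReplicas sketches the slot bookkeeping described above: group tasks
// by slot, mark each slot runnable if at least one task has a desired state of
// Running or below, then report how far the service is from the desired
// replica count.
func reconcileReplicas(tasks []Task, replicas int) (slotsToCreate int, runnableSlots []uint64) {
	runnable := map[uint64]bool{}
	for _, t := range tasks {
		if t.DesiredState <= StateRunning {
			runnable[t.Slot] = true
		} else if _, seen := runnable[t.Slot]; !seen {
			runnable[t.Slot] = false // dead slot so far: every task is past Running
		}
	}

	for slot, isRunnable := range runnable {
		if isRunnable {
			runnableSlots = append(runnableSlots, slot)
		}
	}

	if len(runnableSlots) < replicas {
		slotsToCreate = replicas - len(runnableSlots)
	}
	// New tasks would get slot numbers that don't collide with any runnable
	// slot. If there are too many runnable slots instead, the real
	// orchestrator removes tasks from the most loaded nodes, preferring tasks
	// that aren't actually running.
	return slotsToCreate, runnableSlots
}

func main() {
	tasks := []Task{
		{ID: "t1", Slot: 1, DesiredState: StateRunning},
		{ID: "t2", Slot: 2, DesiredState: StateShutdown}, // dead slot
	}
	toCreate, runnable := reconcileReplicas(tasks, 3)
	fmt.Printf("runnable slots: %v, new slots needed: %d\n", runnable, toCreate)
}
```
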
## Global orchestrator

The global orchestrator works similarly to the replicated orchestrator, but
tries to maintain one task per active node that meets the constraints, instead
of a specific total number of replicas. It ignores services that aren't global
services and tasks that aren't associated with global services.

The global orchestrator responds to task events in much the same way that the
replicated orchestrator does. If a task fails, the global orchestrator will
indicate to the restart supervisor that a restart may be needed.

When a service is created, updated, or deleted, this triggers a reconciliation.
The orchestrator has to check whether each node meets the constraints for the
service, and create or update tasks on that node if it does. The tasks are
created with a specific node ID pre-filled. They pass through the scheduler so
that the scheduler can wait for the node to have sufficient resources before
moving the desired state to `Assigned`, but the scheduler does not make the
actual scheduling decision.

The global orchestrator also responds to node events. These trigger
reconciliations much like service events do. A new node might mean creating a
task for each global service on that node, and a deleted node means deleting any
global service tasks that were on that node. When a node gets drained, the
global orchestrator shuts down any global service tasks running on that node. It
also does this when a node goes down, which avoids stuck rolling updates that
would otherwise want to update the task on the unavailable node before
proceeding.

Like the replicated orchestrator, the global orchestrator uses the update
supervisor to implement rolling updates and automatic rollbacks. Instead of
passing tasks to the update supervisor by slot, it groups them by node. This
means rolling updates go node by node instead of slot by slot.

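The per-node pass might look roughly like the sketch below. `Node`, `Task`,
`meetsConstraints`, and `reconcileGlobal` are hypothetical simplified names, not
SwarmKit's real constraint or store machinery, but the shape matches the
description above: one task per eligible node, with the node ID fixed before the
task ever reaches the scheduler.

```go
package main

import "fmt"

// Hypothetical, simplified types; the real global orchestrator uses the
// store's node and task objects and SwarmKit's constraint evaluation.
type Node struct {
	ID        string
	Labels    map[string]string
	Available bool // false when the node is drained or down
}

type Task struct {
	ID        string
	ServiceID string
	NodeID    string // pre-filled for global services
}

// meetsConstraints is a stand-in for real constraint evaluation: here, plain
// label equality.
func meetsConstraints(n Node, constraints map[string]string) bool {
	for k, v := range constraints {
		if n.Labels[k] != v {
			return false
		}
	}
	return true
}

// reconcileGlobal sketches the per-node pass: one task per eligible node,
// created with the node ID already set, and shutdowns for nodes that are
// drained, down, or no longer matching.
func reconcileGlobal(serviceID string, constraints map[string]string, nodes []Node, tasks []Task) (create, shutdown []Task) {
	tasksByNode := map[string][]Task{}
	for _, t := range tasks {
		tasksByNode[t.NodeID] = append(tasksByNode[t.NodeID], t)
	}

	for _, n := range nodes {
		eligible := n.Available && meetsConstraints(n, constraints)
		existing := tasksByNode[n.ID]

		switch {
		case eligible && len(existing) == 0:
			// The scheduler still waits for resources before moving this task
			// to Assigned, but it never picks the node: the node ID is fixed here.
			create = append(create, Task{ServiceID: serviceID, NodeID: n.ID})
		case !eligible:
			// Drained, down, or no-longer-matching nodes get their tasks shut down.
			shutdown = append(shutdown, existing...)
		}
		// Eligible nodes that already have a task are left alone here; keeping
		// those tasks up to date is the update supervisor's job.
	}
	return create, shutdown
}

func main() {
	nodes := []Node{
		{ID: "node-1", Labels: map[string]string{"zone": "a"}, Available: true},
		{ID: "node-2", Labels: map[string]string{"zone": "a"}, Available: false}, // drained
	}
	tasks := []Task{{ID: "t1", ServiceID: "svc", NodeID: "node-2"}}
	create, shutdown := reconcileGlobal("svc", map[string]string{"zone": "a"}, nodes, tasks)
	fmt.Printf("create: %v\nshut down: %v\n", create, shutdown)
}
```
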
## Restart supervisor

The restart supervisor manages the process of shutting down a task, and
possibly starting a replacement task. Its entry point is a `Restart` method
which is called inside a store write transaction in one of the orchestrators.
It atomically changes the desired state of the old task to `Shutdown`, and, if
it's appropriate to start a replacement task based on the service's restart
policy, creates a new task in the same slot (replicated service) or on the same
node (global service).

If the service is set up with a restart delay, the restart supervisor handles
this delay too. It initially creates the new task with the desired state
`Ready`, and only changes the desired state to `Running` after the delay has
elapsed. One of the things the orchestrators do when they start up is check for
tasks that were in this delay phase of being restarted, and make sure they get
advanced to `Running`.

In some cases, a task can fail or be rejected before its desired state reaches
`Running`. One example is a failure to pull an image from a registry. The
restart supervisor tries to make sure this doesn't result in fast restart loops
that effectively ignore the restart delay. If `Restart` is called on a task that
the restart supervisor is still in the process of starting up - i.e. it hasn't
moved the task to `Running` yet - it will wait for the restart delay to elapse
before triggering this second restart.

The restart supervisor implements the logic that decides whether a task should
be restarted, and since this can depend on restart history (when `MaxAttempts`
is set), the restart supervisor keeps track of this history. The history isn't
persisted, so restart behavior may be slightly off after a leadership change or
cluster restart.

Note that a call to `Restart` doesn't always end up with the task being
restarted - this depends on the service's configuration. `Restart` can be
understood as "make sure this task gets shut down, and maybe start a replacement
if the service configuration says to".

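A rough sketch of this flow, using hypothetical simplified types rather than
SwarmKit's real store transaction and protobuf types (`Supervisor`, `Restart`,
and `promote` are invented names for illustration):

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical, simplified stand-ins for SwarmKit's task and policy types.
type TaskState int

const (
	StateReady TaskState = iota
	StateRunning
	StateShutdown
)

type Task struct {
	ID           string
	Slot         uint64
	DesiredState TaskState
}

type RestartPolicy struct {
	MaxAttempts uint64
	Delay       time.Duration
}

type Supervisor struct {
	history map[uint64]uint64 // restarts per slot; in-memory only, not persisted
}

// Restart sketches the flow described above: always shut the old task down,
// and only create a replacement if the restart policy still allows it. The
// replacement starts with desired state Ready; promotion to Running happens
// after the restart delay.
func (s *Supervisor) Restart(old *Task, policy RestartPolicy, promote func(*Task)) *Task {
	old.DesiredState = StateShutdown

	if policy.MaxAttempts > 0 && s.history[old.Slot] >= policy.MaxAttempts {
		return nil // restart history says we've hit the limit
	}
	s.history[old.Slot]++

	replacement := &Task{ID: old.ID + "-r", Slot: old.Slot, DesiredState: StateReady}
	time.AfterFunc(policy.Delay, func() { promote(replacement) })
	return replacement
}

func main() {
	s := &Supervisor{history: map[uint64]uint64{}}
	old := &Task{ID: "t1", Slot: 1, DesiredState: StateRunning}
	repl := s.Restart(old, RestartPolicy{MaxAttempts: 3, Delay: 100 * time.Millisecond}, func(t *Task) {
		t.DesiredState = StateRunning
		fmt.Println("promoted", t.ID, "to Running")
	})
	fmt.Println("old desired state:", old.DesiredState, "replacement:", repl.ID)
	time.Sleep(200 * time.Millisecond) // wait for the delayed promotion
}
```
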
## Update supervisor

The update supervisor is the component that updates existing tasks to match the
latest version of the service. This means shutting down the old task and
starting a new one to replace it. The update supervisor implements rolling
updates and automatic rollback.

The update supervisor operates on an abstract notion of slots, which are either
slot numbers for replicated services, or node IDs for global services. You can
think of it as reconciling the contents of each slot with the service. If a slot
has more than one task or fewer than one task, it corrects that. If the task (or
tasks) in a slot are out of date, they are replaced with a single task that's up
to date.

Every time the update supervisor is called to start an update of a service, it
spawns an `Updater` set up to work toward this goal. Each service can only have
one `Updater` at a time, so if the service already had a different update in
progress, that update is interrupted and replaced by the new one. The `Updater`
runs in its own goroutine, going through the slots and reconciling them with the
current service. It starts by checking which of the slots are dirty. If they are
all up to date and have a single task, it can finish immediately. Otherwise, it
starts as many worker goroutines as the update parallelism setting allows, and
lets them consume dirty slots from a channel.

The workers do the work of reconciling an individual slot with the service. If
there is a runnable task in the slot which is up to date, this may only involve
starting up the up-to-date task and shutting down the other tasks. Otherwise,
the worker will shut down all tasks in the slot and create a new one that's up
to date. It can either do this atomically, or start the new task before the old
one shuts down, depending on the update settings.

The updater watches task events to see whether any of the new tasks it created
fail while the update is still running. If enough fail, and the update is set up
to pause or roll back after a certain threshold of failures, the updater will
pause or roll back the update. Pausing involves setting `UpdateStatus.State` on
the service to "paused". This is recognized as a paused update by the updater,
and it won't try to update the service again until the flag gets cleared by
`controlapi` the next time a client updates the service. Rolling back involves
setting `UpdateStatus.State` to "rollback started", then copying `PreviousSpec`
into `Spec`, updating `SpecVersion` accordingly, and clearing `PreviousSpec`.
This triggers a reconciliation in the replicated or global orchestrator, which
ends up calling the update supervisor again to "update" the tasks to the
previous version of the service. Effectively, the updater just gets called again
in reverse. The updater knows when it's being used in a rollback scenario, based
on `UpdateStatus.State`, so it can choose the appropriate update parameters and
avoid rolling back a rollback, but other than that, the logic is the same
whether an update is moving forward or in reverse.

The updater waits the time interval given by `Monitor` after the update
completes. This allows it to notice problems after it's done updating tasks, and
take the actions that were requested for failure cases. For example, if a
service only has one task, has `Monitor` set to 5 seconds, and `FailureAction`
set to "rollback", the updater will wait 5 seconds after updating the task.
Then, if the new task fails within 5 seconds, the updater will be able to
trigger a rollback. Without waiting, the updater would finish immediately after
creating and starting the new task, and probably wouldn't be around to respond
by the time the task failed.

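The worker pattern described above amounts to a bounded worker pool consuming
dirty slots from a channel. The sketch below uses hypothetical simplified types
(`Slot`, `updateSlot`, `runUpdate` are invented for illustration) and leaves out
the event watching, failure thresholds, and pause/rollback handling that the
real `Updater` layers on top:

```go
package main

import (
	"fmt"
	"sync"
)

type Slot struct {
	ID    string // slot number for replicated services, node ID for global ones
	Dirty bool
}

// updateSlot stands in for shutting down out-of-date tasks in the slot and
// creating an up-to-date replacement (atomically or start-first, depending on
// the update settings).
func updateSlot(s Slot) error {
	fmt.Println("updating slot", s.ID)
	return nil
}

// runUpdate starts as many workers as the parallelism setting allows and feeds
// them the dirty slots over a channel, as described above.
func runUpdate(slots []Slot, parallelism int) {
	if parallelism < 1 {
		parallelism = 1
	}

	dirty := make(chan Slot)
	var wg sync.WaitGroup

	for i := 0; i < parallelism; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for s := range dirty {
				if err := updateSlot(s); err != nil {
					fmt.Println("slot", s.ID, "failed:", err)
				}
			}
		}()
	}

	for _, s := range slots {
		if s.Dirty {
			dirty <- s
		}
	}
	close(dirty)
	wg.Wait()
	// The real updater then waits for the Monitor interval before declaring
	// the update complete, so late task failures can still trigger a pause or
	// rollback.
}

func main() {
	slots := []Slot{{ID: "1", Dirty: true}, {ID: "2", Dirty: false}, {ID: "3", Dirty: true}}
	runUpdate(slots, 2)
}
```
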
## Task reaper

As discussed above, restarting a task involves shutting down the old task and
starting a new one. If restarts happen frequently, a lot of old tasks that
aren't actually running might accumulate.

The task reaper implements configurable garbage collection of these
no-longer-running tasks. The number of old tasks to keep per slot or node is
controlled by `Orchestration.TaskHistoryRetentionLimit` in the cluster's
`ClusterSpec`.

The task reaper watches for task creation events, and adds the slots or nodes
from these events to a watchlist. It periodically iterates over the watchlist
and deletes tasks from the referenced slots or nodes which exceed the retention
limit. It prefers to delete tasks with the oldest `Status` timestamps.

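A minimal sketch of processing one watchlist entry, with hypothetical simplified
types standing in for the real store objects (`Task` and `reapSlot` are invented
names for illustration):

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Hypothetical, simplified task record; the real reaper works on store tasks
// keyed by (service, slot) or (service, node).
type Task struct {
	ID              string
	Running         bool
	StatusTimestamp time.Time
}

// reapSlot sketches one watchlist entry being processed: tasks that are still
// meant to be running are kept, and of the rest, only the newest `retention`
// history entries survive, deleting the oldest Status timestamps first.
func reapSlot(tasks []Task, retention int) (deleted []Task) {
	var history []Task
	for _, t := range tasks {
		if !t.Running {
			history = append(history, t)
		}
	}
	if len(history) <= retention {
		return nil
	}
	sort.Slice(history, func(i, j int) bool {
		return history[i].StatusTimestamp.Before(history[j].StatusTimestamp)
	})
	return history[:len(history)-retention]
}

func main() {
	now := time.Now()
	tasks := []Task{
		{ID: "t1", Running: true, StatusTimestamp: now},
		{ID: "t2", Running: false, StatusTimestamp: now.Add(-2 * time.Hour)},
		{ID: "t3", Running: false, StatusTimestamp: now.Add(-1 * time.Hour)},
	}
	for _, t := range reapSlot(tasks, 1) { // keep one old task for this slot
		fmt.Println("deleting", t.ID)
	}
}
```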