# Orchestrators

When we talk about an *orchestrator* in SwarmKit, we're not talking about
SwarmKit as a whole, but a specific component that creates and shuts down tasks.
In SwarmKit's [task model](task_model.md), a *service* gets translated into some
number of *tasks*. The service is an abstract description of the workload, and
the tasks are individual units that can be dispatched to specific nodes. An
orchestrator manages these tasks.

The scope of an orchestrator is fairly limited. It creates the corresponding
tasks when a service is created, adds or removes tasks when a service is scaled,
and deletes the linked tasks when a service is deleted. In general, it does not
make scheduling decisions, which are left to the [scheduler](scheduler.md).
However, the *global orchestrator* does create tasks that are bound to specific
nodes, because tasks from global services can't be scheduled freely.

## Event handling

There are two general types of events an orchestrator handles: service-level
events and task-level events.

Some examples of service-level events are a new service being created, or an
existing service being updated. In these cases, the orchestrator will create
and shut down tasks as necessary to satisfy the service definition.

An example of a task-level event is a failure being reported for a particular
task instance. In this case, the orchestrator will restart this task, if
appropriate. (Note that *restart* in this context means starting a new task to
replace the old one.) Node events are similar: if a node fails, the orchestrator
can restart tasks which ran on that node.

This combination of events makes the orchestrator more efficient. A simple,
naive design would involve reconciling the service every time a relevant event
is received. Scaling a service and replacing a failed task could be handled
through the same code, which would compare the set of running tasks with the set
of tasks that are supposed to be running, and create or shut down tasks as
necessary. This would be quite inefficient, though: every time something needed
to trigger a task restart, we'd have to look at every task in the service. By
handling task events separately, an orchestrator can avoid looking at the whole
service except when the service itself changes.

## Initialization

When an orchestrator starts up, it needs to do an initial reconciliation pass to
make sure tasks are consistent with the service definitions. In steady-state
operation, actions like restarting failed tasks and deleting tasks when a
service is deleted happen in response to events. However, if there is a
leadership change or cluster restart, some events may have gone unhandled by the
orchestrator. At startup, `CheckTasks` iterates over all the tasks in the store
and takes care of anything that would normally have been handled by an event
handler.

## Replicated orchestrator

The replicated orchestrator only acts on replicated services, and tasks
associated with replicated services. It ignores other services and tasks.

There's not much magic to speak of. The replicated orchestrator responds to some
task events by triggering restarts through the restart supervisor, which is also
used by the global orchestrator. The restart supervisor is explained in more
detail below. The replicated orchestrator responds to service creations and
updates by reconciling the service, a process that relies on the update
supervisor, also shared by the global orchestrator. When a replicated service is
deleted, the replicated orchestrator deletes all of its tasks.

The service reconciliation process starts by grouping a service's tasks by slot
number (see the explanation of slots in the [task model](task_model.md)
document). These slots are marked either runnable or dead - runnable if at least
one task has a desired state of `Running` or below, and dead otherwise.

If there are fewer runnable slots than the number of replicas specified in the
service spec, the orchestrator creates enough new tasks to make up the
difference, assigning them slot numbers that don't conflict with any runnable
slots.

If there are more runnable slots than the number of replicas specified in the
service spec, the orchestrator deletes the extra tasks. It attempts to remove
tasks on the nodes that have the most instances of this service running, to
maintain balance in the way tasks are assigned to nodes. When multiple nodes are
tied for the number of tasks running, it prefers to remove tasks that aren't
running (in terms of observed state) over tasks that are currently running. Note
that scale-down decisions are made by the orchestrator, and don't quite match
the state the scheduler would arrive at when scaling up. This is an area for
future improvement; see https://github.com/docker/swarmkit/issues/2320 for more
details.

In both of these cases, and also in the case where the number of replicas is
already correct, the orchestrator calls the update supervisor to ensure that the
existing tasks (or the tasks being kept, in the case of a scale-down) are up to
date. The update supervisor does the heavy lifting involved in rolling updates
and automatic rollbacks, but this is all abstracted from the orchestrator.

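To make the slot bookkeeping concrete, here is a minimal sketch of the scale-up
side in Go. The `Task` type and the `reconcileReplicas` helper are simplified
stand-ins invented for illustration, not SwarmKit's actual types or API; the
real orchestrator does all of this against the store inside a transaction, and
its scale-down path additionally picks victims as described above.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for SwarmKit's task types.
type TaskState int

const (
	StateNew TaskState = iota
	StateReady
	StateRunning
	StateShutdown
)

type Task struct {
	ID           string
	Slot         uint64
	DesiredState TaskState
}

// reconcileReplicas sketches the slot bookkeeping described above: group tasks
// by slot, mark each slot runnable if at least one task has a desired state of
// Running or below, then report how far the service is from the desired
// replica count.
func reconcileReplicas(tasks []Task, replicas int) (slotsToCreate int, runnableSlots []uint64) {
	runnable := map[uint64]bool{}
	for _, t := range tasks {
		if t.DesiredState <= StateRunning {
			runnable[t.Slot] = true
		} else if _, seen := runnable[t.Slot]; !seen {
			runnable[t.Slot] = false // dead slot so far: every task is past Running
		}
	}

	for slot, isRunnable := range runnable {
		if isRunnable {
			runnableSlots = append(runnableSlots, slot)
		}
	}

	if len(runnableSlots) < replicas {
		slotsToCreate = replicas - len(runnableSlots)
	}
	// New tasks would get slot numbers that don't collide with any runnable
	// slot. If there are too many runnable slots instead, the real
	// orchestrator removes tasks from the most loaded nodes, preferring tasks
	// that aren't actually running.
	return slotsToCreate, runnableSlots
}

func main() {
	tasks := []Task{
		{ID: "t1", Slot: 1, DesiredState: StateRunning},
		{ID: "t2", Slot: 2, DesiredState: StateShutdown}, // dead slot
	}
	toCreate, runnable := reconcileReplicas(tasks, 3)
	fmt.Printf("runnable slots: %v, new slots needed: %d\n", runnable, toCreate)
}
```
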
## Global orchestrator

The global orchestrator works similarly to the replicated orchestrator, but
tries to maintain one task per active node that meets the constraints, instead
of a specific total number of replicas. It ignores services that aren't global
services and tasks that aren't associated with global services.

The global orchestrator responds to task events in much the same way that the
replicated orchestrator does. If a task fails, the global orchestrator will
indicate to the restart supervisor that a restart may be needed.

When a service is created, updated, or deleted, this triggers a reconciliation.
The orchestrator has to check whether each node meets the constraints for the
service, and create or update tasks on that node if it does. The tasks are
created with a specific node ID pre-filled. They pass through the scheduler so
that the scheduler can wait for the node to have sufficient resources before
moving the desired state to `Assigned`, but the scheduler does not make the
actual scheduling decision.

The global orchestrator also responds to node events. These trigger
reconciliations much like service events do. A new node might mean creating a
task for each global service on that node, and a deleted node means deleting any
global service tasks that were on that node. When a node gets drained, the
global orchestrator shuts down any global service tasks running on that node. It
also does this when a node goes down, which avoids stuck rolling updates that
would otherwise want to update the task on the unavailable node before
proceeding.

Like the replicated orchestrator, the global orchestrator uses the update
supervisor to implement rolling updates and automatic rollbacks. Instead of
passing tasks to the update supervisor by slot, it groups them by node. This
means rolling updates go node by node instead of slot by slot.

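The per-node pass might look roughly like the sketch below. `Node`, `Task`,
`meetsConstraints`, and `reconcileGlobal` are hypothetical simplified names, not
SwarmKit's real constraint or store machinery, but the shape matches the
description above: one task per eligible node, with the node ID fixed before the
task ever reaches the scheduler.

```go
package main

import "fmt"

// Hypothetical, simplified types; the real global orchestrator uses the
// store's node and task objects and SwarmKit's constraint evaluation.
type Node struct {
	ID        string
	Labels    map[string]string
	Available bool // false when the node is drained or down
}

type Task struct {
	ID        string
	ServiceID string
	NodeID    string // pre-filled for global services
}

// meetsConstraints is a stand-in for real constraint evaluation: here, plain
// label equality.
func meetsConstraints(n Node, constraints map[string]string) bool {
	for k, v := range constraints {
		if n.Labels[k] != v {
			return false
		}
	}
	return true
}

// reconcileGlobal sketches the per-node pass: one task per eligible node,
// created with the node ID already set, and shutdowns for nodes that are
// drained, down, or no longer matching.
func reconcileGlobal(serviceID string, constraints map[string]string, nodes []Node, tasks []Task) (create, shutdown []Task) {
	tasksByNode := map[string][]Task{}
	for _, t := range tasks {
		tasksByNode[t.NodeID] = append(tasksByNode[t.NodeID], t)
	}

	for _, n := range nodes {
		eligible := n.Available && meetsConstraints(n, constraints)
		existing := tasksByNode[n.ID]

		switch {
		case eligible && len(existing) == 0:
			// The scheduler still waits for resources before moving this task
			// to Assigned, but it never picks the node: the node ID is fixed here.
			create = append(create, Task{ServiceID: serviceID, NodeID: n.ID})
		case !eligible:
			// Drained, down, or no-longer-matching nodes get their tasks shut down.
			shutdown = append(shutdown, existing...)
		}
		// Eligible nodes that already have a task are left alone here; keeping
		// those tasks up to date is the update supervisor's job.
	}
	return create, shutdown
}

func main() {
	nodes := []Node{
		{ID: "node-1", Labels: map[string]string{"zone": "a"}, Available: true},
		{ID: "node-2", Labels: map[string]string{"zone": "a"}, Available: false}, // drained
	}
	tasks := []Task{{ID: "t1", ServiceID: "svc", NodeID: "node-2"}}
	create, shutdown := reconcileGlobal("svc", map[string]string{"zone": "a"}, nodes, tasks)
	fmt.Printf("create: %v\nshut down: %v\n", create, shutdown)
}
```
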
## Restart supervisor

The restart supervisor manages the process of shutting down a task, and
possibly starting a replacement task. Its entry point is a `Restart` method
which is called inside a store write transaction in one of the orchestrators.
It atomically changes the desired state of the old task to `Shutdown`, and, if
it's appropriate to start a replacement task based on the service's restart
policy, creates a new task in the same slot (replicated service) or on the same
node (global service).

If the service is set up with a restart delay, the restart supervisor handles
this delay too. It initially creates the new task with the desired state
`Ready`, and only changes the desired state to `Running` after the delay has
elapsed. One of the things the orchestrators do when they start up is check for
tasks that were in this delay phase of being restarted, and make sure they get
advanced to `Running`.

In some cases, a task can fail or be rejected before its desired state reaches
`Running`. One example is a failure to pull an image from a registry. The
restart supervisor tries to make sure this doesn't result in fast restart loops
that effectively ignore the restart delay. If `Restart` is called on a task that
the restart supervisor is still in the process of starting up - i.e. it hasn't
moved the task to `Running` yet - it will wait for the restart delay to elapse
before triggering this second restart.

The restart supervisor implements the logic that decides whether a task should
be restarted, and since this can depend on restart history (when `MaxAttempts`
is set), the restart supervisor keeps track of this history. The history isn't
persisted, so restart behavior may be slightly off after a leadership change or
cluster restart.

Note that a call to `Restart` doesn't always end up with the task being
restarted - this depends on the service's configuration. `Restart` can be
understood as "make sure this task gets shut down, and maybe start a replacement
if the service configuration says to".

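A rough sketch of this flow, using hypothetical simplified types rather than
SwarmKit's real store transaction and protobuf types (`Supervisor`, `Restart`,
and `promote` are invented names for illustration):

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical, simplified stand-ins for SwarmKit's task and policy types.
type TaskState int

const (
	StateReady TaskState = iota
	StateRunning
	StateShutdown
)

type Task struct {
	ID           string
	Slot         uint64
	DesiredState TaskState
}

type RestartPolicy struct {
	MaxAttempts uint64
	Delay       time.Duration
}

type Supervisor struct {
	history map[uint64]uint64 // restarts per slot; in-memory only, not persisted
}

// Restart sketches the flow described above: always shut the old task down,
// and only create a replacement if the restart policy still allows it. The
// replacement starts with desired state Ready; promotion to Running happens
// after the restart delay.
func (s *Supervisor) Restart(old *Task, policy RestartPolicy, promote func(*Task)) *Task {
	old.DesiredState = StateShutdown

	if policy.MaxAttempts > 0 && s.history[old.Slot] >= policy.MaxAttempts {
		return nil // restart history says we've hit the limit
	}
	s.history[old.Slot]++

	replacement := &Task{ID: old.ID + "-r", Slot: old.Slot, DesiredState: StateReady}
	time.AfterFunc(policy.Delay, func() { promote(replacement) })
	return replacement
}

func main() {
	s := &Supervisor{history: map[uint64]uint64{}}
	old := &Task{ID: "t1", Slot: 1, DesiredState: StateRunning}
	repl := s.Restart(old, RestartPolicy{MaxAttempts: 3, Delay: 100 * time.Millisecond}, func(t *Task) {
		t.DesiredState = StateRunning
		fmt.Println("promoted", t.ID, "to Running")
	})
	fmt.Println("old desired state:", old.DesiredState, "replacement:", repl.ID)
	time.Sleep(200 * time.Millisecond) // wait for the delayed promotion
}
```
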
## Update supervisor

The update supervisor is the component that updates existing tasks to match the
latest version of the service. This means shutting down the old task and
starting a new one to replace it. The update supervisor implements rolling
updates and automatic rollback.

The update supervisor operates on an abstract notion of slots, which are either
slot numbers for replicated services, or node IDs for global services. You can
think of it as reconciling the contents of each slot with the service. If a slot
has more than one task or fewer than one task, it corrects that. If the task (or
tasks) in a slot are out of date, they are replaced with a single task that's up
to date.

Every time the update supervisor is called to start an update of a service, it
spawns an `Updater` set up to work toward this goal. Each service can only have
one `Updater` at a time, so if the service already had a different update in
progress, that update is interrupted and replaced by the new one. The `Updater`
runs in its own goroutine, going through the slots and reconciling them with the
current service. It starts by checking which of the slots are dirty. If they are
all up to date and have a single task, it can finish immediately. Otherwise, it
starts as many worker goroutines as the update parallelism setting allows, and
lets them consume dirty slots from a channel.

The workers do the work of reconciling an individual slot with the service. If
there is a runnable task in the slot which is up to date, this may only involve
starting up the up-to-date task and shutting down the other tasks. Otherwise,
the worker will shut down all tasks in the slot and create a new one that's up
to date. It can either do this atomically, or start the new task before the old
one shuts down, depending on the update settings.

The updater watches task events to see whether any of the new tasks it created
fail while the update is still running. If enough fail, and the update is set up
to pause or roll back after a certain threshold of failures, the updater will
pause or roll back the update. Pausing involves setting `UpdateStatus.State` on
the service to "paused". This is recognized as a paused update by the updater,
and it won't try to update the service again until the flag gets cleared by
`controlapi` the next time a client updates the service. Rolling back involves
setting `UpdateStatus.State` to "rollback started", then copying `PreviousSpec`
into `Spec`, updating `SpecVersion` accordingly, and clearing `PreviousSpec`.
This triggers a reconciliation in the replicated or global orchestrator, which
ends up calling the update supervisor again to "update" the tasks to the
previous version of the service. Effectively, the updater just gets called again
in reverse. The updater knows when it's being used in a rollback scenario, based
on `UpdateStatus.State`, so it can choose the appropriate update parameters and
avoid rolling back a rollback, but other than that, the logic is the same
whether an update is moving forward or in reverse.

The updater waits the time interval given by `Monitor` after the update
completes. This allows it to notice problems after it's done updating tasks, and
take the actions that were requested for failure cases. For example, if a
service only has one task, has `Monitor` set to 5 seconds, and `FailureAction`
set to "rollback", the updater will wait 5 seconds after updating the task.
Then, if the new task fails within 5 seconds, the updater will be able to
trigger a rollback. Without waiting, the updater would finish immediately after
creating and starting the new task, and probably wouldn't be around to respond
by the time the task failed.

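The worker pattern described above amounts to a bounded worker pool consuming
dirty slots from a channel. The sketch below uses hypothetical simplified types
(`Slot`, `updateSlot`, `runUpdate` are invented for illustration) and leaves out
the event watching, failure thresholds, and pause/rollback handling that the
real `Updater` layers on top:

```go
package main

import (
	"fmt"
	"sync"
)

type Slot struct {
	ID    string // slot number for replicated services, node ID for global ones
	Dirty bool
}

// updateSlot stands in for shutting down out-of-date tasks in the slot and
// creating an up-to-date replacement (atomically or start-first, depending on
// the update settings).
func updateSlot(s Slot) error {
	fmt.Println("updating slot", s.ID)
	return nil
}

// runUpdate starts as many workers as the parallelism setting allows and feeds
// them the dirty slots over a channel, as described above.
func runUpdate(slots []Slot, parallelism int) {
	if parallelism < 1 {
		parallelism = 1
	}

	dirty := make(chan Slot)
	var wg sync.WaitGroup

	for i := 0; i < parallelism; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for s := range dirty {
				if err := updateSlot(s); err != nil {
					fmt.Println("slot", s.ID, "failed:", err)
				}
			}
		}()
	}

	for _, s := range slots {
		if s.Dirty {
			dirty <- s
		}
	}
	close(dirty)
	wg.Wait()
	// The real updater then waits for the Monitor interval before declaring
	// the update complete, so late task failures can still trigger a pause or
	// rollback.
}

func main() {
	slots := []Slot{{ID: "1", Dirty: true}, {ID: "2", Dirty: false}, {ID: "3", Dirty: true}}
	runUpdate(slots, 2)
}
```
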
## Task reaper

As discussed above, restarting a task involves shutting down the old task and
starting a new one. If restarts happen frequently, a lot of old tasks that
aren't actually running might accumulate.

The task reaper implements configurable garbage collection of these
no-longer-running tasks. The number of old tasks to keep per slot or node is
controlled by `Orchestration.TaskHistoryRetentionLimit` in the cluster's
`ClusterSpec`.

The task reaper watches for task creation events, and adds the slots or nodes
from these events to a watchlist. It periodically iterates over the watchlist
and deletes tasks from the referenced slots or nodes which exceed the retention
limit. It prefers to delete tasks with the oldest `Status` timestamps.

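A minimal sketch of processing one watchlist entry, with hypothetical simplified
types standing in for the real store objects (`Task` and `reapSlot` are invented
names for illustration):

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Hypothetical, simplified task record; the real reaper works on store tasks
// keyed by (service, slot) or (service, node).
type Task struct {
	ID              string
	Running         bool
	StatusTimestamp time.Time
}

// reapSlot sketches one watchlist entry being processed: tasks that are still
// meant to be running are kept, and of the rest, only the newest `retention`
// history entries survive, deleting the oldest Status timestamps first.
func reapSlot(tasks []Task, retention int) (deleted []Task) {
	var history []Task
	for _, t := range tasks {
		if !t.Running {
			history = append(history, t)
		}
	}
	if len(history) <= retention {
		return nil
	}
	sort.Slice(history, func(i, j int) bool {
		return history[i].StatusTimestamp.Before(history[j].StatusTimestamp)
	})
	return history[:len(history)-retention]
}

func main() {
	now := time.Now()
	tasks := []Task{
		{ID: "t1", Running: true, StatusTimestamp: now},
		{ID: "t2", Running: false, StatusTimestamp: now.Add(-2 * time.Hour)},
		{ID: "t3", Running: false, StatusTimestamp: now.Add(-1 * time.Hour)},
	}
	for _, t := range reapSlot(tasks, 1) { // keep one old task for this slot
		fmt.Println("deleting", t.ID)
	}
}
```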