# Scheduler design

This document covers the design and implementation details of the swarmkit
scheduler.

## Overview

In the SwarmKit [task model](task_model.md), tasks start in the `New` state,
and advance to `Pending` once pre-scheduling activities like network allocation
are done. The scheduler becomes responsible for tasks once they reach the
`Pending` state. If the task can be scheduled, the scheduler schedules it
immediately (subject to batching), and advances the state to `Assigned`. If it
isn't possible to schedule the task immediately, for example because no node
has sufficient resources, the task stays in the `Pending` state until it
becomes possible to schedule it.

When the state of a task reaches `Assigned`, the dispatcher sends this task to
the assigned node to start the process of executing it.

Each task passes through the scheduler only once. Once a task is assigned to a
node, this decision cannot be revisited. See the [task model](task_model.md)
for more details on the task lifecycle.

## Global service tasks

Both replicated and global service tasks pass through the scheduler. For
replicated tasks, the scheduler needs to decide which node the task should run
on. For global service tasks, the job of the scheduler is considerably simpler,
because the global orchestrator creates these tasks with the `NodeID` field
already set. In this case, the scheduler only has to confirm that the node
satisfies all the constraints and other filters, and once it does, advance the
state to `Assigned`.

## Filters

The scheduler needs to run several checks against each candidate node to make
sure that node is suitable for running the task. At present, this includes the
following set of checks:

- Confirming that the node is in the `Ready` state (as opposed to `Down` or
  `Disconnected`) and that its availability is `Active` (as opposed to `Pause`
  or `Drain`)
- Confirming sufficient resource availability
- Checking that all necessary plugins are installed on the node
- Checking that user-specified constraints are satisfied
- Checking that the node has the correct OS and architecture
- Checking that host ports aren't used by an existing task on the node

These checks operate through a mechanism called `Pipeline`. `Pipeline` chains
together the filters that perform them.

Filters satisfy a simple interface. For simplicity, there is a `SetTask` method
that lets a task be loaded into the filter once and then checked against
several candidate nodes. The `SetTask` method can do all the processing that
depends only on the task and not on the node, which can save redundant
computation and allocations. `Filter` also has a `Check` method that tests the
most-recently-loaded task against a candidate node, and an `Explain` method
that provides a human-readable explanation of what an unsuccessful result from
`Check` means. `Explain` is used to produce a message inside the task that
explains what is preventing it from being scheduled.
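
To make the contract concrete, here is a small Go sketch of a filter interface
and a pipeline that chains filters, following the description above. The type
names (`Task`, `NodeInfo`), the boolean return of `SetTask`, and the pipeline
methods are illustrative assumptions, not the exact SwarmKit definitions.

```go
package scheduler

// Task and NodeInfo are placeholders standing in for SwarmKit's task and
// node-information types, so that this sketch is self-contained.
type Task struct{ ServiceID string }
type NodeInfo struct{ ID string }

// Filter is a sketch of the interface described above; the exact signatures
// in SwarmKit may differ.
type Filter interface {
	// SetTask loads a task and does all work that depends only on the
	// task, so it is not repeated for every candidate node. The boolean
	// return (an assumption in this sketch) signals whether the filter
	// applies to this task at all.
	SetTask(t *Task) bool
	// Check tests the most recently loaded task against a candidate node.
	Check(n *NodeInfo) bool
	// Explain returns a human-readable reason for an unsuccessful Check.
	Explain() string
}

// Pipeline chains filters: a node is suitable for the loaded task only if
// every applicable filter's Check passes.
type Pipeline struct {
	filters []Filter
	enabled []Filter // filters that apply to the current task
}

// SetTask loads the task into every filter and remembers which ones apply.
func (p *Pipeline) SetTask(t *Task) {
	p.enabled = p.enabled[:0]
	for _, f := range p.filters {
		if f.SetTask(t) {
			p.enabled = append(p.enabled, f)
		}
	}
}

// Process returns true if the node passes every applicable filter.
func (p *Pipeline) Process(n *NodeInfo) bool {
	for _, f := range p.enabled {
		if !f.Check(n) {
			return false
		}
	}
	return true
}
```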
## Scheduling algorithm

The current scheduling algorithm works by building a tree of nodes which is
specific to the service, and attempting to equalize the total number of tasks
of this service below the branches of the tree at each level. This is done
subject to constraints, so a node that, for example, doesn't have enough
resources to accommodate more tasks will end up with fewer than its peers.

By default, this tree has only one level, and contains all suitable nodes at
that level. When [placement preferences](topology.md) are specified, the tree
can be customized to equalize the number of tasks across specific sets of
nodes.

While the primary scheduling criterion is the number of tasks from the same
service on the node, the total number of tasks on the node is used as a
tiebreaker. The first priority is spreading tasks from each service over as
many nodes as possible, as evenly as possible, but when there's a choice
between suitable nodes for the next task, preference is given to the node with
the fewest total tasks. Note that this doesn't take into consideration things
like resource reservations and actual resource usage, so this is an area where
there may be a lot of room for future improvement.

## Batching

The most expensive part of scheduling is building the tree described above.
This is `O(# nodes)`. If there were `n` nodes and `t` tasks to be scheduled,
scheduling those tasks independently would have `O(n*t)` runtime. We want to do
better than this.

A key optimization is that many tasks are effectively identical for the
scheduler's purposes, because they were generated by the same service. For
example, a replicated service with 1000 replicas will cause 1000 tasks to be
created, but those tasks can be viewed as equivalent from the scheduler's
perspective (until they are assigned nodes).

If the scheduler can identify a group of identical tasks, it can build a single
tree to be shared between them, instead of building a separate tree for each
one. It does this using the combination of service ID and `SpecVersion`. If
some number of tasks have the same service ID and `SpecVersion`, they get
scheduled as a batch using a single tree.

A slight complication with this is that the scheduler receives tasks one by
one, over a watch channel. If it processed each task immediately, there would
be no opportunity to group tasks and avoid redundant work. To solve this
problem, the scheduler waits up to 50 ms after receiving a task, in the hope of
receiving another identical task. The total latency associated with this
batching is limited to one second.
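
The debounce described above can be pictured as a small select loop. The sketch
below is a minimal illustration of the pattern, assuming a channel of task IDs;
it is not SwarmKit's actual event loop.

```go
package scheduler

import "time"

// collectBatch drains task IDs from a watch-style channel, waiting up to 50ms
// after each task for another one to arrive, but never delaying the batch by
// more than one second overall.
func collectBatch(tasks <-chan string) []string {
	var batch []string

	// The first task starts the batch.
	first, ok := <-tasks
	if !ok {
		return nil
	}
	batch = append(batch, first)

	deadline := time.After(time.Second) // hard cap on total batching latency
	for {
		select {
		case t, ok := <-tasks:
			if !ok {
				return batch
			}
			batch = append(batch, t)
		case <-time.After(50 * time.Millisecond):
			// No new task arrived within the debounce window.
			return batch
		case <-deadline:
			// Never hold a batch open longer than one second.
			return batch
		}
	}
}
```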
## Building and using the tree

The tree starts out as a tree of max-heaps containing node objects. The primary
sort criterion for the heaps is the number of tasks from the service in
question running on the node. This provides easy access to the "worst"
candidate node (i.e. the one with the most tasks from that service).

As an example, consider the following situation with nodes `N1`, `N2`, and
`N3`, and services `S1` and `S2`:

| node | S1 tasks | S2 tasks | labels                  |
|------|----------|----------|-------------------------|
| `N1` | 1        | 1        | engine.labels.os=ubuntu |
| `N2` | 1        | 0        | engine.labels.os=ubuntu |
| `N3` | 0        | 1        | engine.labels.os=centos |

Suppose we want to scale up `S2` by adding one more task. If there are no
placement preferences, the tree of max-heaps we generate in the context of `S2`
only has a single heap, which looks like this:

```
     N1      <--- "worst" node choice for S2
    /  \
  N2    N3
```

Note that the above illustration shows a heap, not the tree that organizes the
heaps. The heap has `N1` at the root because `N1` ties `N3` for the number of
`S2` tasks, but has more tasks in total. This makes `N1` the last-choice node
for scheduling an additional `S2` task.

If there are placement preferences, the tree of heaps can contain multiple
heaps. Here is an example with a preference to spread over `engine.labels.os`:

```
                [root]
               /      \
       "ubuntu"        "centos"
      max heap:        max heap:
        node1            node3
          |
        node2
```

The scheduler iterates over the nodes, and checks whether each one meets the
constraints. If it does, it is added to the heap at the correct location in the
tree. There is a maximum size for each heap, determined by the number of tasks
being scheduled in the batch (since there is no outcome where more than `n`
nodes are needed to schedule `n` tasks). If that maximum size is reached for a
certain heap, new nodes displace the current "worst" node if they score better.

After this process of populating the heaps, they are converted in-place to
sorted lists, from minimum value (best node) to maximum value (worst node). The
resulting tree of sorted node lists can be used to schedule the group of tasks
by repeatedly choosing the branch with the fewest tasks from the service at
each level. Since the branches in the tree (and the leaves) are sorted by the
figure of merit, it is efficient to loop over these and "fill" them to the
level of the next node in the list. If there are still tasks left over after
doing a first pass, a round-robin approach is used to assign the tasks.

## Local state

The scheduler tries to avoid querying the `MemoryStore`. Instead, it maintains
information on all nodes and tasks in formats that are well-optimized for its
purposes.

A map called `allTasks` contains all tasks relevant to the scheduler, indexed
by ID. In principle this is similar to calling `store.GetTask`, but it is more
efficient. The map is kept up to date through events from the store.

A `nodeSet` struct wraps a map that contains information on each node, indexed
by node ID. In addition to the `Node` structure itself, this includes some
calculated information that's useful to the scheduler, such as the total number
of tasks, the number of tasks by service, a tally of the available resources,
and the set of host ports that are taken on that node.
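
As a rough picture of this per-node bookkeeping, the sketch below shows the
kind of derived data involved. The field names are illustrative assumptions;
the design only specifies what is tracked, not the exact Go definitions in
SwarmKit.

```go
package scheduler

// nodeInfo carries a node plus the derived data the scheduler needs when
// ranking candidates, so it never has to query the MemoryStore on the hot
// path. All field names here are illustrative.
type nodeInfo struct {
	// ID of the node (standing in for the full Node structure).
	ID string

	// Total number of tasks currently assigned to this node.
	totalTasks int

	// Number of tasks on this node, broken down by service ID.
	tasksByService map[string]int

	// Tally of resources still available on the node.
	availableMemory int64
	availableCPUs   int64

	// Host ports already taken by tasks on this node.
	usedHostPorts map[uint32]struct{}
}

// nodeSet indexes nodeInfo by node ID, mirroring the wrapper described above.
type nodeSet struct {
	nodes map[string]nodeInfo
}
```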
## Detecting faulty nodes

A possible problem with the original scheduler was that it might assign tasks
to a misbehaving node indefinitely. If a certain node is unable to run tasks
successfully, it will always look like the least loaded node from the
scheduler's perspective, and so it becomes the favorite for task assignments.
This could result in a failure loop where tasks never get assigned to a node
where they would actually run successfully.

To handle this situation, the scheduler tracks failures of each service by
node. If a service fails several times on any given node within a certain time
interval, that node is marked as potentially faulty for the service. The sort
comparator that determines which nodes are best for scheduling the service
(normally the nodes with the fewest instances of that service) sorts any node
that has been marked potentially faulty among the last possible choices for
scheduling that service.
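
Putting the ranking rules together, a comparator along these lines would prefer
non-faulty nodes first, then nodes running fewer tasks of the service, then
nodes with fewer tasks overall. This reuses the illustrative `nodeInfo` fields
from the sketch above; the `faulty` set and the function itself are assumptions
for illustration, not SwarmKit's exact comparator.

```go
package scheduler

// nodeLess reports whether node a is a better scheduling choice than node b
// for the given service, applying the criteria described in this document:
// non-faulty before faulty, fewest tasks of this service, then fewest tasks
// in total as a tiebreaker.
func nodeLess(a, b nodeInfo, serviceID string, faulty map[string]bool) bool {
	// Nodes marked potentially faulty for this service sort last.
	if faulty[a.ID] != faulty[b.ID] {
		return !faulty[a.ID]
	}
	// Primary criterion: fewest tasks from this service.
	if a.tasksByService[serviceID] != b.tasksByService[serviceID] {
		return a.tasksByService[serviceID] < b.tasksByService[serviceID]
	}
	// Tiebreaker: fewest tasks overall.
	return a.totalTasks < b.totalTasks
}
```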