
# Scheduler design

This document covers the design and implementation details of the swarmkit
scheduler.

## Overview

In the SwarmKit [task model](task_model.md), tasks start in the `New` state,
and advance to `Pending` once pre-scheduling activities like network allocation
are done. The scheduler becomes responsible for tasks once they reach the
`Pending` state. If the task can be scheduled, the scheduler schedules it
immediately (subject to batching), and advances the state to `Assigned`. If it
isn't possible to schedule the task immediately, for example, because no nodes
have sufficient resources, the task will stay in the `Pending` state until it
becomes possible to schedule it.

When the state of a task reaches `Assigned`, the dispatcher sends this task to
the assigned node to start the process of executing it.

Each task will only pass through the scheduler once. Once a task is assigned to
a node, this decision cannot be revisited. See the [task model](task_model.md)
for more details on task lifecycle.

## Global service tasks

Both replicated and global service tasks pass through the scheduler. For
replicated tasks, the scheduler needs to decide which node the task should run
on. For global service tasks, the job of the scheduler is considerably simpler,
because the global orchestrator creates these tasks with the `NodeID` field
already set. In this case, the scheduler only has to confirm that the node
satisfies all the constraints and other filters, and once it does, advance the
state to `Assigned`.

## Filters

The scheduler needs to run several checks against each candidate node to make
sure that node is suitable for running the task. At present, this includes the
following set of checks:

- Confirming the node is in the `Ready` state, as opposed to `Down` or
  `Disconnected`, and that its availability is `Active`, as opposed to `Pause`
  or `Drain`
- Confirming sufficient resource availability
- Checking that all necessary plugins are installed on the node
- Checking that user-specified constraints are satisfied
- Checking that the node has the correct OS and architecture
- Checking that host ports aren't used by an existing task on the node

These checks operate through a mechanism called `Pipeline`, which chains
together the filters that perform them.

Filters satisfy a simple interface. For efficiency, there is a `SetTask` method
that lets a task be loaded into the filter once and then checked against
several candidate nodes. The `SetTask` method can do all the processing that
depends only on the task and not on the node. This approach can save some
redundant computation and/or allocations. `Filter` also has a `Check` method
that tests the most-recently-loaded task against a candidate node, and an
`Explain` method that provides a human-readable explanation of what an
unsuccessful result from `Check` means. `Explain` is used to produce a message
inside the task that explains what is preventing it from being scheduled.
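
To make this concrete, here is a sketch of what such an interface can look
like. The exact types and signatures in the swarmkit source may differ; `Task`
and `NodeInfo` below are stand-ins for the scheduler's task and node
representations, and the `nodes` argument to `Explain` is assumed here to be
the number of nodes that failed the check.

```go
// Filter is a sketch of the interface described above; the concrete
// definition lives in the scheduler package and may differ in detail.
type Filter interface {
	// SetTask loads a task into the filter and reports whether the filter
	// applies to this task at all. Work that depends only on the task (and
	// not on any node) can be done once here.
	SetTask(t *Task) bool

	// Check tests the most recently loaded task against a candidate node.
	Check(n *NodeInfo) bool

	// Explain returns a human-readable reason for an unsuccessful Check,
	// used to annotate tasks that cannot currently be scheduled.
	Explain(nodes int) string
}

// Task and NodeInfo are placeholders for the scheduler's task and node
// representations.
type Task struct{}
type NodeInfo struct{}
```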

## Scheduling algorithm

The current scheduling algorithm works by building a tree of nodes which is
specific to the service, and attempting to equalize the total number of tasks
of this service below the branches of the tree at each level. This is done
subject to constraints, so a node that, for example, doesn't have enough
resources to accommodate more tasks, will end up with fewer than its peers.

By default, this tree has only one level, and contains all suitable nodes at
that level. When [placement preferences](topology.md) are specified, the tree
can be customized to equalize the number of tasks across specific sets of
nodes.

While the primary scheduling criterion is the number of tasks from the same
service on the node, the total number of tasks on the node is used as a
tiebreaker. The first priority is spreading tasks from each service over as many
nodes as possible, as evenly as possible, but when there's a choice between
suitable nodes for the next task, preference is given to the node with the
fewest total tasks. Note that this doesn't take into consideration things like
resource reservations and actual resource usage, so this is an area where there
may be a lot of room for future improvement.
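
As a rough sketch (not the swarmkit code itself), the comparison between two
candidate nodes can be thought of as follows; the type and field names are
illustrative.

```go
// candidateNode holds just the counters needed for the comparison described
// above; the real scheduler keeps considerably more state per node.
type candidateNode struct {
	tasksByService map[string]int // running tasks, keyed by service ID
	totalTasks     int            // all running tasks on the node
}

// betterNodeFor reports whether node a is a better choice than node b for
// placing the next task of the given service: fewer tasks from that service
// wins, and the total task count on the node breaks ties.
func betterNodeFor(serviceID string, a, b candidateNode) bool {
	if a.tasksByService[serviceID] != b.tasksByService[serviceID] {
		return a.tasksByService[serviceID] < b.tasksByService[serviceID]
	}
	return a.totalTasks < b.totalTasks
}
```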

## Batching

The most expensive part of scheduling is building the tree described above. This
is `O(# nodes)`. If there were `n` nodes and `t` tasks to be scheduled,
scheduling those tasks independently would have `O(n*t)` runtime. We want to do
better than this.

A key optimization is that many tasks are effectively identical for the
scheduler's purposes, being generated by the same service. For example, a
replicated service with 1000 replicas will cause 1000 tasks to be created, but
those tasks can be viewed as equivalent from the scheduler's perspective (until
they are assigned nodes).

If the scheduler can identify a group of identical tasks, it can build a single
tree to be shared between them, instead of building a separate tree for each
one. It does this using the combination of service ID and `SpecVersion`. If
some number of tasks have the same service ID and `SpecVersion`, they get
scheduled as a batch using a single tree.
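
A minimal sketch of this grouping, assuming a task representation that exposes
its service ID and spec version as strings (the real task type is a protobuf
message with richer fields):

```go
// batchKey identifies a group of tasks that the scheduler can treat as
// interchangeable: same service, same spec version.
type batchKey struct {
	serviceID   string
	specVersion string
}

// pendingTask is an illustrative stand-in for the scheduler's task
// representation.
type pendingTask struct {
	id          string
	serviceID   string
	specVersion string
}

// groupSchedulable buckets pending tasks so that one tree of candidate nodes
// can be built per group rather than per task.
func groupSchedulable(tasks []pendingTask) map[batchKey][]pendingTask {
	groups := make(map[batchKey][]pendingTask)
	for _, t := range tasks {
		k := batchKey{serviceID: t.serviceID, specVersion: t.specVersion}
		groups[k] = append(groups[k], t)
	}
	return groups
}
```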

A slight complication with this is that the scheduler receives tasks one by one,
over a watch channel. If it processed each task immediately, there would be no
opportunities to group tasks and avoid redundant work. To solve this problem,
the scheduler waits up to 50 ms after receiving a task, in hopes of receiving
another identical task. The total latency associated with this batching is
limited to one second.
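
The timing policy can be sketched roughly as follows, operating on task IDs
for simplicity. The channel, function names, and structure are illustrative;
the real scheduler interleaves this logic with its event loop.

```go
import "time"

// batchLoop sketches the batching window: after work arrives, wait up to
// 50 ms for more before committing a scheduling pass, but never let the first
// task in a batch wait longer than one second.
func batchLoop(in <-chan string, commit func([]string)) {
	const (
		debounce   = 50 * time.Millisecond
		maxLatency = time.Second
	)
	var (
		pending  []string
		quiet    <-chan time.Time // fires 50 ms after the latest arrival
		deadline <-chan time.Time // fires 1 s after the first arrival
	)
	flush := func() {
		commit(pending)
		pending, quiet, deadline = nil, nil, nil // nil channels block in select
	}
	for {
		select {
		case id, ok := <-in:
			if !ok {
				if len(pending) > 0 {
					flush()
				}
				return
			}
			if len(pending) == 0 {
				deadline = time.After(maxLatency)
			}
			pending = append(pending, id)
			quiet = time.After(debounce)
		case <-quiet:
			flush()
		case <-deadline:
			flush()
		}
	}
}
```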

## Building and using the tree

The tree starts out as a tree of max-heaps containing node objects. The primary
sort criterion for the heaps is the number of tasks from the service in
question running on the node. This provides easy access to the "worst"
candidate node (i.e. the one with the most tasks from that service).

As an example, consider the following situation with nodes `N1`, `N2`, and `N3`,
and services `S1` and `S2`:

| node | S1 tasks | S2 tasks | labels                  |
|------|----------|----------|-------------------------|
| `N1` |     1    |     1    | engine.labels.os=ubuntu |
| `N2` |     1    |     0    | engine.labels.os=ubuntu |
| `N3` |     0    |     1    | engine.labels.os=centos |

Suppose we want to scale up `S2` by adding one more task. If there are no
placement preferences, the tree of max-heaps we generate in the context of `S2`
only has a single heap, which looks like this:

```
               N1      <--- "worst" node choice for S2
              /  \
            N2    N3
```

Note that the above illustration shows a heap, not the tree that organizes the
heaps. The heap has `N1` at the root because `N1` is tied with `N3` for the
number of `S2` tasks, but has more tasks in total. This makes `N1` the
last-choice node to schedule an additional `S2` task.

If there are placement preferences, the tree of heaps can contain multiple
heaps. Here is an example with a preference to spread over `engine.labels.os`:

```
          [root]
            / \
    "ubuntu"   "centos"
    max heap:   max heap:
      N1          N3
      |
      N2
```

The scheduler iterates over the nodes, and checks whether each one meets the
constraints. If it does, the node is added to the appropriate heap in the
tree. There is a maximum size for each heap, determined by the number of tasks
being scheduled in the batch (since there is no outcome where more than `n`
nodes are needed to schedule `n` tasks). If that maximum size gets reached for
a certain heap, new nodes will displace the current "worst" node if they score
better.
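
One way to picture the capped heap is the sketch below, using Go's
`container/heap`. The type, fields, and method names are illustrative rather
than the swarmkit implementation.

```go
import "container/heap"

// heapEntry carries the two counters used for ordering; worse entries (more
// tasks from the service, then more tasks overall) sort toward the root.
type heapEntry struct {
	nodeID       string
	serviceTasks int
	totalTasks   int
}

func worse(a, b heapEntry) bool {
	if a.serviceTasks != b.serviceTasks {
		return a.serviceTasks > b.serviceTasks
	}
	return a.totalTasks > b.totalTasks
}

// cappedMaxHeap keeps at most `limit` candidates, with the worst one at the
// root so it can be displaced cheaply.
type cappedMaxHeap struct {
	entries []heapEntry
	limit   int
}

func (h *cappedMaxHeap) Len() int           { return len(h.entries) }
func (h *cappedMaxHeap) Less(i, j int) bool { return worse(h.entries[i], h.entries[j]) }
func (h *cappedMaxHeap) Swap(i, j int)      { h.entries[i], h.entries[j] = h.entries[j], h.entries[i] }
func (h *cappedMaxHeap) Push(x interface{}) { h.entries = append(h.entries, x.(heapEntry)) }
func (h *cappedMaxHeap) Pop() interface{} {
	last := h.entries[len(h.entries)-1]
	h.entries = h.entries[:len(h.entries)-1]
	return last
}

// add inserts a candidate node; once the heap is full, a new candidate only
// enters by displacing the current worst node at the root.
func (h *cappedMaxHeap) add(e heapEntry) {
	if h.Len() < h.limit {
		heap.Push(h, e)
		return
	}
	if worse(h.entries[0], e) {
		h.entries[0] = e
		heap.Fix(h, 0)
	}
}
```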

After this process of populating the heaps, they are converted in-place to
sorted lists, from minimum value (best node) to maximum value (worst node). The
resulting tree of sorted node lists can be used to schedule the group of tasks
by repeatedly choosing the branch with the fewest tasks from the service at
each level. Since the branches in the tree (and the leaves) are sorted by the
figure of merit, it is efficient to loop over these and "fill" them to the
level of the next node in the list. If there are still tasks left over after
doing a first pass, a round-robin approach is used to assign the tasks.
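
For a single sorted list of candidates, the two passes can be sketched like
this. It is a simplified illustration of the idea rather than the swarmkit
code; `counts[i]` is the number of tasks from the service already on node `i`,
in ascending order.

```go
// spreadTasks returns how many of n new tasks each node receives. The first
// pass fills the least-loaded nodes up to the level of the next node in the
// sorted list; anything left over is handed out round-robin.
func spreadTasks(counts []int, n int) []int {
	extra := make([]int, len(counts))
	if len(counts) == 0 {
		return extra
	}
	// First pass: bring nodes 0..i up to the level of node i+1 before
	// considering node i+1 itself.
	for i := 0; i+1 < len(counts) && n > 0; i++ {
		for j := 0; j <= i && n > 0; j++ {
			for counts[j]+extra[j] < counts[i+1] && n > 0 {
				extra[j]++
				n--
			}
		}
	}
	// Second pass: every node has reached the level of the most-loaded one,
	// so the remainder is distributed round-robin.
	for i := 0; n > 0; i = (i + 1) % len(counts) {
		extra[i]++
		n--
	}
	return extra
}
```

For example, with `counts = []int{0, 1, 3}` and five new tasks, this yields
`[3 2 0]`, leaving each node with three tasks from the service.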

## Local state

The scheduler tries to avoid querying the `MemoryStore`. Instead, it maintains
information on all nodes and tasks in formats that are well-optimized for its
purposes.

A map called `allTasks` contains all tasks relevant to the scheduler, indexed by
ID. In principle this is similar to calling `store.GetTask`, but is more
efficient. The map is kept up to date through events from the store.

A `nodeSet` struct wraps a map that contains information on each node, indexed
by the node ID. In addition to the `Node` structure itself, this includes some
calculated information that's useful to the scheduler, such as the total number
of tasks, the number of tasks by service, a tally of the available resources,
and the set of host ports that are taken on that node.
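
A rough sketch of the shape of this per-node state (the field names here are
illustrative, and the real structures carry more detail):

```go
// schedulerNodeInfo sketches the per-node bookkeeping described above.
type schedulerNodeInfo struct {
	totalTasks         int                 // every task assigned to the node
	tasksByService     map[string]int      // task counts keyed by service ID
	availableResources resources           // running tally of free resources
	usedHostPorts      map[uint32]struct{} // host ports already taken
}

type resources struct {
	nanoCPUs    int64
	memoryBytes int64
}

// nodeSet indexes this information by node ID; it lives alongside allTasks,
// which maps task IDs to the scheduler's copy of each task.
type nodeSet struct {
	nodes map[string]schedulerNodeInfo
}
```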

## Detecting faulty nodes

A possible problem with the original scheduler was that it might assign tasks to
a misbehaving node indefinitely. If a certain node is unable to successfully run
tasks, it will always look like the least loaded from the scheduler's
perspective, and be the favorite for task assignments. But this could result in
a failure loop in which tasks never get assigned to a node where they would
actually run successfully.

To handle this situation, the scheduler tracks failures of each service by node.
If a service fails several times on any given node within a certain time
interval, that node is marked as potentially faulty for the service. The sort
comparator that determines which nodes are best for scheduling the service
(normally the nodes with the fewest instances of that service) sorts any node
that has been marked potentially faulty as among the last possible choices for
scheduling that service.
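
A sketch of this bookkeeping is below; the failure threshold and time window
are assumptions for illustration, not the values swarmkit actually uses.

```go
import "time"

// faultKey identifies a (service, node) pair.
type faultKey struct {
	serviceID string
	nodeID    string
}

// faultTracker remembers recent failures per (service, node) pair.
type faultTracker struct {
	window    time.Duration // e.g. a few minutes (illustrative)
	threshold int           // e.g. a handful of failures (illustrative)
	failures  map[faultKey][]time.Time
}

func newFaultTracker(window time.Duration, threshold int) *faultTracker {
	return &faultTracker{
		window:    window,
		threshold: threshold,
		failures:  make(map[faultKey][]time.Time),
	}
}

// recordFailure notes that a task of the service failed on the node, dropping
// failures that have aged out of the window.
func (f *faultTracker) recordFailure(serviceID, nodeID string, now time.Time) {
	k := faultKey{serviceID, nodeID}
	kept := f.failures[k][:0]
	for _, ts := range f.failures[k] {
		if now.Sub(ts) < f.window {
			kept = append(kept, ts)
		}
	}
	f.failures[k] = append(kept, now)
}

// potentiallyFaulty reports whether the node should be sorted among the last
// choices for this service.
func (f *faultTracker) potentiallyFaulty(serviceID, nodeID string, now time.Time) bool {
	recent := 0
	for _, ts := range f.failures[faultKey{serviceID, nodeID}] {
		if now.Sub(ts) < f.window {
			recent++
		}
	}
	return recent >= f.threshold
}
```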