# Topology aware scheduling

## Background

There is often interest in making the scheduler aware of factors such as
availability zones. This document specifies a generic way to customize scheduler
behavior based on labels attached to nodes.

## Approach

The scheduler consults a repeated field named `Preferences` under `Placement`
when it places tasks. These "placement preferences" are listed in decreasing
order of precedence, and have higher precedence than the default scheduler
logic.

These placement preferences are interpreted based on their types. The initially
supported "spread over" message tells the scheduler to spread tasks evenly
across the groups of nodes that share each distinct value of the referenced
node or engine label.

## Protobuf definitions

In the `Placement` message under `TaskSpec`, we define a repeated field called
`Preferences`.

```
repeated PlacementPreference preferences = 2;
```

`PlacementPreference` is a message that specifies how to act on a label.
The initially supported preference is "spread".

```
message SpreadOver {
    string spread_descriptor = 1; // label descriptor, such as engine.labels.az
    // TODO: support node information beyond engine and node labels

    // TODO: in the future, add a map that provides weights for weighted
    // spreading.
}

message PlacementPreference {
    oneof Preference {
        SpreadOver spread = 1;
    }
}
```

## Behavior

A simple use of this feature would be to spread tasks evenly between multiple
availability zones. The way to do this would be to create an engine label on
each node indicating its availability zone, and then create a
`PlacementPreference` with type `SpreadOver` which references the engine label.
The scheduler would prioritize balance between the availability zones, and if
it ever has a choice between multiple nodes in the preferred availability zone
(or a tie between AZs), it would choose the node based on its built-in logic.
As of Docker 1.13, this logic prefers to schedule a task on the node which has
the fewest tasks associated with the particular service.

A slightly more complicated use case involves hierarchical topology. Say there
are two datacenters, each with four rows, and each row with 20 racks. To
spread tasks evenly at each of these levels, there could be three `SpreadOver`
messages in `Preferences`. The first would spread over datacenters, the second
would spread over rows, and the third would spread over racks. This ensures that
the highest precedence goes to spreading tasks between datacenters, but after
that, tasks are evenly distributed between rows and then racks.
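To make the hierarchical example concrete, the following sketch shows the
resulting `Placement` contents in protobuf text format, based on the message
definitions above. The `node.labels.datacenter`, `node.labels.row`, and
`node.labels.rack` descriptors are illustrative; they assume the operator has
attached node labels with those names.

```
# Placement contents for the datacenter/row/rack example (protobuf text format).
# Preferences are listed in decreasing order of precedence.
preferences {
  spread {
    spread_descriptor: "node.labels.datacenter"
  }
}
preferences {
  spread {
    spread_descriptor: "node.labels.row"
  }
}
preferences {
  spread {
    spread_descriptor: "node.labels.rack"
  }
}
```

Because the list is ordered by precedence, the datacenter spread outranks the
row spread, which in turn outranks the rack spread.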
Nodes that are missing the label used by `SpreadOver` will still receive task
assignments. As a group, they will receive tasks in equal proportion to any of
the other groups identified by a specific label value. In a sense, a missing
label is the same as having the label with a null value attached to it. If the
service should only run on nodes with the label being used for the `SpreadOver`
preference, the preference should be combined with a constraint.

## Future enhancements

- In addition to SpreadOver, we could add a PackInto with opposite behavior. It
  would try to locate tasks on nodes that share the same label value as other
  tasks, subject to constraints. By combining multiple SpreadOver and PackInto
  preferences, it would be possible to do things like spread over datacenters
  and then pack into racks within those datacenters.

- Support weighted spreading, i.e. for situations where one datacenter has more
  servers than another. This could be done by adding a map to SpreadOver
  containing weights for each label value (see the sketch after this list).

- Support acting on items other than node labels and engine labels. For example,
  acting on node IDs to spread or pack over individual nodes, or on resource
  specifications to implement soft resource constraints.
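As a sketch of the weighted-spreading idea, `SpreadOver` could carry a map from
label value to relative weight, as already hinted at by the TODO in the message
definition above. The field name, field number, and value type below are
hypothetical and only meant to illustrate the shape of the change.

```
message SpreadOver {
    string spread_descriptor = 1; // label descriptor, such as engine.labels.az

    // Hypothetical extension: relative weight for each label value, so that a
    // larger datacenter can receive proportionally more tasks. The field name,
    // number, and value type are illustrative only.
    map<string, uint32> weights = 2;
}
```

With such a field, the scheduler would balance task counts in proportion to the
weights rather than strictly evenly.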