# Topology aware scheduling

## Background

There is often interest in making the scheduler aware of factors such as
availability zones. This document specifies a generic way to customize scheduler
behavior based on labels attached to nodes.

## Approach

The scheduler consults a repeated field named `Preferences` under `Placement`
when it places tasks. These "placement preferences" are listed in decreasing
order of precedence, and have higher precedence than the default scheduler
logic.

These placement preferences are interpreted based on their types. The initially
supported "spread over" message tells the scheduler to spread tasks evenly
across the groups of nodes that share each distinct value of the referenced
node or engine label.

## Protobuf definitions

In the `Placement` message under `TaskSpec`, we define a repeated field called
`Preferences`.

```
repeated PlacementPreference preferences = 2;
```

`PlacementPreference` is a message that specifies how to act on a label.
The initially supported preference is "spread".

```
message SpreadOver {
    string spread_descriptor = 1; // label descriptor, such as engine.labels.az
    // TODO: support node information beyond engine and node labels

    // TODO: in the future, add a map that provides weights for weighted
    // spreading.
}

message PlacementPreference {
    oneof Preference {
        SpreadOver spread = 1;
    }
}
```

## Behavior

A simple use of this feature would be to spread tasks evenly between multiple
availability zones. The way to do this would be to create an engine label on
each node indicating its availability zone, and then create a
`PlacementPreference` with type `SpreadOver` which references the engine label.
The scheduler would prioritize balance between the availability zones, and if
it ever has a choice between multiple nodes in the preferred availability zone
(or a tie between AZs), it would choose the node based on its built-in logic.
As of Docker 1.13, this logic prefers to schedule a task on the node which has
the fewest tasks associated with the particular service.

A slightly more complicated use case involves hierarchical topology. Say there
are two datacenters, each with four rows, and each row with 20 racks. To
spread tasks evenly at each of these levels, there could be three `SpreadOver`
messages in `Preferences`. The first would spread over datacenters, the second
would spread over rows, and the third would spread over racks. This ensures that
the highest precedence goes to spreading tasks between datacenters, but after
that, tasks are evenly distributed between rows and then racks.
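To make the hierarchical example concrete, the following sketch shows the
resulting `Placement` contents in protobuf text format, based on the message
definitions above. The `node.labels.datacenter`, `node.labels.row`, and
`node.labels.rack` descriptors are illustrative; they assume the operator has
attached node labels with those names.

```
# Placement contents for the datacenter/row/rack example (protobuf text format).
# Preferences are listed in decreasing order of precedence.
preferences {
  spread {
    spread_descriptor: "node.labels.datacenter"
  }
}
preferences {
  spread {
    spread_descriptor: "node.labels.row"
  }
}
preferences {
  spread {
    spread_descriptor: "node.labels.rack"
  }
}
```

Because the list is ordered by precedence, the datacenter spread outranks the
row spread, which in turn outranks the rack spread.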
Nodes that are missing the label used by `SpreadOver` will still receive task
assignments. As a group, they will receive tasks in equal proportion to any of
the other groups identified by a specific label value. In a sense, a missing
label is the same as having the label with a null value attached to it. If the
service should only run on nodes with the label being used for the `SpreadOver`
preference, the preference should be combined with a constraint.

## Future enhancements

- In addition to SpreadOver, we could add a PackInto with opposite behavior. It
  would try to locate tasks on nodes that share the same label value as other
  tasks, subject to constraints. By combining multiple SpreadOver and PackInto
  preferences, it would be possible to do things like spread over datacenters
  and then pack into racks within those datacenters.

- Support weighted spreading, i.e. for situations where one datacenter has more
  servers than another. This could be done by adding a map to SpreadOver
  containing weights for each label value (see the sketch after this list).

- Support acting on items other than node labels and engine labels. For example,
  acting on node IDs to spread or pack over individual nodes, or on resource
  specifications to implement soft resource constraints.
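As a sketch of the weighted-spreading idea, `SpreadOver` could carry a map from
label value to relative weight, as already hinted at by the TODO in the message
definition above. The field name, field number, and value type below are
hypothetical and only meant to illustrate the shape of the change.

```
message SpreadOver {
    string spread_descriptor = 1; // label descriptor, such as engine.labels.az

    // Hypothetical extension: relative weight for each label value, so that a
    // larger datacenter can receive proportionally more tasks. The field name,
    // number, and value type are illustrative only.
    map<string, uint32> weights = 2;
}
```

With such a field, the scheduler would balance task counts in proportion to the
weights rather than strictly evenly.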