# Volcano Resource Reservation For Target Jobs

@[Thor-wl](https://github.com/Thor-wl); Aug 19th, 2020

## Motivation
As [issue 13](https://github.com/volcano-sh/volcano/issues/13) / [issue 748](https://github.com/volcano-sh/volcano/issues/748)
/ [issue 947](https://github.com/volcano-sh/volcano/issues/947) mentioned, the current scheduling strategy may result in
job starvation. Consider two classic scenarios:
* Suppose cluster resources are insufficient and both Job A and Job B are waiting to be scheduled. Job A and Job B have
equal priority, but Job A requests more resources. Under the current scheduling strategy, Job B is likely to be
scheduled first while Job A stays pending for a long time. If more jobs requesting fewer resources arrive later,
Job A's chance of being scheduled shrinks further.
* Suppose cluster resources are insufficient, Job A has higher priority and requests more resources, while Job B has lower
priority but requests fewer resources. Under the current scheduling strategy, Volcano will schedule Job B first. Worse still,
Job A will keep waiting until low-priority jobs release enough resources.

## Consideration
### How to recognise target jobs?
There are two ways to pick out target jobs:
#### Request resources
Set a standard line (a threshold) on some condition such as requested resources. Jobs requesting more resources than the
standard line are regarded as target jobs. This may work well for specific scenarios such as ML training, big data, and
scientific computing. However, users need to understand their jobs' resource requirements well.
#### Waiting time
Using waiting time as the criterion for target jobs is another solution. Jobs that have waited longer are more likely to be
target jobs, especially when they are blocked because of starvation. Unlike setting standard lines,
ordering jobs by waiting time recognises target jobs automatically, with no user-supplied threshold.
### How to reserve resources for target jobs?
The following factors are taken into consideration for resource reservation.
#### Resource amount
Jobs requiring more resources than the cluster's total amount can never be satisfied. When choosing the nodes that
reserve resources for target jobs, the total idle resources of the selected nodes should be as close as possible to the
requirement, because in most scenarios this minimises the resources that running jobs must still release before the target job can start.
#### Selected nodes lock
Nodes chosen to reserve resources should be locked. That means these nodes cannot accept any other jobs until
the target jobs are scheduled.
#### Selected nodes numbers
Another problem is how many nodes can be selected as reservation nodes. In essence, this is a trade-off between scheduling
performance and the reservation requirement.
#### The biggest challenge: unpredictable completion time of running jobs in selected nodes
The uncertain completion time of jobs running on the selected nodes makes it difficult to find the optimal solution for
meeting the requirement of target jobs. Even if the idle resources on the selected nodes come closest to satisfying the target
job, there is no guarantee that the wait for the remaining resources held by running jobs is the shortest. In some cases,
the selection may be suboptimal.
### How to balance priority and waiting time?
Priority is more important than waiting time.
* No matter how many resources high-priority jobs request and how long they have already waited, they should be
scheduled first.
* When jobs have the same priority but different waiting times, the job that has waited longest should be scheduled first.
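
The two rules above amount to a two-level sort key. A minimal sketch, assuming a hypothetical `PendingJob` record with an integer priority (larger means more important) and a waiting time in seconds; neither name comes from the Volcano code base:

```python
from dataclasses import dataclass


@dataclass
class PendingJob:
    """Hypothetical stand-in for a job waiting to be scheduled."""
    name: str
    priority: int           # larger value = more important (assumption)
    waiting_seconds: float  # time spent pending so far


def schedule_order(jobs):
    """Order jobs by priority first, then by waiting time (longest first)."""
    return sorted(jobs, key=lambda j: (-j.priority, -j.waiting_seconds))
```

Because priority is the first element of the key, a long wait can never lift a low-priority job above a high-priority one; waiting time only breaks ties within a priority level.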

## Design
### Target Job Recognition
As Volcano is a general platform, we intend to support both custom mode and automation mode to recognize target jobs.
#### Custom Mode
Users can set **request resource** or **waiting duration** as the standard line. Jobs that request more resources than the
setting or wait longer than the standard line are treated as potential target jobs. Among them, Volcano chooses the job that
has the highest priority and exceeds the standard line by the most as the target job. Another strategy is to allow users to
specify the target job by setting specific annotations.
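
One possible reading of custom mode is sketched below. The `Job` record, the single CPU dimension, and the use of a ratio to measure how far a job exceeds the standard line are all assumptions made for illustration; the design does not fix how "exceeds by the most" is measured.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical job record; only CPU is modelled for brevity."""
    name: str
    priority: int
    requested_cpu: int      # millicores
    waiting_seconds: float


def pick_target_custom(jobs, min_cpu=None, min_wait=None):
    """Return the potential target job, or None if no job exceeds a standard line.

    A threshold of None (or 0) disables that check.
    """
    def excess(job):
        # Ratio by which the job exceeds each enabled standard line (assumed metric).
        ratios = []
        if min_cpu and job.requested_cpu > min_cpu:
            ratios.append(job.requested_cpu / min_cpu)
        if min_wait and job.waiting_seconds > min_wait:
            ratios.append(job.waiting_seconds / min_wait)
        return max(ratios, default=0.0)

    candidates = [j for j in jobs if excess(j) > 0]
    if not candidates:
        return None
    # Highest priority wins; the excess over the standard line breaks ties.
    return max(candidates, key=lambda j: (j.priority, excess(j)))
```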
#### Automation Mode
If no standard line is configured, Volcano orders the jobs to be scheduled in the session by priority and waiting time. The job
with the highest priority and the longest waiting time is selected as the target job. In each session, the Volcano scheduler
checks whether a target job has already been selected. If not, Volcano selects a target job according to the
strategy above.
### Locked Nodes
As a job consists of tasks and each task corresponds to a pod, the scheduler selects a series of nodes that can satisfy
these pods. These nodes are locked and no other pod can be scheduled to them until the target job is scheduled. There are
three schemes as follows:
#### Cluster Lock
To schedule the target job as soon as possible, lock all nodes in the cluster to reserve resources for it. This scheme
suits workloads of short, fast-finishing tasks. With long-running tasks, scheduler performance will be severely degraded.
#### Multi-Node Lock
To balance scheduler performance and resource reservation, we can lock only part of the nodes. Sort all
nodes by idle resource amount and select the N nodes with the most idle resources. N can be a fixed number or a percentage.
Make sure the potentially available resources can satisfy the request of the target job. Then, in every scheduling cycle, check
whether the resources released by tasks finishing on the locked nodes are enough to meet the target job's demand. This scheme
suits users who know their usage scenarios well and whose jobs are mostly of the same type (the run-time
gap between tasks is not too large).
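
The multi-node scheme can be sketched as follows. The `Node` record, the single CPU dimension, and the function names are hypothetical, and "potentially available" is interpreted here as the total allocatable capacity of the locked nodes, which is an assumption:

```python
import math
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical node record; only CPU is modelled for brevity."""
    name: str
    allocatable: int  # total schedulable CPU on the node
    idle: int         # CPU not used by running tasks right now


def select_locked_nodes(nodes, n=None, fraction=None):
    """Pick the N nodes with the most idle resource; N is a count or a fraction."""
    if (n is None) == (fraction is None):
        raise ValueError("give exactly one of n or fraction")
    if fraction is not None:
        n = math.ceil(len(nodes) * fraction)
    ranked = sorted(nodes, key=lambda nd: nd.idle, reverse=True)
    return ranked[:n]


def can_eventually_fit(locked, request):
    """Potentially available resource: what the locked nodes offer once drained."""
    return sum(nd.allocatable for nd in locked) >= request


def fits_now(locked, request):
    """Per-cycle check: is the currently idle resource on locked nodes enough?"""
    return sum(nd.idle for nd in locked) >= request
```

A scheduler loop would call `select_locked_nodes` once, reject the selection if `can_eventually_fit` is false, and then re-evaluate `fits_now` every cycle as running tasks finish.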
#### Single-Node Lock
Another way to lock N nodes is to lock one node every scheduling cycle. The selected node is the one with the most idle
resources in that cycle. This dynamic selection alleviates the rigidity of a one-time selection, especially in scenarios where
task types are unpredictable and complex.
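
A minimal sketch of the single-node scheme, assuming idle resources are tracked in a plain mapping from node name to idle amount and the locked set persists across cycles; `lock_next_node` is a hypothetical name:

```python
def lock_next_node(idle_by_node, locked):
    """Lock the not-yet-locked node with the most idle resource in this cycle.

    idle_by_node maps node name -> idle amount as observed in the current cycle.
    locked is the set of already locked node names and is updated in place.
    Returns the newly locked node name, or None when every node is locked.
    """
    unlocked = {name: idle for name, idle in idle_by_node.items() if name not in locked}
    if not unlocked:
        return None
    best = max(unlocked, key=unlocked.get)
    locked.add(best)
    return best
```

Because the idle amounts are re-read every cycle, a node that was busy during the first cycle can still be chosen later once its tasks finish.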

![Feature Design](./images/reservation_design.png)

## Implementation(v1.1.0)
Volcano v1.1.0 implements automatic target job recognition and resource reservation.
### Action
Add two new actions: elect and reserve. The elect action finds the target job. The reserve action selects the
locked nodes.
#### Elect
If no target job is elected, select one from the jobs whose podgroup is `pending` and that satisfy the conditions in the
`TargetJob` function registered in the session object.
#### Reserve
If a target job exists and is not ready, reserve nodes for it according to the algorithm implemented in the `ReservedNodes`
function registered in the session object.
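
The interplay between the two actions and the functions registered in the session can be sketched as below. This is an illustrative model only: the `Session` and `Job` classes, their fields, and the callback signatures are hypothetical and do not mirror the Go types in the scheduler.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Job:
    """Hypothetical job record."""
    name: str
    phase: str           # podgroup phase, e.g. "Pending" or "Running"
    ready: bool = False  # True once the job has the resources it needs


@dataclass
class Session:
    """Hypothetical session holding the functions a plugin registers."""
    jobs: list
    idle_by_node: dict
    target_job_fn: Callable      # picks the target job from candidate jobs
    reserved_nodes_fn: Callable  # returns node names to lock in this session
    target: Optional[Job] = None
    locked: set = field(default_factory=set)


def elect(ssn):
    """Elect action: choose a target job only if none is elected yet."""
    if ssn.target is not None:
        return
    pending = [j for j in ssn.jobs if j.phase == "Pending"]
    if pending:
        ssn.target = ssn.target_job_fn(pending)


def reserve(ssn):
    """Reserve action: lock nodes while a target job exists and is not ready."""
    if ssn.target is None or ssn.target.ready:
        return
    ssn.locked.update(ssn.reserved_nodes_fn(ssn))
```

Keeping the selection policy behind the two callbacks lets the actions stay fixed while a plugin swaps in a different election or locking strategy.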
### Plugin
Add a new plugin, `reservation`, to implement the algorithm details for selecting the target job and reserving nodes.
`targetJobFn` selects the job that has the highest priority and has waited the longest. `reservedNodesFn` reserves the node
with the most idle resources in every session.

![Workflow](./images/reservation_workflow.png)
### Recommended practice
To use this feature, configure the scheduler as follows:
```yaml
actions: "enqueue, elect, allocate, backfill, reserve"
configurations:
- name: enqueue
  arguments:
    "overcommit-factor": 1.0
tiers:
- plugins:
  - name: priority
  - name: gang
  - name: conformance
  - name: reservation
- plugins:
  - name: drf
  - name: predicates
  - name: proportion
  - name: nodeorder
  - name: binpack
args:
```
Note:
* `elect` must be configured between `enqueue` and `allocate`.
* `reserve` must come after `allocate`.
* It is recommended to set `overcommit-factor` to `1.0` (the default is `1.2`); otherwise the scheduler may not select the
most suitable target job.
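
The two ordering rules in the note can be checked mechanically before a configuration is deployed. The helper below is a hypothetical sketch, not part of Volcano; it takes the value of the `actions` field as a string and encodes only the v1.1.0 rules stated above:

```python
def check_v110_action_order(actions):
    """Return a list of problems with the action order; empty means it is fine."""
    names = [a.strip() for a in actions.split(",") if a.strip()]
    pos = {name: i for i, name in enumerate(names)}
    problems = []
    for required in ("enqueue", "elect", "allocate", "reserve"):
        if required not in pos:
            problems.append("missing action: " + required)
    if problems:
        return problems
    if not pos["enqueue"] < pos["elect"] < pos["allocate"]:
        problems.append("elect must be between enqueue and allocate")
    if not pos["allocate"] < pos["reserve"]:
        problems.append("reserve must come after allocate")
    return problems
```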

### TODO
* Support a user-defined percentage of cluster nodes as the upper limit on the number of locked nodes, expressed as a
decimal fraction. The default value is 1.0.
* Support a user-defined wait duration, whose default value is 0.

## Implementation(v1.2.1)
Building on the v1.1.0 implementation, Volcano v1.2.1 will make the following updates:
* New recommended action order: elect, enqueue, reserve, allocate. This order reserves nodes more efficiently.
* Support a new node lock algorithm: cluster lock, namely locking all nodes for the target job at once. It may result in a
shorter scheduling time for the target job. Users can choose the lock algorithm via the scheduler
configuration file or plugin.
* Create a plugin to cover the `inqueue` status judgement logic in the `enqueue` action, making the
`inqueue` condition more flexible to set.
* Support users specifying the target job by setting a specific annotation on the job. If none is specified, Volcano will
select one and set the annotation automatically.
* Locked nodes will be labelled instead of stored in cache. The label will be removed after the target job is running.
* Select part of the nodes as locked nodes by default. Support users setting a `timeout` for the max lock duration. If the
target job has not been allocated resources when the `timeout` expires, cluster lock takes effect.
* Pass `nodes` as an argument to the `Predicate` function.
* Select the target job from jobs whose podgroup is `pending` and `inqueue`.
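
The planned fallback from a partial lock to a cluster lock can be sketched as follows. The function name, the parameters, and the choice to treat the timeout as reached when the elapsed lock time is greater than or equal to it are assumptions for illustration:

```python
def nodes_to_lock(idle_by_node, partial_count, locked_for, timeout=None):
    """Return the set of node names to lock in this cycle.

    idle_by_node:  node name -> idle resource amount
    partial_count: how many nodes the default partial lock covers
    locked_for:    seconds the target job has waited under the partial lock
    timeout:       max lock duration in seconds; None disables the fallback
    """
    if timeout is not None and locked_for >= timeout:
        # Timeout expired without an allocation: fall back to cluster lock.
        return set(idle_by_node)
    ranked = sorted(idle_by_node, key=idle_by_node.get, reverse=True)
    return set(ranked[:partial_count])
```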