volcano.sh/volcano@v1.9.0/docs/design/job-resource-reservation-design.md (about) 1 # Volcano Resource Reservation For Target Jobs 2 3 @[Thor-wl](https://github.com/Thor-wl); Aug 19th, 2020 4 5 ## Motivation 6 As [issue 13](https://github.com/volcano-sh/volcano/issues/13) / [issue 748](https://github.com/volcano-sh/volcano/issues/748) 7 / [issue 947](https://github.com/volcano-sh/volcano/issues/947) mentioned, current scheduler strategy may result in 8 jobs starvation. Consider two classical scenes: 9 * Suppose there is insufficient resource in cluster and both Job A and Job B are to be scheduled. Job A and Job B are in 10 equal priority while Job A request more resources. Under current schedule strategy, there is high probability that Job B 11 can be scheduled first while Job A will be pending for a long time. If more jobs requesting less resource comes later, 12 Job A will get a smaller chance to be scheduled. 13 * Suppose cluster resource is insufficient, Job A has higher priority and requests more resource while Job B has lower 14 priority but request less resource. As current schedule strategy works, volcano will schedule Job B first. What's worst, 15 Job A will keep waiting until enough resources are released by some low priority jobs. 16 17 ## Consideration 18 ### How to recognise target jobs? 19 There are two ways to pick out target jobs: 20 #### Request resources 21 Set standard lines on some conditions such as request resources. Jobs requesting more resources than standard line will 22 be regarded as target jobs. It may be a good way for specific scenarios such as ML training/big data/scientific computing, 23 etc. However, users need to be experienced with his/her job requirements. 24 #### Waiting time 25 Consider waiting time as target job judgement is another solution. Jobs waiting for longer time are more likely to be 26 target jobs, especially on condition that jobs are blocked because of starvation. Different from setting standard lines, 27 order jobs by waiting time is a good idea because it recognises target job automatically. 28 ### How to reserve resources for target jobs? 29 Following are the factors taking into consideration for resources reservation. 30 #### Resource amount 31 Absolutely, jobs requiring resources more than cluster total amount cannot be satisfied. When choose nodes which need to 32 reserve resources for target jobs, the total amount idle resources of the selected nodes should as closer as the requirement 33 because only in this way can we need the least amount resources for jobs to be finished in most scenes. 34 #### Selected nodes lock 35 Nodes which are chosen to reserve resources should be locked. That means these nodes cannot accept any other jobs until 36 target jobs are scheduled. 37 #### Selected nodes numbers 38 Another problem is how many nodes can be selected as Reservation Nodes. In essence, it's a problem to balance scheduling 39 performance and reservation requirement. 40 #### The biggest challenge: unpredictable completion time of running jobs in selected nodes 41 Uncertainty of completion time of running jobs in selected nodes makes it difficult to find the optimal solution for 42 meeting the requirement of target jobs. Though idle resources in selected nodes satisfied target jobs most, there's no 43 guarantee that the waiting time for extra resource taken in running jobs is the shortest. In some cases, it may be a 44 suboptimal solution. 45 ### How to balance priority and waiting time? 46 Priority is more important than waiting time. 47 * No matter how many resources high-priority jobs requests and how much time they have already waited for, they should be 48 scheduled first. 49 * When jobs are at same priority but waiting time differs, job which waits for the longest time should be scheduled first. 50 51 ## Design 52 ### Target Job Recognition 53 As volcano is a general platform, we tend to support both custom mode and automation mode to recognize target jobs. 54 #### Custom Mode 55 Users can set **request resource** or **waiting duration** as standard. Jobs which request resources more than settings or 56 wait longer than standard line will be treated as potential target job. Volcano will choose the target job which has the 57 highest priority and above the standard line most as the target job. Another strategy is to allow users specify target job 58 by set some specified annotations. 59 #### Automation Mode 60 If not config standard line, volcano will order jobs to be scheduled in session by priority and waiting time. The job 61 with the highest priority and waiting for the longest time will be selected as the target job. Volcano scheduler will 62 check if there is a target job selected in each session. Otherwise, volcano will select a target job according to the 63 strategy above. 64 ### Locked Nodes 65 As job consists of some tasks and each task corresponds to a pod, scheduler will select a series of nodes which can satisfy 66 these pods. These nodes will be locked and no pod can be scheduled to them until the target job is scheduled.There are 67 three schemes as follows: 68 #### Cluster Lock 69 In order to schedule target job as soon as possible, lock all nodes in cluster to reserve resource for it. This scheme is 70 suitable for task type with fast throughput. As to long-running task, scheduler performance will be severely degraded. 71 #### Multi-Node Lock 72 In order to balance scheduler performance and resource reservation, we can lock part of nodes as locked nodes. Sort all 73 nodes by idle resource amount and select N nodes whose idle resource are the most. The N can be a fixed number or percentage. 74 Make sure the potential available resource can satisfy the request of target job. Then check if the available resource meets 75 the target job's demand after exist tasks finishing in locked nodes releases resource every scheduling cycle. This scheme is 76 suitable for users who are very experienced with his/her usage scenarios and the job type is almost the same(the run-time 77 gap of tasks is not too large). 78 #### Single-Node Lock 79 Another way to lock N nodes is lock one node every scheduling cycle. The selected node has the most idle resource at that 80 cycle. The dynamic selection process can alleviate the stereotype caused by one-time selection, especially in scene that 81 task type is unpredictable and complex. 82 83  84 85 ## Implementation(v1.1.0) 86 Volcano v1.1.0 has implemented that recognise target job and reserve resource automatically. 87 ### Action 88 Add two new action: elect and reserve. Elect action aims to find the target job. Reserve action is responsible for select 89 locked nodes. 90 #### Elect 91 If no target job is elected, select one from jobs whose podgroup is `pending` and satisfy conditions in `TargetJob` 92 function registered in session object. 93 #### Reserve 94 If target job exists and is not ready, reserve nodes for it according algorithm implemented in `ReservedNodes` function 95 registered in session object. 96 ### Plugin 97 Add new Plugin reservation to implement algorithm detail about selecting target job and reserving nodes. `targetJobFn` 98 selects job whose priority is the highest and waits for the longest time. `reservedNodesFn` reserve node whose idle resource 99 is the max in every session. 100  101 ### Recommend practice 102 An example how to make use of this feature is to configure scheduler's configuration as follows: 103 ```yaml 104 actions: "enqueue, elect, allocate, backfill, reserve" 105 configurations: 106 - name: enqueue 107 arguments: 108 "overcommit-factor": 1.0 109 tiers: 110 - plugins: 111 - name: priority 112 - name: gang 113 - name: conformance 114 - name: reservation 115 - plugins: 116 - name: drf 117 - name: predicates 118 - name: proportion 119 - name: nodeorder 120 - name: binpack 121 args: 122 ``` 123 Note: 124 * `elect` must be configured between `enqueue` and `allocate` 125 * `reserve` must be behind `allocate` 126 * You'd better config `overcommit-factor` to `1.0` which is `1.2` by default for it may not select the most suitable 127 target job if not configure like this. 128 129 ### TODO 130 * support custom define percentage of cluster nodes as the upper limit of locked nodes number, which should be in the 131 form of pure decimal. Default value is 1.0. 132 * support custom define wait duration, whose default value is 0. 133 134 ## Implementation(v1.2.1) 135 Optimized from feature implementation in v1.1.0, volcano v1.2.1 will update as follows: 136 * New recommend action order: elect, enqueue, reserve, allocate. This order will lead to reserve nodes efficiently. 137 * Support new nodes lock algorithm: cluster lock. Namely, lock all nodes for target job at once. It may result in shorter 138 scheduling time for target job. Custom configuration is supported for users to choose which lock algorithm via scheduler 139 configuration file or plugin. 140 * create plugin to cover the logic about `inqueue` status judgement in `enqueue` action. It will be more flexible to set 141 `inqueue` condition. 142 * Support users specify target job by set specified annotation on job. If not specified, volcano will select one and set 143 annotation automatically. 144 * Lock nodes will be labelled instead of stored in cache. Label will be removed after target job is running. 145 * Select part nodes as lock nodes by default. Support users set `timeout` for max lock duration. If target job is not 146 allocated resource out of `timeout`, cluster lock will work. 147 * Put `nodes` as an argument for `Predicate` function 148 * Select target job from jobs whose podgroup is `pending` and `inqueue`.