volcano.sh/volcano@v1.9.0/docs/design/job-resource-reservation-design.md (about)

     1  # Volcano Resource Reservation For Target Jobs
     2  
     3  @[Thor-wl](https://github.com/Thor-wl); Aug 19th, 2020
     4  
     5  ## Motivation
     6  As [issue 13](https://github.com/volcano-sh/volcano/issues/13) / [issue 748](https://github.com/volcano-sh/volcano/issues/748) 
     7  / [issue 947](https://github.com/volcano-sh/volcano/issues/947) mentioned, current scheduler strategy may result in 
     8  jobs starvation. Consider two classical scenes:
     9  * Suppose there is insufficient resource in cluster and both Job A and Job B are to be scheduled. Job A and Job B are in 
    10  equal priority while Job A request more resources. Under current schedule strategy, there is high probability that Job B
    11  can be scheduled first while Job A will be pending for a long time. If more jobs requesting less resource comes later, 
    12  Job A will get a smaller chance to be scheduled.
    13  * Suppose cluster resource is insufficient, Job A has higher priority and requests more resource while Job B has lower
    14  priority but request less resource. As current schedule strategy works, volcano will schedule Job B first. What's worst,
    15  Job A will keep waiting until enough resources are released by some low priority jobs. 
    16  
    17  ## Consideration
    18  ### How to recognise target jobs?
    19  There are two ways to pick out target jobs:
    20  #### Request resources
    21  Set standard lines on some conditions such as request resources. Jobs requesting more resources than standard line will 
    22  be regarded as target jobs. It may be a good way for specific scenarios such as ML training/big data/scientific computing, 
    23  etc. However, users need to be experienced with his/her job requirements.
    24  #### Waiting time
    25  Consider waiting time as target job judgement is another solution. Jobs waiting for longer time are more likely to be
    26  target jobs, especially on condition that jobs are blocked because of starvation. Different from setting standard lines, 
    27  order jobs by waiting time is a good idea because it recognises target job automatically.
    28  ### How to reserve resources for target jobs?
    29  Following are the factors taking into consideration for resources reservation.
    30  #### Resource amount
    31  Absolutely, jobs requiring resources more than cluster total amount cannot be satisfied. When choose nodes which need to
    32  reserve resources for target jobs, the total amount idle resources of the selected nodes should as closer as the requirement
    33  because only in this way can we need the least amount resources for jobs to be finished in most scenes. 
    34  #### Selected nodes lock
    35  Nodes which are chosen to reserve resources should be locked. That means these nodes cannot accept any other jobs until 
    36  target jobs are scheduled.
    37  #### Selected nodes numbers
    38  Another problem is how many nodes can be selected as Reservation Nodes. In essence, it's a problem to balance scheduling
    39  performance and reservation requirement.
    40  #### The biggest challenge: unpredictable completion time of running jobs in selected nodes
    41  Uncertainty of completion time of running jobs in selected nodes makes it difficult to find the optimal solution for 
    42  meeting the requirement of target jobs. Though idle resources in selected nodes satisfied target jobs most, there's no 
    43  guarantee that the waiting time for extra resource taken in running jobs is the shortest. In some cases, it may be a 
    44  suboptimal solution.
    45  ### How to balance priority and waiting time?
    46  Priority is more important than waiting time.
    47  * No matter how many resources high-priority jobs requests and how much time they have already waited for, they should be
    48  scheduled first.
    49  * When jobs are at same priority but waiting time differs, job which waits for the longest time should be scheduled first.
    50  
    51  ## Design
    52  ### Target Job Recognition
    53  As volcano is a general platform, we tend to support both custom mode and automation mode to recognize target jobs.
    54  #### Custom Mode
    55  Users can set **request resource** or **waiting duration** as standard. Jobs which request resources more than settings or 
    56  wait longer than standard line will be treated as potential target job. Volcano will choose the target job which has the 
    57  highest priority and above the standard line most as the target job. Another strategy is to allow users specify target job
    58  by set some specified annotations.
    59  #### Automation Mode
    60  If not config standard line, volcano will order jobs to be scheduled in session by priority and waiting time. The job 
    61  with the highest priority and waiting for the longest time will be selected as the target job. Volcano scheduler will 
    62  check if there is a target job selected in each session. Otherwise, volcano will select a target job according to the 
    63  strategy above.
    64  ### Locked Nodes
    65  As job consists of some tasks and each task corresponds to a pod, scheduler will select a series of nodes which can satisfy
    66  these pods. These nodes will be locked and no pod can be scheduled to them until the target job is scheduled.There are 
    67  three schemes as follows:
    68  #### Cluster Lock
    69  In order to schedule target job as soon as possible, lock all nodes in cluster to reserve resource for it. This scheme is 
    70  suitable for task type with fast throughput. As to long-running task, scheduler performance will be severely degraded.
    71  #### Multi-Node Lock
    72  In order to balance scheduler performance and resource reservation, we can lock part of nodes as locked nodes. Sort all
    73  nodes by idle resource amount and select N nodes whose idle resource are the most. The N can be a fixed number or percentage.
    74  Make sure the potential available resource can satisfy the request of target job. Then check if the available resource meets
    75  the target job's demand after exist tasks finishing in locked nodes releases resource every scheduling cycle. This scheme is
    76  suitable for users who are very experienced with his/her usage scenarios and the job type is almost the same(the run-time
    77  gap of tasks is not too large).
    78  #### Single-Node Lock
    79  Another way to lock N nodes is lock one node every scheduling cycle. The selected node has the most idle resource at that
    80  cycle. The dynamic selection process can alleviate the stereotype caused by one-time selection, especially in scene that
    81  task type is unpredictable and complex.
    82  
    83  ![Feature Design](./images/reservation_design.png)
    84  
    85  ## Implementation(v1.1.0)
    86  Volcano v1.1.0 has implemented that recognise target job and reserve resource automatically.
    87  ### Action
    88  Add two new action: elect and reserve. Elect action aims to find the target job. Reserve action is responsible for select
    89  locked nodes.
    90  #### Elect
    91  If no target job is elected, select one from jobs whose podgroup is `pending` and satisfy conditions  in `TargetJob`
    92  function registered in session object.
    93  #### Reserve
    94  If target job exists and is not ready, reserve nodes for it according algorithm implemented in `ReservedNodes` function
    95  registered in session object.
    96  ### Plugin 
    97  Add new Plugin reservation to implement algorithm detail about selecting target job and reserving nodes. `targetJobFn`
    98  selects job whose priority is the highest and waits for the longest time. `reservedNodesFn` reserve node whose idle resource
    99  is the max in every session.
   100  ![Workflow](./images/reservation_workflow.png)
   101  ### Recommend practice
   102  An example how to make use of this feature is to configure scheduler's configuration as follows:
   103  ```yaml
   104  actions: "enqueue, elect, allocate, backfill, reserve"
   105  configurations:
   106    - name: enqueue
   107      arguments:
   108        "overcommit-factor": 1.0
   109  tiers:
   110    - plugins:
   111    - name: priority
   112    - name: gang
   113    - name: conformance
   114    - name: reservation
   115    - plugins:
   116    - name: drf
   117    - name: predicates
   118    - name: proportion
   119    - name: nodeorder
   120    - name: binpack
   121  args:
   122  ```
   123  Note:
   124  * `elect` must be configured between `enqueue` and `allocate` 
   125  * `reserve` must be behind `allocate`
   126  * You'd better config `overcommit-factor` to `1.0` which is `1.2` by default for it may not select the most suitable 
   127    target job if not configure like this.
   128  
   129  ### TODO
   130  * support custom define percentage of cluster nodes as the upper limit of locked nodes number, which should be in the 
   131  form of pure decimal. Default value is 1.0.
   132  * support custom define wait duration, whose default value is 0.
   133  
   134  ## Implementation(v1.2.1)
   135  Optimized from feature implementation in v1.1.0, volcano v1.2.1 will update as follows:
   136  * New recommend action order: elect, enqueue, reserve, allocate. This order will lead to reserve nodes efficiently.
   137  * Support new nodes lock algorithm: cluster lock. Namely, lock all nodes for target job at once. It may result in shorter 
   138    scheduling time for target job. Custom configuration is supported for users to choose which lock algorithm via scheduler
   139    configuration file or plugin.
   140  * create plugin to cover the logic about `inqueue` status judgement in `enqueue` action. It will be more flexible to set
   141    `inqueue` condition.
   142  * Support users specify target job by set specified annotation on job. If not specified, volcano will select one and set
   143    annotation automatically.
   144  * Lock nodes will be labelled instead of stored in cache. Label will be removed after target job is running.
   145  * Select part nodes as lock nodes by default. Support users set `timeout` for max lock duration. If target job is not
   146    allocated resource out of `timeout`, cluster lock will work.
   147  * Put `nodes` as an argument for `Predicate` function
   148  * Select target job from jobs whose podgroup is `pending` and `inqueue`.