     1  # Task Launch Order Within Job
     2  
     3  @[hwdef](https://github.com/hwdef), @[Thor-wl](https://github.com/Thor-wl); July 19th, 2021
     4  
     5  ## Introduction
     6  
     7  This feature provides the ability to customize the order in which tasks are launched.
     8  
     9  ## Scope
    10  
    11  ### In Scope
    12  
    13  * Reasons why this feature is needed
    14  * Define the API
    15  
    16  ### Out of Scope
    17  
    18  * Start order of jobs
    19  * Task start sequence between multiple jobs
    20  * Task start sequences that depend on the completion state of other tasks
    21  
    22  ## Scenarios
    23  
    24  * MPI job. The worker pods must be started before the master pod can run. Otherwise, the master pod cannot set up the SSH tunnel and fails, which wastes resources. In this case, the MPI worker pods need to be in the running state before the master pod starts. Similarly, for a TensorFlow job, the TF master pod must be started before the TF worker pods.
    25  
    26  ## Scene Comparison
    27  
    28  | Example    | Number of tasks | Task dependencies   | Concurrent execution | Current solution | Disadvantages of the current solution |
    29  | ---------- | --------------- | ------------------- | -------------------- | ---------------- | ------------------------------------- |
    30  | MPI        | 2               | linear dependencies | yes                  | 1. Add an init container to the task<br/>2. Use user-written check scripts in the init container | 1. Increases the cost of use for users<br/>2. Still uses resources while waiting |
    31  | MATLAB     | 2               | linear dependencies | yes                  | 1. Add an init container to the task<br/>2. Use user-written check scripts in the init container | 1. Increases the cost of use for users<br/>2. Still uses resources while waiting |
    32  | TensorFlow | n>=2            | linear dependencies | yes                  | 1. Add an init container to the task<br/>2. Use user-written check scripts in the init container | 1. Increases the cost of use for users<br/>2. Still uses resources while waiting |
    33  
    34  ## Requirement
    35  
    36  Based on the scenarios listed above, the dependencies can be abstracted as:
    37  
    38  * Task-A depends on task-B, which means B must be started first and then A.
    39  * Triggering policy: in our cases there is only one trigger policy, which is the running state.
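
The MPI scenario can be sketched as a small graph ordering problem. The snippet below is only an illustration, not Volcano's implementation: it assumes the dependencies are available as a plain mapping from a task name to the names of the tasks it depends on (the `depends_on` mapping and the `launch_order` helper are hypothetical), and it uses Python's standard `graphlib` module (Python 3.9+) to derive a launch order in which every task follows the tasks it depends on.

```python
from graphlib import TopologicalSorter

# Hypothetical input: task name -> names of the tasks it depends on.
# "mpimaster depends on mpiworker" means mpiworker must be running first.
depends_on = {
    "mpiworker": [],
    "mpimaster": ["mpiworker"],
}


def launch_order(deps):
    """Return task names ordered so that every task follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


print(launch_order(depends_on))  # ['mpiworker', 'mpimaster']
```

A topological order only fixes the sequence. Waiting for the running state of the earlier task is a separate check that happens at launch time.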
    40  
    41  To simplify the end-user experience, we need to unify the way complicated jobs with many tasks are composed, instead of letting users handle the complexity themselves with init containers or other workflow tools. So we need a more advanced VCJob with the abilities described in the design that follows.
    42  
    43  ## Design
    44  
    45  ### Field
    46  
    47  Add a `dependsOn` field at `vcjob.spec.task.dependsOn`. It holds the names of the tasks that this task depends on.
    48  
    49  ### API
    50  vcjob example
    51  ```yaml
    52  apiVersion: batch.volcano.sh/v1alpha1
    53  kind: Job
    54  metadata:
    55    name: lm-mpi-job
    56  spec:
    57    ......
    58    tasks:
    59      - replicas: 2
    60        name: mpiworker
    61        template:
    62        ......
    63      - replicas: 1
    64        name: mpimaster
    65        dependsOn:
    66          name: [mpiworker]
    67        template:
    68        ......
    69  ```
    70  
    71  ```yaml
    72  apiVersion: batch.volcano.sh/v1alpha1
    73  kind: Job
    74  metadata:
    75    name: example-job
    76  spec:
    77    ......
    78    tasks:
    79      - replicas: 2
    80        name: task1
    81        template:
    82        ......
    83      - replicas: 1
    84        name: task2
    85        template:
    86        ......
    87      - replicas: 2
    88        name: task3
    89        dependsOn:
    90          name: [task1, task2]
    91          iteration: any
    92        template:
    93        ......
    94  ```
    95  
    96  ### Usage
    97  * Create a job that contains at least two tasks, and fill in the `vcjob.spec.task.dependsOn` field of a task with the name of the task it depends on.
    98  * If a task depends on multiple tasks, you need to fill in the `iteration` field. The value can be `any` or `all`: `any` means this task runs once any one of the dependency tasks reaches the running state; `all` means this task runs only after all of them reach the running state.
    99  * Get the task status to check that the tasks started in the correct order.
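
The `any` and `all` semantics described in the list can be sketched as a small predicate. This is a minimal illustration under stated assumptions, not the controller's code: the `DependsOn` dataclass and the `can_start` helper are hypothetical, `name` is modelled as a list of task names, the default of `any` for a missing `iteration` is an assumption, and a task is treated as "running" simply when its name appears in a set.

```python
from dataclasses import dataclass, field


@dataclass
class DependsOn:
    # Hypothetical Python mirror of the proposed `dependsOn` field.
    name: list = field(default_factory=list)  # names of the tasks depended on
    iteration: str = "any"                    # "any" or "all" (default is an assumption)


def can_start(depends_on, running):
    """Decide whether a task may start.

    depends_on: a DependsOn instance, or None for a task without dependencies
    running:    set of task names that have reached the running state
    """
    if depends_on is None or not depends_on.name:
        return True
    checks = [n in running for n in depends_on.name]
    if depends_on.iteration == "any":
        return any(checks)
    if depends_on.iteration == "all":
        return all(checks)
    raise ValueError(f"unknown iteration: {depends_on.iteration!r}")


task3_any = DependsOn(name=["task1", "task2"], iteration="any")
task3_all = DependsOn(name=["task1", "task2"], iteration="all")

print(can_start(task3_any, {"task1"}))           # True
print(can_start(task3_all, {"task1"}))           # False
print(can_start(task3_all, {"task1", "task2"}))  # True
```

With a single dependency, `any` and `all` give the same answer, which is why the `iteration` field only matters when several tasks are listed.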
   100  
   101  ### Implementation
   102  * create a new field in vcjob
   103  * create an admission check for this field
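
What an admission check for this field might verify can be sketched as follows. The specific checks (unknown task names, self-dependency, a missing `iteration` when several tasks are listed, dependency cycles) are assumptions about what such a check could cover, not a list taken from this design. The `validate_depends_on` helper and the dict shape of its input are hypothetical, with `name` modelled as a list of task names.

```python
from graphlib import CycleError, TopologicalSorter

VALID_ITERATIONS = {"any", "all"}


def validate_depends_on(tasks):
    """Return a list of error strings; an empty list means the spec passed.

    tasks: list of dicts, for example
           {"name": "task3", "dependsOn": {"name": ["task1"], "iteration": "any"}}
    """
    errors = []
    known = {t["name"] for t in tasks}
    graph = {}
    for t in tasks:
        task = t["name"]
        dep = t.get("dependsOn") or {}
        names = dep.get("name", [])
        # Self-references are reported separately, so keep them out of the graph.
        graph[task] = {n for n in names if n != task}
        for n in names:
            if n == task:
                errors.append(f"{task}: depends on itself")
            elif n not in known:
                errors.append(f"{task}: unknown task {n!r}")
        if len(names) > 1 and dep.get("iteration") not in VALID_ITERATIONS:
            errors.append(f"{task}: iteration must be 'any' or 'all'")
    try:
        # prepare() raises CycleError if the dependency graph contains a cycle.
        TopologicalSorter(graph).prepare()
    except CycleError:
        errors.append("dependsOn forms a cycle")
    return errors


job_tasks = [
    {"name": "task1"},
    {"name": "task2"},
    {"name": "task3", "dependsOn": {"name": ["task1", "task2"], "iteration": "any"}},
]
print(validate_depends_on(job_tasks))  # []
```

Returning all errors at once, rather than failing on the first one, lets a user fix a rejected job in a single pass.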
   104  
   105  ### Notice
   106  * This feature conflicts with gang scheduling, so you need to disable the gang plugin when using it.