# Task Launch Order Within Job

@[hwdef](https://github.com/hwdef), @[Thor-wl](https://github.com/Thor-wl); July 19th, 2021

## Introduction

This feature provides the ability to customize the order in which the tasks of a job are launched.

## Scope

### In Scope

* Reasons why this feature is needed
* Define the API

### Out of Scope

* Start order of jobs
* Task start sequence between multiple jobs
* Dependency completion state of the task start sequence

## Scenarios

* MPI job. The worker pods need to be started first, and then the master pod can run. Otherwise, the master pod can't set up the SSH tunnel and therefore fails, which wastes resources unnecessarily. In this case, the MPI worker pods need to be in the running state before the master pod can start. Similarly, for a TensorFlow job, the TF master pod must be started first and then the TF worker pods.

## Scene Comparison

| Example | Number of tasks | Task dependencies | Concurrent execution | Current solution | Disadvantages of the current solution |
| ---------- | -------- | ------------ | -------- | -------------------------------------------------- | -------------------------- |
| MPI | 2 | linear dependencies | yes | 1. Add an init container to the task<br/>2. Use user-written check scripts in the init container | 1. Increases the cost of use for users<br/>2. Still uses resources while waiting |
| MATLAB | 2 | linear dependencies | yes | 1. Add an init container to the task<br/>2. Use user-written check scripts in the init container | 1. Increases the cost of use for users<br/>2. Still uses resources while waiting |
| TensorFlow | n>=2 | linear dependencies | yes | 1. Add an init container to the task<br/>2. Use user-written check scripts in the init container | 1. Increases the cost of use for users<br/>2. Still uses resources while waiting |

## Requirement

Based on the scenarios listed above, the dependencies can be abstracted as:

* Task A depends on task B, which means B must be started first and then A.
* Triggering policy: in our cases, there may be only one trigger policy, which is the running state.

To ease the end-user experience, we need to unify the way complicated jobs with many tasks are composed, instead of letting users handle the complexity themselves with init containers or other workflow tools. So we need a more advanced VCJob that has the abilities described below.

## Design

### Field

Add a field `vcjob.spec.task.dependsOn`. It holds the names of the tasks that this task depends on.

### API

vcjob example:

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: lm-mpi-job
spec:
  ......
  tasks:
    - replicas: 2
      name: mpiworker
      template:
        ......
    - replicas: 1
      name: mpimaster
      dependsOn:
        name: mpiworker
      template:
        ......
```

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: example-job
spec:
  ......
  tasks:
    - replicas: 2
      name: task1
      template:
        ......
    - replicas: 1
      name: task2
      template:
        ......
    - replicas: 2
      name: task3
      dependsOn:
        name: task1, task2
        iteration: any
      template:
        ......
```

### Usage
* Create a job that contains at least two tasks and fill in a task name in the `vcjob.spec.task.dependsOn` field. This name indicates the task that this task depends on.
* If a task depends on multiple tasks, you need to fill in the `iteration` field. The value can be `any` or `all`: `any` means this task runs as soon as one of the listed tasks reaches the running state; `all` means every listed task must reach the running state before this task runs.
* Get the task status to check that the tasks started in the correct order.

### Implementation
* Create a new field in vcjob.
* Create admission for this field.

### Notice
* This feature conflicts with gang scheduling, so you need to disable the gang plugin when using it.
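
The `any`/`all` trigger semantics from the Usage section, together with the kind of check an admission step might perform, can be sketched in a few lines of Python. This is an illustration only, not Volcano's implementation: the `Task` and `DependsOn` classes, the `validate` and `can_start` helpers, the list form of the dependency names, and the default of `all` when `iteration` is omitted are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class DependsOn:
    # Names of the tasks this task waits for (hypothetical list form).
    names: List[str]
    # "any" or "all"; assumed to default to "all" when omitted.
    iteration: str = "all"


@dataclass
class Task:
    name: str
    depends_on: Optional[DependsOn] = None


def validate(tasks: List[Task]) -> List[str]:
    """Hypothetical admission check: return error messages, empty if valid."""
    errors = []
    known = {t.name for t in tasks}
    for t in tasks:
        dep = t.depends_on
        if dep is None:
            continue
        for n in dep.names:
            if n == t.name:
                errors.append(f"task {t.name} depends on itself")
            elif n not in known:
                errors.append(f"task {t.name} depends on unknown task {n}")
        if dep.iteration not in ("any", "all"):
            errors.append(f"task {t.name} has invalid iteration {dep.iteration}")
    return errors


def can_start(task: Task, running: Set[str]) -> bool:
    """Decide whether a task may launch, given the names of running tasks."""
    dep = task.depends_on
    if dep is None or not dep.names:
        return True  # no dependencies: launch immediately
    check = any if dep.iteration == "any" else all
    return check(n in running for n in dep.names)


if __name__ == "__main__":
    tasks = [
        Task("task1"),
        Task("task2"),
        Task("task3", DependsOn(["task1", "task2"], iteration="any")),
    ]
    print(validate(tasks))
    print(can_start(tasks[2], {"task1"}))
```

With `iteration: any`, `task3` becomes launchable as soon as either `task1` or `task2` is running; switching to `all` keeps it waiting until both are running. The sketch does not check for dependency cycles, which a real admission check would likely also need to reject.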