# JobFlow

## Introduction

Many workloads consist of several VCJobs that depend on each other. To get such work done today, users must orchestrate the jobs manually or rely on a separate job orchestration platform. JobFlow is a new way of orchestrating VCJobs. It introduces two concepts, JobTemplate and JobFlow, for running multiple batch jobs automatically, so end users can declare their jobs and run them using complex control primitives such as sequential or parallel execution, if-then-else statements, switch-case statements, and loops.

JobFlow helps migrate AI, big data, and HPC workloads to the cloud-native world. Although some workflow engines already exist, they are not designed for batch job workloads, which typically have complex dependencies and take a long time to run, for example days or weeks. JobFlow lets end users declare their jobs as JobTemplates and reuse them. JobFlow also orchestrates those jobs using complex control primitives and launches them automatically, which can significantly reduce the total running time of a complex job and improve resource utilization. Finally, JobFlow is not a general-purpose workflow engine: it knows the details of VCJobs, so end users get a better understanding of their jobs, for example each job's running state, start and end timestamps, the next jobs to run, and the pod failure ratio.

## Scope

### In Scope
- Define the API of JobFlow
- Define the behavior of JobFlow
- Define the start sequence between multiple jobs
- Define the completion state that a dependency must reach in the job start sequence
- Start jobs according to DAG-based dependencies

### Out of Scope
- Support for job types other than vcjob
- Gang scheduling at the level of multiple vcjobs

## Scenarios

- Some jobs can start only after a previous job has completed or reached some other state. Otherwise, they cannot compute the correct result.
- Inter-job dependencies sometimes need to be of different types, such as conditional dependencies, circular dependencies, and probes.

![jobflow-1.png](../images/jobflow-1.png)

## Design

![jobflow-2.png](../images/jobflow-2.png)

In the diagram, the blue parts are native Kubernetes components, the orange parts are existing Volcano definitions, and the red parts are the new definitions introduced by JobFlow.

**Complete process of submitting a jobflow job**:

1. After the request passes admission, kubectl creates the JobTemplate and JobFlow (Volcano CRD) objects in kube-apiserver.

2. The JobFlowController uses each JobTemplate as a template, according to the configuration of the JobFlow, and creates the corresponding VcJobs according to the flow dependency rules.

3. After a VcJob is created, the VcJobController creates the corresponding Pods and PodGroups according to the configuration of the VcJob.

4. After the Pods and PodGroups are created, vc-scheduler fetches the Pod/PodGroup and node information from kube-apiserver.

5. With this information, vc-scheduler selects an appropriate node for each Pod according to its configured scheduling policy.

6. After nodes are assigned to the Pods, kubelet gets each Pod's configuration from kube-apiserver and starts the corresponding containers.

**Update jobflow**:

Currently, JobFlow does not support the update operation. The webhook blocks any update to a JobFlow.

**Delete jobflow**:

The webhook intercepts a request to delete a JobFlow that is in a non-complete state. Otherwise, after the JobFlow is deleted, all vcjobs created by that JobFlow are deleted directly.

### Controller

![jobflow-3.png](../images/jobflow-3.png)

### Webhook

```
Checks when creating a JobFlow
1. A template name cannot appear more than once in a JobFlow dependency chain
   For example: in A -> B -> A -> C, A appears twice
2. A JobFlow cannot contain a closed loop
   For example: A -> B -> C
                     ^    |
                     |    v
                     +--- D

Checks when creating a JobTemplate (following the vcjob parameter specification)
For example:
   job minAvailable must be greater than or equal to zero
   job maxRetry must be greater than or equal to zero
   tasks cannot be empty, and task names must be unique
   the number of task replicas cannot be less than zero
   task minAvailable cannot be greater than task replicas
   ...
```

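To make the closed-loop check concrete, the manifest below sketches a JobFlow that contains the loop from the diagram (b -> c -> d -> b). It uses only the `flows`, `name` and `dependsOn.targets` fields listed in the Key Fields tables of this document. The JobFlow name `loop-example` and the template names `a` to `d` are placeholders. Under the rule above, the webhook would be expected to reject this object.

```yaml
# Hypothetical JobFlow that violates the "no closed loop" check.
apiVersion: flow.volcano.sh/v1alpha1
kind: JobFlow
metadata:
  name: loop-example        # placeholder name
  namespace: default
spec:
  flows:
  - name: a
  - name: b
    dependsOn:
      targets: ['a', 'd']   # b waits for d ...
  - name: c
    dependsOn:
      targets: ['b']
  - name: d
    dependsOn:
      targets: ['c']        # ... while d waits for c, which waits for b
```

Removing `d` from the targets of `b` breaks the loop and leaves a plain chain a -> b -> c -> d.
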
### JobFlow

#### Introduction

JobFlow defines the running flow of a set of jobs. Fields in JobFlow define how jobs are orchestrated.

The short name for JobFlow is `jf`; the resources can be listed with `kubectl get jf`.

JobFlow aims to run vcjobs in Volcano according to the dependencies between them. A vcjob is created once the vcjobs it depends on satisfy the declared dependency.

#### Key Fields

##### Top-Level Attributes

The top-level attributes of a JobFlow define its apiVersion, kind, metadata, spec and status.

| Attribute | Type | Required | Default Value | Description |
| --- | --- | --- | --- | --- |
| `apiVersion` | `string` | Y | `flow.volcano.sh/v1alpha1` | A string that identifies the version of the schema the object should have. The core types use `flow.volcano.sh/v1alpha1` in this version of the documentation. |
| `kind` | `string` | Y | `JobFlow` | Must be `JobFlow`. |
| `metadata` | [`Metadata`](#Metadata) | Y | | Information about the JobFlow resource. |
| `spec` | [`Spec`](#Spec) | Y | | A specification for the JobFlow resource attributes. |
| `status` | [`Status`](#Status) | Y | | The status attributes of the JobFlow resource. |

<a id="Metadata"></a>

##### Metadata

Metadata provides basic information about the JobFlow.

| Attribute | Type | Required | Default Value | Description |
| --- | --- | --- | --- | --- |
| `name` | `string` | Y | | The name of the JobFlow. |
| `namespace` | `string` | Y | | The namespace of the JobFlow. |
| `labels` | `map[string]string` | N | | A set of string key/value pairs used as arbitrary labels on this object. Labels follow the [Kubernetes specification](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/). |
| `annotations` | `map[string]string` | N | | A set of string key/value pairs used as arbitrary descriptive text associated with this object. Annotations follow the [Kubernetes specification](https://kubernetes.io/docs/concepts/overview/working-with-objects/annotations/#syntax-and-character-set). |

<a id="Spec"></a>

##### Spec

The spec of a JobFlow defines the dependencies between its vcjobs and what happens to the generated vcjobs after the JobFlow succeeds.

| Attribute | Type | Required | Default Value | Description |
| --- | --- | --- | --- | --- |
| `flows` | [`Flow array`](#Flow) | Y | | Describes the dependencies between vcjobs. |
| `jobRetainPolicy` | `string` | Y | retain | Whether to keep the generated vcjobs after the JobFlow succeeds: `retain` keeps them, `delete` deletes them. |

<a id="Flow"></a>

##### Flow

| Attribute | Type | Required | Default Value | Description |
| --- | --- | --- | --- | --- |
| `name` | `string` | Y | | Name of the JobTemplate used for this step of the flow |
| `dependsOn` | [`DependsOn`](#DependsOn) | N | | Dependencies of this JobTemplate |
| `patch` | [`Patch`](#Patch) | N | | Patch applied to the JobTemplate |

<a id="DependsOn"></a>

##### DependsOn

| Attribute | Type | Required | Default Value | Description |
| --- | --- | --- | --- | --- |
| `targets` | `string array` | Y | | Names of all JobTemplates that this JobTemplate depends on |
| `probe` | [`Probe`](#Probe) | N | | Probe-type dependency |
| `strategy` | `string` | Y | all | Whether all of the dependencies need to be satisfied |

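As an illustration of `dependsOn`, the sketch below declares five flows that form a small DAG: `b` starts after `a`, `c` and `d` both start after `b`, and `e` starts after both `c` and `d`. It assumes that JobTemplates named `a` through `e` already exist in the `default` namespace; the JobFlow name `dag-example` and the template names are placeholders.

```yaml
# Sketch of a fan-out / fan-in dependency graph (names are placeholders).
apiVersion: flow.volcano.sh/v1alpha1
kind: JobFlow
metadata:
  name: dag-example
  namespace: default
spec:
  jobRetainPolicy: retain   # keep the generated vcjobs after the JobFlow succeeds
  flows:
  - name: a                 # no dependsOn: can start immediately
  - name: b
    dependsOn:
      targets: ['a']
  - name: c
    dependsOn:
      targets: ['b']
  - name: d
    dependsOn:
      targets: ['b']
  - name: e
    dependsOn:
      targets: ['c', 'd']   # fan-in: depends on both c and d
```

Because `strategy` is left out here, the documented default `all` should apply, so `e` would wait for both `c` and `d`.
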
<a id="Patch"></a>

##### Patch

| Attribute | Type | Required | Default Value | Description |
| --- | --- | --- | --- | --- |
| `spec` | `spec` | Y | | Patch applied to the contents of the JobTemplate's spec |

<a id="Probe"></a>

##### Probe

| Attribute | Type | Required | Default Value | Description |
| --- | --- | --- | --- | --- |
| `httpGetList` | [`HttpGet array`](#HttpGet) | N | | HttpGet-type dependencies |
| `tcpSocketList` | [`TcpSocket array`](#TcpSocket) | N | | TcpSocket-type dependencies |
| `taskStatusList` | [`TaskStatus array`](#TaskStatus) | N | | TaskStatus-type dependencies |

<a id="HttpGet"></a>

##### HttpGet

| Attribute | Type | Required | Default Value | Description |
| --- | --- | --- | --- | --- |
| `TaskName` | `string` | Y | | The name of the task under the vcjob |
| `Path` | `string` | Y | | The path of the HTTP GET request |
| `Port` | `int` | Y | | The port of the HTTP GET request |
| `httpHeader` | `HTTPHeader` | N | | The HTTP header of the HTTP GET request |

<a id="TcpSocket"></a>

##### TcpSocket

| Attribute | Type | Required | Default Value | Description |
| --- | --- | --- | --- | --- |
| `TaskName` | `string` | Y | | The name of the task under the vcjob |
| `Port` | `int` | Y | | The port of the TCP socket |

<a id="TaskStatus"></a>

##### TaskStatus

| Attribute | Type | Required | Default Value | Description |
| --- | --- | --- | --- | --- |
| `TaskName` | `string` | Y | | The name of the task under the vcjob |
| `Phase` | `string` | Y | | The phase of the task |

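A probe-type dependency could be written roughly as follows. This is only a sketch built from the field names in the tables above. It assumes that the serialized keys are lowerCamelCase (`taskName`, `phase`, `port`), although the tables capitalize some of them, and that `Running` is an accepted phase value. The JobFlow name `probe-example`, the template names `a` and `b`, the task name `default-nginx`, and port 80 are placeholders.

```yaml
# Sketch only: key casing and the phase value are assumptions, not confirmed API.
apiVersion: flow.volcano.sh/v1alpha1
kind: JobFlow
metadata:
  name: probe-example
  namespace: default
spec:
  flows:
  - name: a
  - name: b
    dependsOn:
      targets: ['a']
      probe:
        taskStatusList:
        - taskName: default-nginx   # task inside the vcjob created from template a
          phase: Running            # assumed phase value
        tcpSocketList:
        - taskName: default-nginx
          port: 80                  # placeholder port
```

The intent is that `b` depends on the state of a task inside `a` rather than only on the vcjob as a whole.
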
<a id="Status"></a>

##### Status

| Attribute | Type | Required | Default Value | Description |
| --- | --- | --- | --- | --- |
| `pendingJobs` | `string array` | N | | Vcjobs in pending state |
| `runningJobs` | `string array` | N | | Vcjobs in running state |
| `failedJobs` | `string array` | N | | Vcjobs in failed state |
| `completedJobs` | `string array` | N | | Vcjobs in completed or completing state |
| `terminatedJobs` | `string array` | N | | Vcjobs in terminated or terminating state |
| `unKnowJobs` | `string array` | N | | Vcjobs in unknown state |
| `jobStatusList` | [`JobStatus array`](#JobStatus) | N | | Status information of all vcjobs created by the JobFlow |
| `conditions` | [`map[string]Condition`](#Condition) | N | | Describes the current state, creation time, completion time and other information of each vcjob. In addition to the regular vcjob states, a waiting state describes a vcjob whose dependencies are not yet satisfied. |
| `state` | [`State`](#State) | N | | State of the JobFlow |

<a id="JobStatus"></a>

##### JobStatus

| Attribute | Type | Required | Default Value | Description |
| --- | --- | --- | --- | --- |
| `name` | `string` | N | | Name of the vcjob |
| `state` | `string` | N | | State of the vcjob |
| `startTimestamp` | `Time` | N | | Start timestamp of the vcjob |
| `endTimestamp` | `Time` | N | | End timestamp of the vcjob |
| `restartCount` | `int32` | N | | Restart count of the vcjob |
| `runningHistories` | [`JobRunningHistory array`](#JobRunningHistory) | N | | History of the states the vcjob has been in |

<a id="Condition"></a>

##### Condition

| Attribute | Type | Required | Default Value | Description |
| --- | --- | --- | --- | --- |
| `phase` | `string` | N | | Phase of the vcjob |
| `createTime` | `Time` | N | | Creation time of the vcjob |
| `runningDuration` | `Duration` | N | | Running duration of the vcjob |
| `taskStatusCount` | `map[string]TaskState` | N | | The number of tasks in each state |

<a id="State"></a>

##### State

| Attribute | Type | Required | Default Value | Description |
| --- | --- | --- | --- | --- |
| `phase` | `string` | N | | Succeed: all vcjobs have reached the completed state. <br/>Terminating: the JobFlow is being deleted. <br/>Failed: a vcjob in the flow is in the failed state, so the remaining vcjobs in the flow cannot be created. <br/>Running: the flow contains a vcjob in the running state. <br/>Pending: none of the above applies to the vcjobs under the JobFlow. |

<a id="JobRunningHistory"></a>

##### JobRunningHistory

| Attribute | Type | Required | Default Value | Description |
| --- | --- | --- | --- | --- |
| `startTimestamp` | `Time` | N | | The start time of a given state of the vcjob |
| `endTimestamp` | `Time` | N | | The end time of a given state of the vcjob |
| `state` | `string` | N | | State of the vcjob |

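To show how these status fields fit together, here is a hand-written sketch of a status stanza for a two-step JobFlow in which the first vcjob has completed and the second is running. All values are invented for illustration, and the vcjob names `test-a` and `test-b` are placeholders; this is not captured controller output.

```yaml
# Illustrative only: values are made up, vcjob names are placeholders.
status:
  completedJobs:
  - test-a
  runningJobs:
  - test-b
  jobStatusList:
  - name: test-a
    state: Completed
    restartCount: 0
  - name: test-b
    state: Running
    restartCount: 0
  state:
    phase: Running   # the flow contains a vcjob in the running state
```
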
**Scope of influence of JobFlow state changes**:

Currently, changes in the JobFlow state do not affect other resources.

**JobFlow supports patching a JobTemplate. An example in a JobFlow is as follows**:

```
apiVersion: flow.volcano.sh/v1alpha1
kind: JobFlow
metadata:
  name: test
  namespace: default
spec:
  jobRetainPolicy: delete
  flows:
  - name: a
    patch:
      spec:
        tasks:
        - name: "default-nginx"
          template:
            spec:
              containers:
              - name: nginx
                command:
                  - sh
                  - -c
                  - sleep 10s
```

Here is an example of a JobFlow:

[the sample file of JobFlow](../../../example/jobflow/JobFlow.yaml)

### JobTemplate

#### Introduction

* JobTemplate is a template for a vcjob. After a JobTemplate is created, vc-controller does not process it as it would a vcjob; the JobTemplate waits to be referenced by a JobFlow.
* A JobFlow can reference multiple JobTemplates.
* A JobTemplate can be referenced by multiple JobFlows.
* A JobTemplate can be converted to and from a vcjob.
* The short name for JobTemplate is `jt`; the resources can be listed with `kubectl get jt`.
* The difference between a JobTemplate and a vcjob is that the job controller does not run a JobTemplate. A JobFlow references the JobTemplate by name to create the vcjob.
* JobFlow supports making changes to a JobTemplate when referencing it.

#### Actions on JobTemplate and their impact

**Create jobtemplate**:

Create a JobTemplate to be used by JobFlows.

**Update jobtemplate**:

Updating a JobTemplate does not affect vcjobs that have already been created from it, nor JobFlows that have already executed successfully. It may affect JobFlows that have not finished executing. For example, a JobFlow that has not yet reached the step that references the JobTemplate will use the updated JobTemplate.

**Delete jobtemplate**:

When a JobTemplate is referenced by a non-complete JobFlow, the webhook intercepts the request to delete the JobTemplate.

#### Key Fields

##### Top-Level Attributes

The top-level attributes of a JobTemplate define its apiVersion, kind, metadata, spec and status.

| Attribute | Type | Required | Default Value | Description |
| --- | --- | --- | --- | --- |
| `apiVersion` | `string` | Y | `flow.volcano.sh/v1alpha1` | A string that identifies the version of the schema the object should have. The core types use `flow.volcano.sh/v1alpha1` in this version of the documentation. |
| `kind` | `string` | Y | `JobTemplate` | Must be `JobTemplate`. |
| `metadata` | [`Metadata`](#JobTemplateMetadata) | Y | | Information about the JobTemplate resource. |
| `spec` | [`Spec`](#JobTemplateSpec) | Y | | A specification for the JobTemplate resource attributes. |
| `status` | [`Status`](#JobTemplateStatus) | Y | | The status attributes of the JobTemplate resource. |

<a id="JobTemplateMetadata"></a>

##### Metadata

Metadata provides basic information about the JobTemplate.

| Attribute | Type | Required | Default Value | Description |
| --- | --- | --- | --- | --- |
| `name` | `string` | Y | | The name of the JobTemplate. |
| `namespace` | `string` | Y | | The namespace of the JobTemplate. |
| `labels` | `map[string]string` | N | | A set of string key/value pairs used as arbitrary labels on this object. Labels follow the [Kubernetes specification](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/). |
| `annotations` | `map[string]string` | N | | A set of string key/value pairs used as arbitrary descriptive text associated with this object. Annotations follow the [Kubernetes specification](https://kubernetes.io/docs/concepts/overview/working-with-objects/annotations/#syntax-and-character-set). |

<a id="JobTemplateSpec"></a>

##### JobTemplateSpec

The spec of a JobTemplate directly follows the spec of a vcjob.

<a id="JobTemplateStatus"></a>

##### JobTemplateStatus

| Attribute | Type | Required | Default Value | Description |
| --- | --- | --- | --- | --- |
| `jobDependsOnList` | `string array` | Y | | Vcjobs created from this JobTemplate. |

You can view [the sample file of JobTemplate](../../../example/jobflow/JobTemplate.yaml).

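For orientation, a minimal JobTemplate might look like the sketch below. Since the spec follows the vcjob spec, it uses a handful of common vcjob fields. The template name `a`, the task name `default-nginx`, the `nginx` image and the field values are placeholders chosen to line up with the patch example earlier in this document; they are not taken from the linked sample file.

```yaml
# Minimal sketch of a JobTemplate; names, image and values are placeholders.
apiVersion: flow.volcano.sh/v1alpha1
kind: JobTemplate
metadata:
  name: a
  namespace: default
spec:
  minAvailable: 1            # must be >= 0 per the webhook checks
  schedulerName: volcano
  queue: default
  maxRetry: 5                # must be >= 0 per the webhook checks
  tasks:                     # must not be empty; task names must be unique
  - name: "default-nginx"
    replicas: 1
    template:
      spec:
        containers:
        - name: nginx
          image: nginx       # placeholder image
        restartPolicy: OnFailure
```

A JobFlow would then reference this template through a flow entry with `name: a`.
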
## JobFlow task scheduling

![jobflowAnimation](../images/jobflow.gif)

## Demo video

https://www.bilibili.com/video/BV1c44y1Y7FX

## Usage

- Create the JobTemplates that will be used.
- Create a JobFlow. Its `flows` field lists the JobTemplates used to create the vcjobs.
- The `jobRetainPolicy` field indicates whether to delete the vcjobs created by the JobFlow after the JobFlow succeeds. Valid values are `delete` and `retain`; the default is `retain`.

## JobFlow Features

### Features that have been implemented

* Create the JobFlow and JobTemplate CRDs
* Support sequential start of vcjobs
* Support starting a vcjob that depends on other vcjobs
* Support converting between vcjob and JobTemplate
* Support viewing the running status of a JobFlow

### Features not yet implemented

* JobFlow supports making changes to a JobTemplate when referencing it
* `if` statements
* `switch` statements
* `for` statements
* Support job failure retry in JobFlow
* Integration with volcano-scheduler
* Support for scheduling plugins at the JobFlow level