# How to Configure Volcano Job Policy
## Background
`Policy` provides an API for managing the lifecycle of Volcano jobs and tasks. For example, in some scenarios, especially
in the AI, big data, and HPC fields, a job must be restarted if any `master` or `worker` fails. Users can achieve that
by configuring a `policy` for the Volcano job under `job.spec`.

## Key Points
* Volcano allows users to configure a pair of `Event` (or `Events`) and `Action` for a Volcano job or a task. If any of the
specified events happens, the target action is triggered.
* If a policy is configured under `job.spec` only, it applies to all tasks by default. If a policy is configured
under `task.spec` only, it applies only to that task. If policies are configured at both the job and task levels, the task
policy takes precedence.
* Users can set multiple policies for a job or a task.
* Currently, Volcano provides **5 built-in events** for users. The details are as follows.

| ID  | Event           | Description |
|-----|-----------------|-------------|
| 1   | `PodFailed`     | Checks whether the status of any pod is `Failed`. |
| 2   | `PodEvicted`    | Checks whether any pod has been evicted. |
| 3   | `Unknown`       | Checks whether the status of the Volcano job is `Unknown`. The most likely cause is an unschedulable task: in the gang-scheduling case, the event is triggered when some pods cannot be scheduled while others are already running. |
| 4   | `TaskCompleted` | Checks whether there is a task whose pods have all succeeded. If `minsuccess` is configured for a task, a task that meets it is also regarded as completed. |
| 5   | `*`             | Matches all events. It is not commonly used. |

* Currently, Volcano provides **5 built-in actions** for users. The details are as follows.

| ID  | Action         | Description |
|-----|----------------|-------------|
| 1   | `AbortJob`     | Abort the whole job; it can be resumed later. All pods will be evicted and no pod will be recreated. |
| 2   | `RestartJob`   | Restart the whole job. |
| 3   | `RestartTask`  | Default action. The task will be restarted. This action **cannot** be used with job-level events such as `Unknown`. |
| 4   | `TerminateJob` | Terminate the whole job; it **cannot** be resumed. All pods will be evicted and no pod will be recreated. |
| 5   | `CompleteJob`  | Regard the job as completed. Unfinished pods will be killed. |
    33  
    34  ## Examples
    35  1. Set a pair of `event` and `action`.
    36  ```yaml
    37  apiVersion: batch.volcano.sh/v1alpha1
    38  kind: Job
    39  metadata:
    40    name: tensorflow-dist-mnist
    41  spec:
    42    minAvailable: 3
    43    schedulerName: volcano
    44    plugins:
    45      env: []
    46      svc: []
    47    policies:
    48      - event: PodEvicted   # Job level policy. If any pod is evicted, restart the job. It will only work on `ps` task.
    49        action: RestartJob
    50    queue: default
    51    tasks:
    52      - replicas: 1
    53        name: ps
    54        template:
    55          spec:
    56            containers:
    57              - command:
    58                  - sh
    59                  - -c
    60                  - |
    61                    PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
    62                    WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
    63                    export TF_CONFIG={\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"ps\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"};   ## Get the index from the environment variable and configure it in the TF job.
    64                    python /var/tf_dist_mnist/dist_mnist.py
    65                image: volcanosh/dist-mnist-tf-example:0.0.1
    66                name: tensorflow
    67                ports:
    68                  - containerPort: 2222
    69                    name: tfjob-port
    70                resources: {}
    71            restartPolicy: Never
    72      - replicas: 2
    73        name: worker
    74        policies:
    75          - event: TaskCompleted    # Task level policy. If this task completes, complete the job.
    76            action: CompleteJob
    77        template:
    78          spec:
    79            containers:
    80              - command:
    81                  - sh
    82                  - -c
    83                  - |
    84                    PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
    85                    WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
    86                    export TF_CONFIG={\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"worker\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"};
    87                    python /var/tf_dist_mnist/dist_mnist.py
    88                image: volcanosh/dist-mnist-tf-example:0.0.1
    89                name: tensorflow
    90                ports:
    91                  - containerPort: 2222
    92                    name: tfjob-port
    93                resources: {}
    94            restartPolicy: Never
    95  ```
2. Set a pair of `events` and `action`.
```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: tensorflow-dist-mnist
spec:
  minAvailable: 3
  schedulerName: volcano
  plugins:
    env: []
    svc: []
  queue: default
  tasks:
    - replicas: 1
      name: ps
      policies:
        - events: [PodEvicted, PodFailed]   # Task level policy. If any pod is evicted or fails in this task, restart the job.
          action: RestartJob
      template:
        spec:
          containers:
            - command:
                - sh
                - -c
                - |
                  PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                  WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                  export TF_CONFIG={\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"ps\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"};   ## Get the index from the environment variable and configure it in the TF job.
                  python /var/tf_dist_mnist/dist_mnist.py
              image: volcanosh/dist-mnist-tf-example:0.0.1
              name: tensorflow
              ports:
                - containerPort: 2222
                  name: tfjob-port
              resources: {}
          restartPolicy: Never
    - replicas: 2
      name: worker
      policies:
        - event: TaskCompleted  # Task level policy. If this task completes, complete the job.
          action: CompleteJob
      template:
        spec:
          containers:
            - command:
                - sh
                - -c
                - |
                  PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                  WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                  export TF_CONFIG={\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"worker\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"};
                  python /var/tf_dist_mnist/dist_mnist.py
              image: volcanosh/dist-mnist-tf-example:0.0.1
              name: tensorflow
              ports:
                - containerPort: 2222
                  name: tfjob-port
              resources: {}
          restartPolicy: Never
```
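
The examples above attach a single policy at each level. Since a job or a task can carry multiple policies, a job-level list might look like the following fragment. This is a sketch rather than a tested manifest: tasks and other fields are omitted, and the pairing of events and actions is chosen only for illustration from the built-in events and actions listed in the tables above.

```yaml
# Fragment of job.spec only; tasks and other required fields are omitted.
spec:
  policies:
    - events: [PodEvicted, PodFailed]   # either event restarts the whole job
      action: RestartJob
    - event: Unknown                    # e.g. some pods cannot be scheduled
      action: AbortJob                  # abort the job but leave it resumable
```

Each list entry is an independent pair, so an entry uses either `event` or `events` together with exactly one `action`.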