# How to Configure Volcano Job Policy
## Background
`Policy` provides an API for managing the lifecycle of Volcano jobs and tasks. For example, in some scenarios, especially
in the AI, big data, and HPC fields, a job must be restarted if any `master` or `worker` fails. Users can achieve that
by configuring `policy` for the Volcano job under `job.spec`.

## Key Points
* Volcano allows users to configure a pair of `Event` (or `Events`) and `Action` for a Volcano job or a task. If the
specified event (or any of the specified events) happens, the target action is triggered.
* If the policy is configured under `job.spec` only, it applies to all tasks by default. If the policy is configured
under `task.spec` only, it applies to that task only. If the policy is configured at both the job and the task level,
the task policy takes precedence.
* Users can set multiple policies for a job or a task.
* Currently, Volcano provides **5 built-in events** for users. The details are as follows.

| ID | Event           | Description |
|----|-----------------|-------------|
| 1  | `PodFailed`     | Checks whether any pod's status is `Failed`. |
| 2  | `PodEvicted`    | Checks whether any pod has been evicted. |
| 3  | `Unknown`       | Checks whether the status of a Volcano job is `Unknown`. The most likely cause is an unschedulable task: the event is triggered when some pods cannot be scheduled while others are already running in the gang-scheduling case. |
| 4  | `TaskCompleted` | Checks whether there is a task whose pods have all succeeded. If `minsuccess` is configured for a task, reaching it is also regarded as the task completing. |
| 5  | `*`             | Matches all events. It is not commonly used. |

* Currently, Volcano provides **5 built-in actions** for users. The details are as follows.

| ID | Action         | Description |
|----|----------------|-------------|
| 1  | `AbortJob`     | Abort the whole job, but it can be resumed. All pods will be evicted and no pod will be recreated. |
| 2  | `RestartJob`   | Restart the whole job. |
| 3  | `RestartTask`  | Default action. The task will be restarted. This action **cannot** work with job-level events such as `Unknown`. |
| 4  | `TerminateJob` | Terminate the whole job; it **cannot** be resumed. All pods will be evicted and no pod will be recreated. |
| 5  | `CompleteJob`  | Regard the job as completed. The unfinished pods will be killed. |

## Examples
1. Set a pair of `event` and `action`.
```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: tensorflow-dist-mnist
spec:
  minAvailable: 3
  schedulerName: volcano
  plugins:
    env: []
    svc: []
  policies:
    - event: PodEvicted # Job-level policy. If any pod is evicted, restart the job. It only applies to the `ps` task.
      action: RestartJob
  queue: default
  tasks:
    - replicas: 1
      name: ps
      template:
        spec:
          containers:
            - command:
                - sh
                - -c
                - |
                  PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                  WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                  export TF_CONFIG={\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"ps\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"}; ## Get the index from the environment variable and configure it in the TF job.
                  python /var/tf_dist_mnist/dist_mnist.py
              image: volcanosh/dist-mnist-tf-example:0.0.1
              name: tensorflow
              ports:
                - containerPort: 2222
                  name: tfjob-port
              resources: {}
          restartPolicy: Never
    - replicas: 2
      name: worker
      policies:
        - event: TaskCompleted # Task-level policy. If this task completes, complete the job.
          action: CompleteJob
      template:
        spec:
          containers:
            - command:
                - sh
                - -c
                - |
                  PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                  WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                  export TF_CONFIG={\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"worker\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"};
                  python /var/tf_dist_mnist/dist_mnist.py
              image: volcanosh/dist-mnist-tf-example:0.0.1
              name: tensorflow
              ports:
                - containerPort: 2222
                  name: tfjob-port
              resources: {}
          restartPolicy: Never
```
2. Set a list of `events` with one `action`.
```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: tensorflow-dist-mnist
spec:
  minAvailable: 3
  schedulerName: volcano
  plugins:
    env: []
    svc: []
  queue: default
  tasks:
    - replicas: 1
      name: ps
      policies:
        - events: [PodEvicted, PodFailed] # Task-level policy. If any pod in this task is evicted or fails, restart the job.
          action: RestartJob
      template:
        spec:
          containers:
            - command:
                - sh
                - -c
                - |
                  PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                  WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                  export TF_CONFIG={\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"ps\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"}; ## Get the index from the environment variable and configure it in the TF job.
                  python /var/tf_dist_mnist/dist_mnist.py
              image: volcanosh/dist-mnist-tf-example:0.0.1
              name: tensorflow
              ports:
                - containerPort: 2222
                  name: tfjob-port
              resources: {}
          restartPolicy: Never
    - replicas: 2
      name: worker
      policies:
        - event: TaskCompleted # Task-level policy. If this task completes, complete the job.
          action: CompleteJob
      template:
        spec:
          containers:
            - command:
                - sh
                - -c
                - |
                  PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                  WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                  export TF_CONFIG={\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"worker\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"};
                  python /var/tf_dist_mnist/dist_mnist.py
              image: volcanosh/dist-mnist-tf-example:0.0.1
              name: tensorflow
              ports:
                - containerPort: 2222
                  name: tfjob-port
              resources: {}
          restartPolicy: Never
```
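
The key points also state that a job can carry more than one policy and that a policy set only under `job.spec` applies to every task. The following is a minimal sketch of that case, not a tested manifest: the job name `policy-demo`, the `busybox` image, and the `sleep` command are placeholders chosen for illustration, and the event/action pairings are just one plausible combination of the built-in values listed in the tables.

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: policy-demo              # hypothetical job name
spec:
  minAvailable: 2
  schedulerName: volcano
  queue: default
  policies:                      # several job-level policies; no task overrides them
    - events: [PodEvicted, PodFailed]  # either event restarts the whole job
      action: RestartJob
    - event: Unknown             # job-level event, so it is paired with a job-level action
      action: AbortJob
    - event: TaskCompleted       # once the task completes, regard the job as completed
      action: CompleteJob
  tasks:
    - replicas: 2
      name: worker               # defines no `policies`, so the job-level ones apply
      template:
        spec:
          containers:
            - name: main
              image: busybox:1.36              # placeholder image
              command: ["sh", "-c", "sleep 30"] # placeholder workload
          restartPolicy: Never
```

Because the `worker` task declares no `policies` of its own, all three job-level entries apply to it. Adding a `policies` list under the task would make the task-level policy take precedence for that task, as described in the key points.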