# Volcano Job Plugin -- SVC User Guidance

## Background
**SVC Plugin** is designed for communication between pods within a Volcano job, which is essential for workloads such as
[TensorFlow](https://tensorflow.google.cn/) and [MPI](https://www.open-mpi.org/). For example, the `ps` and `worker` pods
of a TensorFlow job need to contact each other. The Volcano job plugin `svc` enables pods within a job
to reach each other by domain name.

## Key Points
* Once the `svc` plugin is configured, the field `hostname` under `spec` is automatically set to **the pod name** for
  all pods under the job. Namely, `pod.spec.hostname` is `pod.metadata.name`.
* Once the `svc` plugin is configured, the field `subdomain` under `spec` is automatically set to **the job name** for
  all pods under the job. Namely, `pod.spec.subdomain` is `job.metadata.name`.
* Once the `svc` plugin is configured, environment variables `VC_%s_NUM` are automatically registered in all containers (including
  initContainers) under the job. `%s` is replaced by the **task name** in upper case (for example, task `ps` gives `VC_PS_NUM`).
  The value of each variable is the **task replicas**. One variable is registered per task, so there are usually `2`,
  since most AI and Big Data jobs contain 2 roles. For example, a Spark job contains `driver` and `executor`.
* Once the `svc` plugin is configured, environment variables `VC_%s_HOSTS` are automatically registered in all containers (including
  initContainers) under the job. `%s` is replaced by the **task name** in upper case (for example, task `ps` gives `VC_PS_HOSTS`).
  The value of each variable is the comma-separated list of domains of all the pods under the task. One variable is registered
  per task, so there are usually `2`, since most AI and Big Data jobs contain 2 roles.
  For example, a TensorFlow job contains `ps` and `worker`.
* A ConfigMap named `<job-name>-svc` is created automatically. It contains the replicas of all tasks and the domains of
  all pods under each task. It is mounted as a volume into all pods under the job and provides the host files under the
  directory `/etc/volcano/`.
* A headless service with the same name as the job is created.
* If `disable-network-policy` is `false` (the default), a `NetworkPolicy` object with the policy type `Ingress` is created for
  the job.

## Arguments
| ID  | Name                          | Value          | Default Value | Required | Description                                                    | Example                                     |
|-----|-------------------------------|----------------|---------------|----------|----------------------------------------------------------------|---------------------------------------------|
| 1   | `publish-not-ready-addresses` | `true`/`false` | `false`       | N        | Whether to publish the address of a pod that is not yet ready | svc: ["--publish-not-ready-addresses=true"] |
| 2   | `disable-network-policy`      | `true`/`false` | `false`       | N        | Whether to disable the network policy for the job              | svc: ["--disable-network-policy=true"]      |

## Examples
```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: tensorflow-dist-mnist
spec:
  minAvailable: 3
  schedulerName: volcano
  plugins:
    env: []
    svc: ["--publish-not-ready-addresses=false", "--disable-network-policy=false"]  ## Register the SVC plugin
  policies:
    - event: PodEvicted
      action: RestartJob
  queue: default
  tasks:
    - replicas: 1
      name: ps
      template:
        spec:
          containers:
            - command:
                - sh
                - -c
                - |
                  PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;  ## Get host domains from the generated host files
                  WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                  export TF_CONFIG={\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"ps\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"};
                  python /var/tf_dist_mnist/dist_mnist.py
              image: volcanosh/dist-mnist-tf-example:0.0.1
              name: tensorflow
              ports:
                - containerPort: 2222
                  name: tfjob-port
              resources: {}
          restartPolicy: Never
    - replicas: 2
      name: worker
      policies:
        - event: TaskCompleted
          action: CompleteJob
      template:
        spec:
          containers:
            - command:
                - sh
                - -c
                - |
                  PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                  WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                  export TF_CONFIG={\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"worker\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"};
                  python /var/tf_dist_mnist/dist_mnist.py
              image: volcanosh/dist-mnist-tf-example:0.0.1
              name: tensorflow
              ports:
                - containerPort: 2222
                  name: tfjob-port
              resources: {}
          restartPolicy: Never
```
Note:
* Fields `hostname` and `subdomain` have been added to the pods under job `tensorflow-dist-mnist`. The following is part
  of the YAML of the `ps` pod.
```yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    scheduling.k8s.io/group-name: tensorflow-dist-mnist
    volcano.sh/job-name: tensorflow-dist-mnist
    volcano.sh/job-version: "0"
    volcano.sh/queue-name: default
    volcano.sh/task-spec: ps
    volcano.sh/template-uid: tensorflow-dist-mnist-ps
  labels:
    volcano.sh/job-name: tensorflow-dist-mnist
    volcano.sh/job-namespace: default
    volcano.sh/queue-name: default
    volcano.sh/task-spec: ps
  name: tensorflow-dist-mnist-ps-0
  namespace: default
  ownerReferences:
  - apiVersion: batch.volcano.sh/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: Job
    name: tensorflow-dist-mnist
    uid: 52c98cc2-4791-490f-8572-22df2c16ef8f
  resourceVersion: "855403"
  uid: 1b9e834b-de7e-4760-9b23-2a673d38e5d9
spec:
  containers:
  - command:
    - sh
    - -c
    - |
      PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;  ## Get host domains from the generated host files
      WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
      export TF_CONFIG={\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"ps\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"};
      python /var/tf_dist_mnist/dist_mnist.py
    env:
    - name: VK_TASK_INDEX
      value: "0"
    - name: VC_TASK_INDEX
      value: "0"
    - name: VC_PS_HOSTS       ## Environment variable `VC_PS_HOSTS` contains the domains of all the `ps` hosts.
      valueFrom:
        configMapKeyRef:
          key: VC_PS_HOSTS
          name: tensorflow-dist-mnist-svc
    - name: VC_PS_NUM         ## Environment variable `VC_PS_NUM` contains the number of `ps` hosts.
      valueFrom:
        configMapKeyRef:
          key: VC_PS_NUM
          name: tensorflow-dist-mnist-svc
    - name: VC_WORKER_HOSTS   ## Environment variable `VC_WORKER_HOSTS` contains the domains of all the `worker` hosts.
      valueFrom:
        configMapKeyRef:
          key: VC_WORKER_HOSTS
          name: tensorflow-dist-mnist-svc
    - name: VC_WORKER_NUM     ## Environment variable `VC_WORKER_NUM` contains the number of `worker` hosts.
      valueFrom:
        configMapKeyRef:
          key: VC_WORKER_NUM
          name: tensorflow-dist-mnist-svc
    image: volcanosh/dist-mnist-tf-example:0.0.1
    name: tensorflow
    ports:
    - containerPort: 2222
      name: tfjob-port
      protocol: TCP
    resources: {}
    volumeMounts:             ## Mount the configmap generated for the job under `/etc/volcano`, which contains all host files.
    - mountPath: /etc/volcano
      name: tensorflow-dist-mnist-svc
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-wflz5
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostname: tensorflow-dist-mnist-ps-0   ## `hostname` field added by the plugin
  nodeName: volcano-control-plane
  restartPolicy: Never
  schedulerName: volcano
  subdomain: tensorflow-dist-mnist       ## `subdomain` field added by the plugin
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - configMap:                ## Configmap generated for the job
      defaultMode: 420
      name: tensorflow-dist-mnist-svc
    name: tensorflow-dist-mnist-svc
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-04-13T02:08:17Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2022-04-13T02:08:18Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2022-04-13T02:08:18Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2022-04-13T02:08:17Z"
    status: "True"
    type: PodScheduled
  hostIP: x.x.x.x
  phase: Running
  podIP: x.x.x.x
  podIPs:
  - ip: x.x.x.x
  qosClass: BestEffort
  startTime: "2022-04-13T02:08:17Z"
```
* Host information is registered to all pods under the job. The following are the environment variables registered for the `ps` pod.
```
[root@tensorflow-dist-mnist-ps-0 /]# env | grep VC
VC_PS_NUM=1
VC_PS_HOSTS=tensorflow-dist-mnist-ps-0.tensorflow-dist-mnist  ## ps pod domain
VC_WORKER_NUM=2
VC_WORKER_HOSTS=tensorflow-dist-mnist-worker-0.tensorflow-dist-mnist,tensorflow-dist-mnist-worker-1.tensorflow-dist-mnist  ## worker pod domains
```
* The host files added under `/etc/volcano` are as follows.
```
[root@tensorflow-dist-mnist-ps-0 /]# ls /etc/volcano/
VC_PS_HOSTS  VC_PS_NUM  VC_WORKER_HOSTS  VC_WORKER_NUM  ps.host  worker.host
[root@tensorflow-dist-mnist-ps-0 /]# cat /etc/volcano/ps.host
tensorflow-dist-mnist-ps-0.tensorflow-dist-mnist
[root@tensorflow-dist-mnist-ps-0 /]# cat /etc/volcano/worker.host
tensorflow-dist-mnist-worker-0.tensorflow-dist-mnist
tensorflow-dist-mnist-worker-1.tensorflow-dist-mnist
```
* The headless service `tensorflow-dist-mnist` generated for the job is as follows.
```yaml
apiVersion: v1
kind: Service
metadata:
  creationTimestamp: "2022-04-13T02:08:15Z"
  name: tensorflow-dist-mnist
  namespace: default
  ownerReferences:
  - apiVersion: batch.volcano.sh/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: Job
    name: tensorflow-dist-mnist
    uid: 52c98cc2-4791-490f-8572-22df2c16ef8f
  resourceVersion: "855341"
  uid: a77cb081-72ae-442f-96da-e36974dfed48
spec:
  clusterIP: None
  clusterIPs:
  - None
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  selector:
    volcano.sh/job-name: tensorflow-dist-mnist
    volcano.sh/job-namespace: default
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
```
* The configmap `tensorflow-dist-mnist-svc` generated for the job is as follows.
```yaml
apiVersion: v1
data:
  VC_PS_HOSTS: tensorflow-dist-mnist-ps-0.tensorflow-dist-mnist
  VC_PS_NUM: "1"
  VC_WORKER_HOSTS: tensorflow-dist-mnist-worker-0.tensorflow-dist-mnist,tensorflow-dist-mnist-worker-1.tensorflow-dist-mnist
  VC_WORKER_NUM: "2"
  ps.host: tensorflow-dist-mnist-ps-0.tensorflow-dist-mnist
  worker.host: |-
    tensorflow-dist-mnist-worker-0.tensorflow-dist-mnist
    tensorflow-dist-mnist-worker-1.tensorflow-dist-mnist
kind: ConfigMap
metadata:
  creationTimestamp: "2022-04-13T02:08:15Z"
  name: tensorflow-dist-mnist-svc
  namespace: default
  ownerReferences:
  - apiVersion: batch.volcano.sh/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: Job
    name: tensorflow-dist-mnist
    uid: 52c98cc2-4791-490f-8572-22df2c16ef8f
  resourceVersion: "855340"
  uid: c4f3db21-6857-451f-b8b8-bbd5aa8b06ec
```
* The NetworkPolicy `tensorflow-dist-mnist` generated for the job is as follows.
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  creationTimestamp: "2022-04-13T02:08:15Z"
  name: tensorflow-dist-mnist
  namespace: default
  ownerReferences:
  - apiVersion: batch.volcano.sh/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: Job
    name: tensorflow-dist-mnist
    uid: 52c98cc2-4791-490f-8572-22df2c16ef8f
  resourceVersion: "855343"
  uid: ddf8aada-51d7-47c1-99a0-5e0d8a913a4d
spec:
  ingress:
  - from:
    - podSelector:
        matchLabels:
          volcano.sh/job-name: tensorflow-dist-mnist
          volcano.sh/job-namespace: default
  podSelector:
    matchLabels:
      volcano.sh/job-name: tensorflow-dist-mnist
      volcano.sh/job-namespace: default
  policyTypes:
  - Ingress
```
## Note
* A cluster DNS add-on such as `coredns` is required in your Kubernetes cluster.
* Kubernetes v1.14 or later is required.
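
* The `VC_PS_HOSTS` and `VC_WORKER_HOSTS` environment variables can also be consumed directly inside the container instead of parsing the host files with `sed`. The following is a minimal Python sketch of that idea, not part of Volcano itself: the helper name `build_tf_config` is hypothetical, the port `2222` is taken from the example job in this guide, and the task index is assumed to come from `VC_TASK_INDEX`.

```python
import json
import os


def build_tf_config(task_type, task_index, environ=os.environ, port=2222):
    """Build a TF_CONFIG-style dict from the VC_*_HOSTS variables.

    Hypothetical helper for illustration; it assumes the tasks are
    named `ps` and `worker`, as in the example job of this guide.
    """

    def hosts(var):
        # Each variable holds a comma-separated list of pod domains.
        value = environ.get(var, "")
        return [f"{host}:{port}" for host in value.split(",") if host]

    return {
        "cluster": {
            "ps": hosts("VC_PS_HOSTS"),
            "worker": hosts("VC_WORKER_HOSTS"),
        },
        "task": {"type": task_type, "index": int(task_index)},
        "environment": "cloud",
    }


if __name__ == "__main__":
    # Inside a pod, the variables are read from the real environment.
    index = os.environ.get("VC_TASK_INDEX", "0")
    print(json.dumps(build_tf_config("ps", index)))
```

  The result can be serialized with `json.dumps` and exported as `TF_CONFIG` before the training script starts. Reading the environment variables avoids the shell quoting in the `sed` pipeline. The host files under `/etc/volcano/` carry the same domains, one per line.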