# Volcano Job Plugin -- SVC User Guidance

## Background
**SVC Plugin** enables communication between pods within a Volcano job, which is essential for workloads such as
[TensorFlow](https://tensorflow.google.cn/) and [MPI](https://www.open-mpi.org/). For example, in a TensorFlow job the
`ps` and `worker` pods need to contact each other. The Volcano job plugin `svc` lets pods within a job reach each other
by domain name.

## Key Points
* Once the `svc` plugin is configured, the `hostname` field under `spec` is automatically set to **the pod name** for
all pods under the job. That is, `pod.spec.hostname` equals `pod.metadata.name`.
* Once the `svc` plugin is configured, the `subdomain` field under `spec` is automatically set to **the job name** for
all pods under the job. That is, `pod.spec.subdomain` equals `job.metadata.name`.
* Once the `svc` plugin is configured, environment variables `VC_%s_NUM` are automatically registered in all containers
(including initContainers) under the job. `%s` is replaced by each **task name** in the job (upper-cased, as in
`VC_PS_NUM` in the example in this guide). The value of each variable is the **replica count of that task**. One variable
is created per task, so there are usually `2`, since most AI and Big Data jobs contain two roles. For example, a Spark
job contains `driver` and `executor`.
* Once the `svc` plugin is configured, environment variables `VC_%s_HOSTS` are automatically registered in all containers
(including initContainers) under the job. `%s` is replaced by each **task name** in the job. The value of each variable
is the comma-separated list of domains of all pods under that task. One variable is created per task, so there are
usually `2`, since most AI and Big Data jobs contain two roles. For example, a TensorFlow job contains `ps` and `worker`.
* A ConfigMap named `<job-name>-svc` is created automatically. It contains the replica count of every task and the
domains of all pods under each task. It is mounted as a volume into all pods under the job, where its entries serve as
host files under the directory `/etc/volcano/`.
* A headless service with the same name as the job is created.
* If `disable-network-policy` is set to `false`, a `NetworkPolicy` object with the policy type `Ingress` is created for
the job.
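
The naming rules above can be sketched in a few lines of Python. This is only an illustration of the conventions visible in the example in this guide, not the plugin's implementation: the helper `svc_plugin_view` is hypothetical, the pod name pattern `<job>-<task>-<index>` is inferred from the example pod names, and upper-casing the task name is inferred from `VC_PS_NUM`.

```python
def svc_plugin_view(job_name, tasks):
    """Model the fields and variables the svc plugin fills in for a job.

    `tasks` maps a task name to its replica count, e.g. {"ps": 1, "worker": 2}.
    Returns (pods, env): per-pod hostname/subdomain and the VC_* variables.
    """
    pods = {}
    env = {}
    for task, replicas in tasks.items():
        domains = []
        for index in range(replicas):
            # Assumption: pods are named <job>-<task>-<index>, as in the example.
            pod_name = f"{job_name}-{task}-{index}"
            # hostname is the pod name, subdomain is the job name.
            pods[pod_name] = {"hostname": pod_name, "subdomain": job_name}
            domains.append(f"{pod_name}.{job_name}")
        # Assumption: the task name is upper-cased in the variable name.
        key = task.upper()
        env[f"VC_{key}_NUM"] = str(replicas)
        env[f"VC_{key}_HOSTS"] = ",".join(domains)
    return pods, env


pods, env = svc_plugin_view("tensorflow-dist-mnist", {"ps": 1, "worker": 2})
for name, value in sorted(env.items()):
    print(f"{name}={value}")
```

For the job used in this guide, the sketch yields one `VC_*_NUM` and one `VC_*_HOSTS` entry per task, so four variables in total.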

## Arguments
| ID  | Name                          | Value           | Default Value | Required | Description                                                | Example                                       |
|-----|-------------------------------|-----------------|---------------|----------|------------------------------------------------------------|-----------------------------------------------|
| 1   | `publish-not-ready-addresses` | `true`/`false`  | `false`       | N        | Whether to publish a pod's address before the pod is ready | `svc: ["--publish-not-ready-addresses=true"]` |
| 2   | `disable-network-policy`      | `true`/`false`  | `false`       | N        | Whether to disable the network policy for the job          | `svc: ["--disable-network-policy=true"]`      |

## Examples
```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: tensorflow-dist-mnist
spec:
  minAvailable: 3
  schedulerName: volcano
  plugins:
    env: []
    svc: ["--publish-not-ready-addresses=false", "--disable-network-policy=false"]  ## Register the svc plugin
  policies:
    - event: PodEvicted
      action: RestartJob
  queue: default
  tasks:
    - replicas: 1
      name: ps
      template:
        spec:
          containers:
            - command:
                - sh
                - -c
                - |
                  PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;    ## Read host domains from the generated host files
                  WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                  export TF_CONFIG={\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"ps\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"};
                  python /var/tf_dist_mnist/dist_mnist.py
              image: volcanosh/dist-mnist-tf-example:0.0.1
              name: tensorflow
              ports:
                - containerPort: 2222
                  name: tfjob-port
              resources: {}
          restartPolicy: Never
    - replicas: 2
      name: worker
      policies:
        - event: TaskCompleted
          action: CompleteJob
      template:
        spec:
          containers:
            - command:
                - sh
                - -c
                - |
                  PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                  WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                  export TF_CONFIG={\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"worker\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"};
                  python /var/tf_dist_mnist/dist_mnist.py
              image: volcanosh/dist-mnist-tf-example:0.0.1
              name: tensorflow
              ports:
                - containerPort: 2222
                  name: tfjob-port
              resources: {}
          restartPolicy: Never
```
Note:
* Fields `hostname` and `subdomain` have been added to the pods under job `tensorflow-dist-mnist`. The following is part
of the YAML of the `ps` pod.
```yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    scheduling.k8s.io/group-name: tensorflow-dist-mnist
    volcano.sh/job-name: tensorflow-dist-mnist
    volcano.sh/job-version: "0"
    volcano.sh/queue-name: default
    volcano.sh/task-spec: ps
    volcano.sh/template-uid: tensorflow-dist-mnist-ps
  labels:
    volcano.sh/job-name: tensorflow-dist-mnist
    volcano.sh/job-namespace: default
    volcano.sh/queue-name: default
    volcano.sh/task-spec: ps
  name: tensorflow-dist-mnist-ps-0
  namespace: default
  ownerReferences:
  - apiVersion: batch.volcano.sh/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: Job
    name: tensorflow-dist-mnist
    uid: 52c98cc2-4791-490f-8572-22df2c16ef8f
  resourceVersion: "855403"
  uid: 1b9e834b-de7e-4760-9b23-2a673d38e5d9
spec:
  containers:
  - command:
    - sh
    - -c
    - |
      PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;    ## Read host domains from the generated host files
      WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
      export TF_CONFIG={\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"ps\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"};
      python /var/tf_dist_mnist/dist_mnist.py
    env:
    - name: VK_TASK_INDEX
      value: "0"
    - name: VC_TASK_INDEX
      value: "0"
    - name: VC_PS_HOSTS     ## Environment variable `VC_PS_HOSTS` contains the domains of all the `ps` hosts.
      valueFrom:
        configMapKeyRef:
          key: VC_PS_HOSTS
          name: tensorflow-dist-mnist-svc
    - name: VC_PS_NUM       ## Environment variable `VC_PS_NUM` contains the number of `ps` hosts.
      valueFrom:
        configMapKeyRef:
          key: VC_PS_NUM
          name: tensorflow-dist-mnist-svc
    - name: VC_WORKER_HOSTS   ## Environment variable `VC_WORKER_HOSTS` contains the domains of all the `worker` hosts.
      valueFrom:
        configMapKeyRef:
          key: VC_WORKER_HOSTS
          name: tensorflow-dist-mnist-svc
    - name: VC_WORKER_NUM     ## Environment variable `VC_WORKER_NUM` contains the number of `worker` hosts.
      valueFrom:
        configMapKeyRef:
          key: VC_WORKER_NUM
          name: tensorflow-dist-mnist-svc
    image: volcanosh/dist-mnist-tf-example:0.0.1
    name: tensorflow
    ports:
    - containerPort: 2222
      name: tfjob-port
      protocol: TCP
    resources: {}
    volumeMounts:   ## Mount the ConfigMap generated for the job under `/etc/volcano`; it contains all host files.
    - mountPath: /etc/volcano
      name: tensorflow-dist-mnist-svc
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-wflz5
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostname: tensorflow-dist-mnist-ps-0    ## `hostname` field added by the plugin
  nodeName: volcano-control-plane
  restartPolicy: Never
  schedulerName: volcano
  subdomain: tensorflow-dist-mnist        ## `subdomain` field added by the plugin
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - configMap:    ## ConfigMap generated for the job
      defaultMode: 420
      name: tensorflow-dist-mnist-svc
    name: tensorflow-dist-mnist-svc
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-04-13T02:08:17Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2022-04-13T02:08:18Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2022-04-13T02:08:18Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2022-04-13T02:08:17Z"
    status: "True"
    type: PodScheduled
  hostIP: x.x.x.x
  phase: Running
  podIP: x.x.x.x
  podIPs:
  - ip: x.x.x.x
  qosClass: BestEffort
  startTime: "2022-04-13T02:08:17Z"
```
* Host information is registered in all pods under the job. The following are the environment variables registered for the `ps` pod.
```
[root@tensorflow-dist-mnist-ps-0 /]# env | grep VC
VC_PS_NUM=1
VC_PS_HOSTS=tensorflow-dist-mnist-ps-0.tensorflow-dist-mnist  ## ps pod domain
VC_WORKER_NUM=2
VC_WORKER_HOSTS=tensorflow-dist-mnist-worker-0.tensorflow-dist-mnist,tensorflow-dist-mnist-worker-1.tensorflow-dist-mnist  ## worker pods domains
```
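
As an alternative to the `sed` pipeline in the job command, a training script could assemble `TF_CONFIG` directly from these variables. The following is a minimal sketch under stated assumptions: `build_tf_config` is a hypothetical helper, port `2222` is taken from the container port in the example job, and the index is read from `VC_TASK_INDEX` as it appears in the `ps` pod spec.

```python
import json
import os


def build_tf_config(task_type, environ=os.environ, port=2222):
    """Build a TF_CONFIG JSON string from the VC_* variables set by the svc plugin."""

    def hosts(name):
        # VC_*_HOSTS holds a comma-separated list of pod domains.
        raw = environ.get(name, "")
        return [f"{host}:{port}" for host in raw.split(",") if host]

    return json.dumps({
        "cluster": {
            "ps": hosts("VC_PS_HOSTS"),
            "worker": hosts("VC_WORKER_HOSTS"),
        },
        "task": {
            "type": task_type,
            "index": int(environ.get("VC_TASK_INDEX", "0")),
        },
        "environment": "cloud",
    })


# Sample values copied from the `ps` pod in this guide.
sample_env = {
    "VC_PS_HOSTS": "tensorflow-dist-mnist-ps-0.tensorflow-dist-mnist",
    "VC_WORKER_HOSTS": "tensorflow-dist-mnist-worker-0.tensorflow-dist-mnist,"
                       "tensorflow-dist-mnist-worker-1.tensorflow-dist-mnist",
    "VC_TASK_INDEX": "0",
}
tf_config = build_tf_config("ps", sample_env)
print(tf_config)
```

Inside a pod, calling `build_tf_config("worker")` without the second argument reads the real environment instead of the sample dictionary.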
* The host files added under `/etc/volcano` are as follows.
```
[root@tensorflow-dist-mnist-ps-0 /]# ls /etc/volcano/
VC_PS_HOSTS  VC_PS_NUM  VC_WORKER_HOSTS  VC_WORKER_NUM  ps.host  worker.host
[root@tensorflow-dist-mnist-ps-0 /]# cat /etc/volcano/ps.host
tensorflow-dist-mnist-ps-0.tensorflow-dist-mnist
[root@tensorflow-dist-mnist-ps-0 /]# cat /etc/volcano/worker.host
tensorflow-dist-mnist-worker-0.tensorflow-dist-mnist
tensorflow-dist-mnist-worker-1.tensorflow-dist-mnist
```
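
Because each host file lists one pod domain per line, reading it from application code is straightforward. The sketch below assumes a hypothetical helper `read_hosts` and, so that it runs outside a cluster, demonstrates it against a temporary directory standing in for `/etc/volcano`.

```python
import tempfile
from pathlib import Path


def read_hosts(task, host_dir="/etc/volcano"):
    """Return the pod domains listed in <host_dir>/<task>.host, one per line."""
    path = Path(host_dir) / f"{task}.host"
    return [line.strip() for line in path.read_text().splitlines() if line.strip()]


# Demo: a temporary directory plays the role of the mounted ConfigMap volume.
with tempfile.TemporaryDirectory() as demo_dir:
    Path(demo_dir, "worker.host").write_text(
        "tensorflow-dist-mnist-worker-0.tensorflow-dist-mnist\n"
        "tensorflow-dist-mnist-worker-1.tensorflow-dist-mnist"
    )
    workers = read_hosts("worker", demo_dir)

print(workers)
```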
* The headless service `tensorflow-dist-mnist` generated for the job is as follows.
```yaml
apiVersion: v1
kind: Service
metadata:
  creationTimestamp: "2022-04-13T02:08:15Z"
  name: tensorflow-dist-mnist
  namespace: default
  ownerReferences:
  - apiVersion: batch.volcano.sh/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: Job
    name: tensorflow-dist-mnist
    uid: 52c98cc2-4791-490f-8572-22df2c16ef8f
  resourceVersion: "855341"
  uid: a77cb081-72ae-442f-96da-e36974dfed48
spec:
  clusterIP: None
  clusterIPs:
  - None
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  selector:
    volcano.sh/job-name: tensorflow-dist-mnist
    volcano.sh/job-namespace: default
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
```
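
With a headless service whose name matches the pods' `subdomain`, Kubernetes DNS gives each pod a record of the form `<hostname>.<subdomain>.<namespace>.svc.<cluster-domain>`. The short `<pod>.<job>` names in the host files resolve from pods in the same namespace through the DNS search path. The sketch below builds the fully qualified name and shows how a lookup might be done; `pod_fqdn` and `resolve` are hypothetical helpers, and the cluster domain `cluster.local` is an assumption (it is the common default, but clusters may override it).

```python
import socket


def pod_fqdn(pod_name, job_name, namespace="default", cluster_domain="cluster.local"):
    """Build the fully qualified DNS name of a pod behind the job's headless service."""
    # hostname = pod name, subdomain = job name (both set by the svc plugin).
    return f"{pod_name}.{job_name}.{namespace}.svc.{cluster_domain}"


def resolve(host):
    """Return the sorted IP addresses for `host`; only meaningful inside the cluster."""
    infos = socket.getaddrinfo(host, None)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})


fqdn = pod_fqdn("tensorflow-dist-mnist-ps-0", "tensorflow-dist-mnist")
print(fqdn)
```

`resolve` is deliberately not called here, since the name only exists in a cluster where the job is running.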
* The ConfigMap `tensorflow-dist-mnist-svc` generated for the job is as follows.
```yaml
apiVersion: v1
data:
  VC_PS_HOSTS: tensorflow-dist-mnist-ps-0.tensorflow-dist-mnist
  VC_PS_NUM: "1"
  VC_WORKER_HOSTS: tensorflow-dist-mnist-worker-0.tensorflow-dist-mnist,tensorflow-dist-mnist-worker-1.tensorflow-dist-mnist
  VC_WORKER_NUM: "2"
  ps.host: tensorflow-dist-mnist-ps-0.tensorflow-dist-mnist
  worker.host: |-
    tensorflow-dist-mnist-worker-0.tensorflow-dist-mnist
    tensorflow-dist-mnist-worker-1.tensorflow-dist-mnist
kind: ConfigMap
metadata:
  creationTimestamp: "2022-04-13T02:08:15Z"
  name: tensorflow-dist-mnist-svc
  namespace: default
  ownerReferences:
  - apiVersion: batch.volcano.sh/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: Job
    name: tensorflow-dist-mnist
    uid: 52c98cc2-4791-490f-8572-22df2c16ef8f
  resourceVersion: "855340"
  uid: c4f3db21-6857-451f-b8b8-bbd5aa8b06ec
```
* The NetworkPolicy `tensorflow-dist-mnist` generated for the job is as follows.
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  creationTimestamp: "2022-04-13T02:08:15Z"
  name: tensorflow-dist-mnist
  namespace: default
  ownerReferences:
  - apiVersion: batch.volcano.sh/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: Job
    name: tensorflow-dist-mnist
    uid: 52c98cc2-4791-490f-8572-22df2c16ef8f
  resourceVersion: "855343"
  uid: ddf8aada-51d7-47c1-99a0-5e0d8a913a4d
spec:
  ingress:
  - from:
    - podSelector:
        matchLabels:
          volcano.sh/job-name: tensorflow-dist-mnist
          volcano.sh/job-namespace: default
  podSelector:
    matchLabels:
      volcano.sh/job-name: tensorflow-dist-mnist
      volcano.sh/job-namespace: default
  policyTypes:
  - Ingress
```
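
The effect of this policy is that pods of the job accept ingress only from other pods of the same job. The sketch below models just this one policy with hypothetical helpers `selector_matches` and `ingress_allowed`; it ignores any other policies in the namespace and assumes both pods live in the policy's namespace and that the cluster's network plugin enforces `NetworkPolicy` objects.

```python
JOB_LABELS = {
    "volcano.sh/job-name": "tensorflow-dist-mnist",
    "volcano.sh/job-namespace": "default",
}


def selector_matches(match_labels, pod_labels):
    """A matchLabels selector matches when every listed label has the same value on the pod."""
    return all(pod_labels.get(key) == value for key, value in match_labels.items())


def ingress_allowed(source_labels, target_labels, job_labels=JOB_LABELS):
    """Decide whether this policy alone lets `source` connect to `target`."""
    if not selector_matches(job_labels, target_labels):
        # The policy's podSelector does not select the target, so it imposes no restriction.
        return True
    # The target is selected: only peers matching the `from` podSelector are allowed.
    return selector_matches(job_labels, source_labels)


ps_pod = dict(JOB_LABELS, **{"volcano.sh/task-spec": "ps"})
worker_pod = dict(JOB_LABELS, **{"volcano.sh/task-spec": "worker"})
other_pod = {"app": "unrelated"}

print(ingress_allowed(worker_pod, ps_pod))  # worker -> ps, same job
print(ingress_allowed(other_pod, ps_pod))   # unrelated pod -> ps
```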
## Note
* A DNS plugin such as `CoreDNS` is required in your Kubernetes cluster.
* Kubernetes version >= v1.14 is required.