# Cooldown Protection Plugin User Guide

## Background
With elastic training or serving, the pods of a preemptible job can be preempted and rescheduled repeatedly. Without cooldown protection, a pod may be preempted again shortly after it starts, which degrades service stability.
The `cdp` (cooldown protection) plugin ensures that the pods of a preemptible job run for at least a user-specified time.

## Environment setup

### Install Volcano

Refer to the [Install Guide](../../installer/README.md) to install Volcano.

### Update scheduler configmap

After installation, update the scheduler configuration:

```shell
kubectl edit configmap -n volcano-system volcano-scheduler-configmap
```

Register the `cdp` plugin in the ConfigMap and enable the `preempt` action:

```yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, preempt, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
      - name: cdp
    - plugins:
      - name: drf
      - name: predicates
      - name: task-topology
        arguments:
          task-topology.weight: 10
      - name: proportion
      - name: nodeorder
      - name: binpack
```

### Running Jobs

Take a simple Volcano job as an example.

The original job YAML is shown below. It defines two tasks, "ps" and "worker".

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: test-job
spec:
  minAvailable: 3
  schedulerName: volcano
  priorityClassName: high-priority
  plugins:
    ssh: []
    env: []
    svc: []
  maxRetry: 5
  queue: default
  volumes:
    - mountPath: "/myinput"
    - mountPath: "/myoutput"
      volumeClaimName: "testvolumeclaimname"
      volumeClaim:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: "my-storage-class"
        resources:
          requests:
            storage: 1Gi
  tasks:
    - replicas: 6
      name: "worker"
      template:
        metadata:
          name: worker
        spec:
          containers:
            - image: nginx
              imagePullPolicy: IfNotPresent
              name: nginx
              resources:
                requests:
                  cpu: "1"
          restartPolicy: OnFailure
    - replicas: 2
      name: "ps"
      template:
        metadata:
          name: ps
        spec:
          containers:
            - image: nginx
              imagePullPolicy: IfNotPresent
              name: nginx
              resources:
                requests:
                  cpu: "1"
          restartPolicy: OnFailure
```

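The sample job references a PriorityClass named `high-priority`, which has to exist in the cluster before the job's pods can be created. A minimal sketch of such an object follows; the `value` and `description` are assumptions chosen for illustration, not values taken from this guide.

```yaml
# Hypothetical PriorityClass backing `priorityClassName: high-priority`.
# The numeric value is an assumption; choose one that fits your cluster.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "Example priority class for the sample Volcano job."
```
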
#### Edit the vcjob YAML

1. Add annotations to the Volcano job in the format below.
   1. The `volcano.sh/preemptable` annotation marks the job or task as preemptable.
   2. The `volcano.sh/cooldown-time` annotation sets the cooldown time for the entire job or for a specific task. The value is a duration; valid time units are "ns", "us" (or "µs"), "ms", "s", "m", "h".

        ```yaml
            volcano.sh/preemptable: "true"
            volcano.sh/cooldown-time: "600s"
        ```

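The unit list for `volcano.sh/cooldown-time` matches Go's duration string syntax, so, assuming the value is parsed that way, other units and compound values such as `1h30m` should also be accepted. The fragment below is a sketch with illustrative values only, not recommendations. Keep the value quoted, because annotation values must be strings.

```yaml
# Illustrative alternatives for the cooldown annotation (use one value).
# Assumption: the value is parsed as a Go-style duration string.
volcano.sh/preemptable: "true"
volcano.sh/cooldown-time: "10m"      # same duration as "600s"
# volcano.sh/cooldown-time: "1h30m"  # compound value: 1 hour 30 minutes
```
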
**Example 1**

Add the annotations to the entire job. Both the "ps" and "worker" tasks can then be preempted, and both get cooldown protection.

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: test-job
  annotations:
    volcano.sh/preemptable: "true"
    volcano.sh/cooldown-time: "600s"
spec:
  ... # the rest stays the same
```

**Example 2**

Add the annotations to a specific task. As shown below, only the "worker" task can be preempted and gets cooldown protection.

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: test-job
spec:
  minAvailable: 3
  schedulerName: volcano
  priorityClassName: high-priority
  plugins:
    ssh: []
    env: []
    svc: []
  maxRetry: 5
  queue: default
  volumes:
    - mountPath: "/myinput"
    - mountPath: "/myoutput"
      volumeClaimName: "testvolumeclaimname"
      volumeClaim:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: "my-storage-class"
        resources:
          requests:
            storage: 1Gi
  tasks:
    - replicas: 6
      name: "worker"
      template:
        metadata:
          name: worker
          annotations:     # add the annotations to the task
            volcano.sh/preemptable: "true"
            volcano.sh/cooldown-time: "600s"
        spec:
          containers:
            - image: nginx
              imagePullPolicy: IfNotPresent
              name: nginx
              resources:
                requests:
                  cpu: "1"
          restartPolicy: OnFailure
    - replicas: 2
      name: "ps"
      template:
        metadata:
          name: ps
        spec:
          containers:
            - image: nginx
              imagePullPolicy: IfNotPresent
              name: nginx
              resources:
                requests:
                  cpu: "1"
          restartPolicy: OnFailure
```