# Cooldown Protection Plugin User Guide

## Background

When elastic training or serving is enabled, the pods of a preemptible job can be preempted and brought back to running repeatedly. Without cooldown protection, these pods can be preempted again shortly after they start, which can reduce service stability.
The `cdp` plugin ensures that the pods of a preemptible job run for at least a user-specified time before they can be preempted again.

## Environment setup

### Install volcano

Refer to the [Install Guide](../../installer/README.md) to install volcano.

### Update scheduler configmap

After installation, update the scheduler configuration:

```shell
kubectl edit configmap -n volcano-system volcano-scheduler-configmap
```

Register the `cdp` plugin in the configmap and make sure the `preempt` action is enabled:

```yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, preempt, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
      - name: cdp
    - plugins:
      - name: drf
      - name: predicates
      - name: task-topology
        arguments:
          task-topology.weight: 10
      - name: proportion
      - name: nodeorder
      - name: binpack
```

### Running Jobs

Take a simple volcano job as an example.

The original job YAML is shown below. It has a "ps" task and a "worker" task.

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: test-job
spec:
  minAvailable: 3
  schedulerName: volcano
  priorityClassName: high-priority
  plugins:
    ssh: []
    env: []
    svc: []
  maxRetry: 5
  queue: default
  volumes:
    - mountPath: "/myinput"
    - mountPath: "/myoutput"
      volumeClaimName: "testvolumeclaimname"
      volumeClaim:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: "my-storage-class"
        resources:
          requests:
            storage: 1Gi
  tasks:
    - replicas: 6
      name: "worker"
      template:
        metadata:
          name: worker
        spec:
          containers:
            - image: nginx
              imagePullPolicy: IfNotPresent
              name: nginx
              resources:
                requests:
                  cpu: "1"
          restartPolicy: OnFailure
    - replicas: 2
      name: "ps"
      template:
        metadata:
          name: ps
        spec:
          containers:
            - image: nginx
              imagePullPolicy: IfNotPresent
              name: nginx
              resources:
                requests:
                  cpu: "1"
          restartPolicy: OnFailure
```

#### Edit yaml of vcjob

1. Add annotations to the volcano job in the format below.
   1. The `volcano.sh/preemptable` annotation indicates that the job or task is preemptable.
   2. The `volcano.sh/cooldown-time` annotation sets the cooldown time for the entire job or for a specific task. The value is a duration; valid time units are "ns", "us" (or "µs"), "ms", "s", "m", "h".

```yaml
volcano.sh/preemptable: "true"
volcano.sh/cooldown-time: "600s"
```

**Example 1**

Add the annotations to the entire job. Both the "ps" and "worker" tasks can then be preempted, and both have cooldown time support.

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: test-job
  annotations:
    volcano.sh/preemptable: "true"
    volcano.sh/cooldown-time: "600s"
spec:
  ... # below keep the same
```

**Example 2**

Add the annotations to a specific task. As shown below, only the "worker" task can be preempted and has cooldown time support.

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: test-job
spec:
  minAvailable: 3
  schedulerName: volcano
  priorityClassName: high-priority
  plugins:
    ssh: []
    env: []
    svc: []
  maxRetry: 5
  queue: default
  volumes:
    - mountPath: "/myinput"
    - mountPath: "/myoutput"
      volumeClaimName: "testvolumeclaimname"
      volumeClaim:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: "my-storage-class"
        resources:
          requests:
            storage: 1Gi
  tasks:
    - replicas: 6
      name: "worker"
      template:
        metadata:
          name: worker
          annotations: # add annotation in tasks
            volcano.sh/preemptable: "true"
            volcano.sh/cooldown-time: "600s"
        spec:
          containers:
            - image: nginx
              imagePullPolicy: IfNotPresent
              name: nginx
              resources:
                requests:
                  cpu: "1"
          restartPolicy: OnFailure
    - replicas: 2
      name: "ps"
      template:
        metadata:
          name: ps
        spec:
          containers:
            - image: nginx
              imagePullPolicy: IfNotPresent
              name: nginx
              resources:
                requests:
                  cpu: "1"
          restartPolicy: OnFailure
```
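
**Checking a cooldown value**

The time units listed for `volcano.sh/cooldown-time` are the same units that Go's `time.ParseDuration` accepts. The sketch below assumes the annotation value is interpreted that way; this guide does not confirm how the plugin parses it. `remainingCooldown` is a hypothetical helper written for illustration, not part of Volcano. It can be used to sanity-check a value before putting it in a job, and it shows the basic idea of cooldown protection: a pod stays protected until it has run for the configured duration.

```go
package main

import (
	"fmt"
	"time"
)

// remainingCooldown is a hypothetical helper, not Volcano's actual code.
// It parses a cooldown annotation value such as "600s" and returns how much
// protection time is left for a pod that has already been running for ranFor.
// Assumption: the value follows Go's time.ParseDuration syntax.
func remainingCooldown(value string, ranFor time.Duration) (time.Duration, error) {
	cooldown, err := time.ParseDuration(value)
	if err != nil {
		return 0, fmt.Errorf("invalid cooldown time %q: %w", value, err)
	}
	if ranFor >= cooldown {
		// The cooldown has elapsed, so no protection time is left.
		return 0, nil
	}
	return cooldown - ranFor, nil
}

func main() {
	// "600" has no unit, so time.ParseDuration rejects it.
	for _, value := range []string{"600s", "10m", "1h30m", "600"} {
		left, err := remainingCooldown(value, 2*time.Minute)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s: %v of protection left after 2m of running\n", value, left)
	}
}
```

Under this assumption, a value without a unit, such as `"600"`, is rejected, so the annotation should always carry an explicit unit, as in `"600s"`.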