---
title: "Sequential Admission with Ready Pods"
date: 2022-03-14
weight: 4
description: >
  Simple implementation of the all-or-nothing scheduling
---

Some jobs need all of their pods to be running at the same time to operate; for
example, synchronized distributed training or MPI-based jobs which require
pod-to-pod communication. On a default Kueue configuration, a pair of such jobs
may deadlock if the physical availability of resources does not match the
configured quotas in Kueue. The same pair of jobs could run to completion if
their pods were scheduled sequentially.

To address this requirement, in version 0.3.0 we introduced an opt-in mechanism
configured via the `waitForPodsReady` field, which provides a simple
implementation of all-or-nothing scheduling. When enabled, the admission of a
workload is blocked not only on the availability of quota, but also until all
previously admitted jobs have their pods scheduled (all pods are running or
completed, up to the level of the job's parallelism).

This page shows you how to configure Kueue to use `waitForPodsReady`, which is
a simple implementation of all-or-nothing scheduling.
The intended audience for this page is [batch administrators](/docs/tasks#batch-administrator).

## Before you begin

Make sure the following conditions are met:

- A Kubernetes cluster is running.
- The kubectl command-line tool can communicate with your cluster.
- [Kueue is installed](/docs/installation) in version 0.3.0 or later.

## Enabling waitForPodsReady

Follow the instructions described
[here](/docs/installation#install-a-custom-configured-released-version) to
install a release version, extending the configuration with the following
fields:

```yaml
waitForPodsReady:
  enable: true
  timeout: 10m
```

> **Note**
> If you update an existing Kueue installation, you may need to restart the
> `kueue-controller-manager` pod for Kueue to pick up the updated
> configuration. In that case run:

```shell
kubectl delete pods --all -n kueue-system
```

The timeout (`waitForPodsReady.timeout`) is an optional parameter, defaulting
to 5 minutes.

When the timeout expires for an admitted Workload whose pods are not all
scheduled yet (that is, the Workload condition remains `PodsReady=False`),
the Workload's admission is cancelled, the corresponding job is suspended,
and the Workload is requeued.
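For context, the `waitForPodsReady` snippet above is nested directly under the
top-level Kueue `Configuration` object. A minimal sketch of how it sits in the
full configuration; the omitted fields are illustrative and may differ in your
installation:

```yaml
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
# ... other fields such as health, metrics, and webhook are omitted ...
waitForPodsReady:
  enable: true
  timeout: 10m
```

In a default installation this object is typically stored in the
`kueue-manager-config` ConfigMap in the `kueue-system` namespace.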
## Example

In this example we demonstrate the impact of enabling `waitForPodsReady` in
Kueue. We create two jobs, each of which requires all of its pods to be running
at the same time to complete. The cluster has enough resources to support
running one of the jobs at a time, but not both.

> **Note**
> In this example we use a cluster with autoscaling disabled, in order to
> simulate issues with resource provisioning to satisfy the configured cluster
> quota.

### 1. Preparation

First, check the amount of allocatable memory in your cluster. In many cases
this can be done with this command:

```shell
TOTAL_ALLOCATABLE=$(kubectl get node --selector='!node-role.kubernetes.io/master,!node-role.kubernetes.io/control-plane' -o jsonpath='{range .items[*]}{.status.allocatable.memory}{"\n"}{end}' | numfmt --from=auto | awk '{s+=$1} END {print s}')
echo $TOTAL_ALLOCATABLE
```

In our case this outputs `8838569984` which, for the purpose of the example,
can be approximated as `8429Mi`.
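If you want to derive the example's numbers from your own cluster's total, the
following sketch (assuming `TOTAL_ALLOCATABLE` is set by the command above and
a job parallelism of 20) computes the `Mi` approximation, the doubled
`nominalQuota` used below, and the 75% per-pod memory request used in the job
template:

```shell
# Convert bytes to Mi: 8838569984 / 1024 / 1024 ≈ 8429.
TOTAL_MI=$((TOTAL_ALLOCATABLE / 1024 / 1024))
echo "allocatable:  ${TOTAL_MI}Mi"

# Double the allocatable memory for the ClusterQueue quota: 2 * 8429 = 16858.
echo "nominalQuota: $((TOTAL_MI * 2))Mi"

# 75% of the per-pod share across 20 pods: 75% * 8429 / 20 ≈ 316.
echo "request:      $((TOTAL_MI * 75 / 100 / 20))Mi"
```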
#### Configure ClusterQueue quotas

We configure the memory flavor by doubling the total memory allocatable in our
cluster, in order to simulate issues with provisioning.

Save the following cluster queues configuration as `cluster-queues.yaml`:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "default-flavor"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cluster-queue"
spec:
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["memory"]
    flavors:
    - name: "default-flavor"
      resources:
      - name: "memory"
        nominalQuota: 16858Mi # double the value of allocatable memory in the cluster
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: "default"
  name: "user-queue"
spec:
  clusterQueue: "cluster-queue"
```

Then, apply the configuration:

```shell
kubectl apply -f cluster-queues.yaml
```

#### Prepare the job template

Save the following job template in the `job-template.yaml` file. Note the
`_ID_` placeholders, which will be replaced to create the configurations for
the two jobs. Also, pay attention to set the container's memory request to 75%
of the total allocatable memory divided by the number of pods per job; in our
case this is `75% * (8429Mi / 20) ≈ 316Mi`. In this scenario there are not
enough resources to run all pods of both jobs at the same time, risking a
deadlock.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: svc_ID_
spec:
  clusterIP: None
  selector:
    job-name: job_ID_
  ports:
  - name: http
    protocol: TCP
    port: 8080
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: script-code_ID_
data:
  main.py: |
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.request import urlopen
    import sys, os, time, logging

    logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
    serverPort = 8080
    INDEX_COUNT = int(sys.argv[1])
    index = int(os.environ.get('JOB_COMPLETION_INDEX'))
    logger = logging.getLogger('LOG' + str(index))

    class WorkerServer(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(200)
            self.end_headers()
            if "exit" in self.path:
                self.wfile.write(bytes("Exiting", "utf-8"))
                self.wfile.close()
                sys.exit(0)
            else:
                self.wfile.write(bytes("Running", "utf-8"))

    def call_until_success(url):
        while True:
            try:
                logger.info("Calling URL: " + url)
                with urlopen(url) as response:
                    response_content = response.read().decode('utf-8')
                    logger.info("Response content from %s: %s" % (url, response_content))
                    return
            except Exception as e:
                logger.warning("Got exception when calling %s: %s" % (url, e))
            time.sleep(1)

    if __name__ == "__main__":
        if index == 0:
            for i in range(1, INDEX_COUNT):
                call_until_success("http://job_ID_-%d.svc_ID_:8080/ping" % i)
            logger.info("All workers running")

            time.sleep(10) # sleep 10s to simulate doing something

            for i in range(1, INDEX_COUNT):
                call_until_success("http://job_ID_-%d.svc_ID_:8080/exit" % i)
            logger.info("All workers stopped")
        else:
            webServer = HTTPServer(("", serverPort), WorkerServer)
            logger.info("Server started at port %s" % serverPort)
            webServer.serve_forever()
---
apiVersion: batch/v1
kind: Job
metadata:
  name: job_ID_
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  parallelism: 20
  completions: 20
  completionMode: Indexed
  suspend: true
  template:
    spec:
      subdomain: svc_ID_
      volumes:
      - name: script-volume
        configMap:
          name: script-code_ID_
      containers:
      - name: main
        image: python:bullseye
        command: ["python"]
        args:
        - /script-path/main.py
        - "20"
        ports:
        - containerPort: 8080
        imagePullPolicy: IfNotPresent
        resources:
          requests:
            memory: "316Mi" # choose the value as 75% * (total allocatable memory / 20)
        volumeMounts:
        - mountPath: /script-path
          name: script-volume
      restartPolicy: Never
  backoffLimit: 0
```
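Optionally, before creating anything, you can check that the template renders
into valid manifests once the placeholders are substituted. A sketch using
kubectl's client-side dry run, mirroring the `sed` substitution used in the
next step:

```shell
# Render the template for job 1 and validate the manifests without creating them.
sed 's/_ID_/1/g' job-template.yaml | kubectl apply --dry-run=client -f -
```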
#### Additional quick job

We also prepare an additional quick job, to increase the variance in the
timings and make the deadlock more likely. Save the following yaml as
`quick-job.yaml`:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: quick-job
  annotations:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  parallelism: 50
  completions: 50
  suspend: true
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: sleep
        image: bash:5
        command: ["bash"]
        args: ["-c", 'echo "Hello world"']
        resources:
          requests:
            memory: "1"
  backoffLimit: 0
```

### 2. Induce a deadlock under the default configuration (optional)

#### Run the jobs

```shell
sed 's/_ID_/1/g' job-template.yaml > /tmp/job1.yaml
sed 's/_ID_/2/g' job-template.yaml > /tmp/job2.yaml
kubectl create -f quick-job.yaml
kubectl create -f /tmp/job1.yaml
kubectl create -f /tmp/job2.yaml
```

After a while, check the status of the pods:

```shell
kubectl get pods
```

The output looks like the following (the pods of the `quick-job` are omitted
for brevity):

```shell
NAME            READY   STATUS    RESTARTS   AGE
job1-0-9pvs8    1/1     Running   0          28m
job1-1-w9zht    1/1     Running   0          28m
job1-10-fg99v   1/1     Running   0          28m
job1-11-4gspm   1/1     Running   0          28m
job1-12-w5jft   1/1     Running   0          28m
job1-13-8d5jk   1/1     Running   0          28m
job1-14-h5q8x   1/1     Running   0          28m
job1-15-kkv4j   0/1     Pending   0          28m
job1-16-frs8k   0/1     Pending   0          28m
job1-17-g78g8   0/1     Pending   0          28m
job1-18-2ghmt   0/1     Pending   0          28m
job1-19-4w2j5   0/1     Pending   0          28m
job1-2-9s486    1/1     Running   0          28m
job1-3-s9kh4    1/1     Running   0          28m
job1-4-52mj9    1/1     Running   0          28m
job1-5-bpjv5    1/1     Running   0          28m
job1-6-7f7tj    1/1     Running   0          28m
job1-7-pnq7w    1/1     Running   0          28m
job1-8-7s894    1/1     Running   0          28m
job1-9-kz4gt    1/1     Running   0          28m
job2-0-x6xvg    1/1     Running   0          28m
job2-1-flkpj    1/1     Running   0          28m
job2-10-vf4j9   1/1     Running   0          28m
job2-11-ktbld   0/1     Pending   0          28m
job2-12-sf4xb   1/1     Running   0          28m
job2-13-9j7lp   0/1     Pending   0          28m
job2-14-czc6l   1/1     Running   0          28m
job2-15-m77zt   0/1     Pending   0          28m
job2-16-7p7fs   0/1     Pending   0          28m
job2-17-sfdmj   0/1     Pending   0          28m
job2-18-cs4lg   0/1     Pending   0          28m
job2-19-x66dt   0/1     Pending   0          28m
job2-2-hnqjv    1/1     Running   0          28m
job2-3-pkwhw    1/1     Running   0          28m
job2-4-gdtsh    1/1     Running   0          28m
job2-5-6swdc    1/1     Running   0          28m
job2-6-qb6sp    1/1     Running   0          28m
job2-7-grcg4    0/1     Pending   0          28m
job2-8-kg568    1/1     Running   0          28m
job2-9-hvwj8    0/1     Pending   0          28m
```

These jobs are now deadlocked and will not be able to make progress.
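You can confirm the deadlock by counting how many pods of each job are actually
running; neither job reaches its parallelism of 20. A small sketch over plain
`kubectl` output, relying on the `job-name` label that Kubernetes sets on pods
created by a Job:

```shell
# Count the Running pods per job; both counts stay below the parallelism of 20.
for job in job1 job2; do
  running=$(kubectl get pods --selector=job-name=$job --no-headers | grep -c ' Running ')
  echo "$job: $running of 20 pods running"
done
```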
#### Cleanup

Clean up the jobs by running:

```shell
kubectl delete -f quick-job.yaml
kubectl delete -f /tmp/job1.yaml
kubectl delete -f /tmp/job2.yaml
```

### 3. Run with waitForPodsReady enabled

#### Enable waitForPodsReady

Update the Kueue configuration following the instructions [here](#enabling-waitforpodsready).

#### Run the jobs

Run the jobs with the same commands as in the previous section:

```shell
sed 's/_ID_/1/g' job-template.yaml > /tmp/job1.yaml
sed 's/_ID_/2/g' job-template.yaml > /tmp/job2.yaml
kubectl create -f quick-job.yaml
kubectl create -f /tmp/job1.yaml
kubectl create -f /tmp/job2.yaml
```

#### Monitor the progress

Execute the following command at intervals of a couple of seconds to monitor
the progress:

```shell
kubectl get pods
```

In the outputs below, we omit the pods of the completed `quick-job` for
brevity.

Output when `job1` is starting up; note that `job2` remains suspended:

```shell
NAME            READY   STATUS              RESTARTS   AGE
job1-0-gc284    0/1     ContainerCreating   0          1s
job1-1-xz555    0/1     ContainerCreating   0          1s
job1-10-2ltws   0/1     Pending             0          1s
job1-11-r4778   0/1     ContainerCreating   0          1s
job1-12-xx8mn   0/1     Pending             0          1s
job1-13-glb8j   0/1     Pending             0          1s
job1-14-gnjpg   0/1     Pending             0          1s
job1-15-dzlqh   0/1     Pending             0          1s
job1-16-ljnj9   0/1     Pending             0          1s
job1-17-78tzv   0/1     Pending             0          1s
job1-18-4lhw2   0/1     Pending             0          1s
job1-19-hx6zv   0/1     Pending             0          1s
job1-2-hqlc6    0/1     ContainerCreating   0          1s
job1-3-zx55w    0/1     ContainerCreating   0          1s
job1-4-k2tb4    0/1     Pending             0          1s
job1-5-2zcw2    0/1     ContainerCreating   0          1s
job1-6-m2qzw    0/1     ContainerCreating   0          1s
job1-7-hgp9n    0/1     ContainerCreating   0          1s
job1-8-ss248    0/1     ContainerCreating   0          1s
job1-9-nwqmj    0/1     ContainerCreating   0          1s
```

Output when `job1` is running and `job2` is now unsuspended, because `job1` has
all of its required pods ready:

```shell
NAME            READY   STATUS    RESTARTS   AGE
job1-0-gc284    1/1     Running   0          9s
job1-1-xz555    1/1     Running   0          9s
job1-10-2ltws   1/1     Running   0          9s
job1-11-r4778   1/1     Running   0          9s
job1-12-xx8mn   1/1     Running   0          9s
job1-13-glb8j   1/1     Running   0          9s
job1-14-gnjpg   1/1     Running   0          9s
job1-15-dzlqh   1/1     Running   0          9s
job1-16-ljnj9   1/1     Running   0          9s
job1-17-78tzv   1/1     Running   0          9s
job1-18-4lhw2   1/1     Running   0          9s
job1-19-hx6zv   1/1     Running   0          9s
job1-2-hqlc6    1/1     Running   0          9s
job1-3-zx55w    1/1     Running   0          9s
job1-4-k2tb4    1/1     Running   0          9s
job1-5-2zcw2    1/1     Running   0          9s
job1-6-m2qzw    1/1     Running   0          9s
job1-7-hgp9n    1/1     Running   0          9s
job1-8-ss248    1/1     Running   0          9s
job1-9-nwqmj    1/1     Running   0          9s
job2-0-djnjd    1/1     Running   0          3s
job2-1-trw7b    0/1     Pending   0          2s
job2-10-228cc   0/1     Pending   0          2s
job2-11-2ct8m   0/1     Pending   0          2s
job2-12-sxkqm   0/1     Pending   0          2s
job2-13-md92n   0/1     Pending   0          2s
job2-14-4v2ww   0/1     Pending   0          2s
job2-15-sph8h   0/1     Pending   0          2s
job2-16-2nvk2   0/1     Pending   0          2s
job2-17-f7g6z   0/1     Pending   0          2s
job2-18-9t9xd   0/1     Pending   0          2s
job2-19-tgf5c   0/1     Pending   0          2s
job2-2-9hcsd    0/1     Pending   0          2s
job2-3-557lt    0/1     Pending   0          2s
job2-4-k2d6b    0/1     Pending   0          2s
job2-5-nkkhx    0/1     Pending   0          2s
job2-6-5r76n    0/1     Pending   0          2s
job2-7-pmzb5    0/1     Pending   0          2s
job2-8-xdqtp    0/1     Pending   0          2s
job2-9-c4rcl    0/1     Pending   0          2s
```

Once `job1` completes, it frees the resources `job2` requires to run its
remaining pods. Finally, all jobs complete.

#### Cleanup

Clean up the jobs by running:

```shell
kubectl delete -f quick-job.yaml
kubectl delete -f /tmp/job1.yaml
kubectl delete -f /tmp/job2.yaml
```

## Drawbacks

When `waitForPodsReady` is enabled, the admission of Workloads may be
unnecessarily slowed down by the sequencing when the cluster has enough
resources to support concurrent Workload startup.
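If this slowdown matters for your setup, newer Kueue versions expose a
`blockAdmission` field under `waitForPodsReady` that keeps the timeout-based
requeuing while disabling the sequential blocking. The sketch below assumes
that field is available in your version; check the Configuration API reference
for your release before relying on it:

```yaml
waitForPodsReady:
  enable: true
  timeout: 10m
  # Assumption: blockAdmission is supported by your Kueue version. When false,
  # workloads are admitted without waiting for previously admitted workloads'
  # pods to become ready, but a workload is still evicted and requeued if its
  # own pods are not ready within the timeout.
  blockAdmission: false
```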