---
title: "Sequential Admission with Ready Pods"
date: 2022-03-14
weight: 4
description: >
  Simple implementation of the all-or-nothing scheduling
---

Some jobs need all pods to be running at the same time to operate; for example,
synchronized distributed training or MPI-based jobs which require pod-to-pod
communication. With the default Kueue configuration, a pair of such jobs may
deadlock if the physical availability of resources does not match the quotas
configured in Kueue. The same pair of jobs could run to completion if their
pods were scheduled sequentially.

To address this requirement, in version 0.3.0 we introduced an opt-in mechanism
configured via the `waitForPodsReady` configuration field that provides a simple
implementation of all-or-nothing scheduling. When enabled, admission of workloads
is blocked not only on the availability of quota, but also until all previously
admitted jobs have their pods scheduled (all pods are running or completed, up to
the level of the job parallelism).

This page shows you how to configure Kueue to use `waitForPodsReady`, which
is a simple implementation of all-or-nothing scheduling.
The intended audience for this page is [batch administrators](/docs/tasks#batch-administrator).

## Before you begin

Make sure the following conditions are met:

- A Kubernetes cluster is running.
- The kubectl command-line tool can communicate with your cluster.
- [Kueue is installed](/docs/installation) in version 0.3.0 or later (see the
  command after this list for one way to check the installed version).
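
If you are not sure which version of Kueue is installed, one way to check is to
inspect the image used by the controller manager; the namespace and deployment
name below assume a default installation:

```shell
kubectl -n kueue-system get deployment kueue-controller-manager \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
```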

## Enabling waitForPodsReady

Follow the instructions described
[here](/docs/installation#install-a-custom-configured-released-version) to
install a release version by extending the configuration with the following
fields:

```yaml
    waitForPodsReady:
      enable: true
      timeout: 10m
```
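
For reference, here is a sketch of how these fields sit inside the complete Kueue
`Configuration` object; the surrounding fields shown are illustrative defaults
from a typical installation and may differ in your cluster:

```yaml
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
namespace: kueue-system
health:
  healthProbeBindAddress: :8081
metrics:
  bindAddress: :8080
webhook:
  port: 9443
manageJobsWithoutQueueName: false
waitForPodsReady:
  enable: true
  timeout: 10m
```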

> **Note**
> If you update an existing Kueue installation, you may need to restart the
> `kueue-controller-manager` pod in order for Kueue to pick up the updated
> configuration. In that case run:

```shell
kubectl delete pods --all -n kueue-system
```

The timeout (`waitForPodsReady.timeout`) is an optional parameter, defaulting to
5 minutes.

When the timeout expires for an admitted Workload, and the Workload's
pods are not all scheduled yet (i.e., the Workload condition remains
`PodsReady=False`), then the Workload's admission is
cancelled, the corresponding job is suspended, and the Workload is requeued.
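
If you want to see why a particular Workload was requeued, one way is to inspect
its `PodsReady` condition directly; the workload name below is a placeholder:

```shell
kubectl -n default get workload <workload-name> \
  -o jsonpath='{.status.conditions[?(@.type=="PodsReady")]}{"\n"}'
```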

## Example

In this example we demonstrate the impact of enabling `waitForPodsReady` in Kueue.
We create two jobs which both require all their pods to be running at the same
time to complete. The cluster has enough resources to run one of the jobs at a
time, but not both.

> **Note**
> In this example we use a cluster with autoscaling disabled in order to simulate
> issues with resource provisioning to satisfy the configured cluster quota.

### 1. Preparation

First, check the amount of allocatable memory in your cluster. In many cases this
can be done with this command:

```shell
TOTAL_ALLOCATABLE=$(kubectl get node --selector='!node-role.kubernetes.io/master,!node-role.kubernetes.io/control-plane' -o jsonpath='{range .items[*]}{.status.allocatable.memory}{"\n"}{end}' | numfmt --from=auto | awk '{s+=$1} END {print s}')
echo $TOTAL_ALLOCATABLE
```

In our case this outputs `8838569984`, which, for the purpose of the example, can
be approximated as `8429Mi`.
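
If you want to convert the byte value reported for your own cluster into the `Mi`
figure used in the rest of this example, a quick shell calculation (integer
division, rounding down) is:

```shell
# divide the allocatable bytes by 1024*1024 to get mebibytes
echo "$((TOTAL_ALLOCATABLE / 1024 / 1024))Mi"
# for 8838569984 this prints 8429Mi
```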

#### Configure ClusterQueue quotas

We configure the memory quota of the flavor to be double the total allocatable
memory in our cluster, in order to simulate issues with provisioning.

Save the following cluster queues configuration as `cluster-queues.yaml`:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "default-flavor"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cluster-queue"
spec:
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["memory"]
    flavors:
    - name: "default-flavor"
      resources:
      - name: "memory"
        nominalQuota: 16858Mi # double the value of allocatable memory in the cluster
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: "default"
  name: "user-queue"
spec:
  clusterQueue: "cluster-queue"
```

Then, apply the configuration:

```shell
kubectl apply -f cluster-queues.yaml
```
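
To confirm that the quota was registered, you can inspect the created objects;
this is just one way to check, and the jsonpath expression is illustrative:

```shell
# print the nominal memory quota configured on the ClusterQueue
kubectl get clusterqueue cluster-queue \
  -o jsonpath='{.spec.resourceGroups[0].flavors[0].resources[0].nominalQuota}{"\n"}'
# confirm the LocalQueue exists in the default namespace
kubectl -n default get localqueue user-queue
```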

#### Prepare the job template

Save the following job template in the `job-template.yaml` file. Note the
`_ID_` placeholders which will be replaced to create the configurations for the
two jobs. Also, make sure to set the memory request for the container to 75% of
the per-pod share of the total allocatable memory. In our case this is
`75% * (8429Mi / 20) ≈ 316Mi`. With this configuration there are not enough
resources to run all pods of both jobs at the same time, risking a deadlock.
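
If you are adapting this example to your own cluster, the per-pod request can be
derived from the total allocatable bytes computed earlier; a minimal shell
calculation (integer arithmetic, parallelism of 20) is:

```shell
# 75% of (total allocatable memory / 20 pods), expressed in Mi
echo "$((TOTAL_ALLOCATABLE / 20 * 75 / 100 / 1024 / 1024))Mi"
# for 8838569984 this prints 316Mi
```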

```yaml
apiVersion: v1
kind: Service
metadata:
  name: svc_ID_
spec:
  clusterIP: None
  selector:
    job-name: job_ID_
  ports:
  - name: http
    protocol: TCP
    port: 8080
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: script-code_ID_
data:
  main.py: |
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.request import urlopen
    import sys, os, time, logging

    logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
    serverPort = 8080
    INDEX_COUNT = int(sys.argv[1])
    index = int(os.environ.get('JOB_COMPLETION_INDEX'))
    logger = logging.getLogger('LOG' + str(index))

    class WorkerServer(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(200)
            self.end_headers()
            if "exit" in self.path:
              self.wfile.write(bytes("Exiting", "utf-8"))
              self.wfile.close()
              sys.exit(0)
            else:
              self.wfile.write(bytes("Running", "utf-8"))

    def call_until_success(url):
      while True:
        try:
          logger.info("Calling URL: " + url)
          with urlopen(url) as response:
            response_content = response.read().decode('utf-8')
            logger.info("Response content from %s: %s" % (url, response_content))
            return
        except Exception as e:
          logger.warning("Got exception when calling %s: %s" % (url, e))
        time.sleep(1)

    if __name__ == "__main__":
      if index == 0:
        for i in range(1, INDEX_COUNT):
          call_until_success("http://job_ID_-%d.svc_ID_:8080/ping" % i)
        logger.info("All workers running")

        time.sleep(10) # sleep 10s to simulate doing something

        for i in range(1, INDEX_COUNT):
          call_until_success("http://job_ID_-%d.svc_ID_:8080/exit" % i)
        logger.info("All workers stopped")
      else:
        webServer = HTTPServer(("", serverPort), WorkerServer)
        logger.info("Server started at port %s" % serverPort)
        webServer.serve_forever()
---
apiVersion: batch/v1
kind: Job
metadata:
  name: job_ID_
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  parallelism: 20
  completions: 20
  completionMode: Indexed
  suspend: true
  template:
    spec:
      subdomain: svc_ID_
      volumes:
      - name: script-volume
        configMap:
          name: script-code_ID_
      containers:
      - name: main
        image: python:bullseye
        command: ["python"]
        args:
        - /script-path/main.py
        - "20"
        ports:
        - containerPort: 8080
        imagePullPolicy: IfNotPresent
        resources:
          requests:
            memory: "316Mi" # choose the value as 75% * (total allocatable memory / 20)
        volumeMounts:
          - mountPath: /script-path
            name: script-volume
      restartPolicy: Never
  backoffLimit: 0
```

#### Additional quick job

We also prepare an additional job to increase the variance in the timings to
make the deadlock more likely. Save the following yaml as `quick-job.yaml`:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: quick-job
  annotations:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  parallelism: 50
  completions: 50
  suspend: true
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: sleep
        image: bash:5
        command: ["bash"]
        args: ["-c", 'echo "Hello world"']
        resources:
          requests:
            memory: "1"
  backoffLimit: 0
```

### 2. Induce a deadlock under the default configuration (optional)

#### Run the jobs

```shell
sed 's/_ID_/1/g' job-template.yaml > /tmp/job1.yaml
sed 's/_ID_/2/g' job-template.yaml > /tmp/job2.yaml
kubectl create -f quick-job.yaml
kubectl create -f /tmp/job1.yaml
kubectl create -f /tmp/job2.yaml
```

After a while, check the status of the pods by running:

```shell
kubectl get pods
```

The output looks like this (omitting the pods of the `quick-job` for brevity):

```shell
NAME            READY   STATUS      RESTARTS   AGE
job1-0-9pvs8    1/1     Running     0          28m
job1-1-w9zht    1/1     Running     0          28m
job1-10-fg99v   1/1     Running     0          28m
job1-11-4gspm   1/1     Running     0          28m
job1-12-w5jft   1/1     Running     0          28m
job1-13-8d5jk   1/1     Running     0          28m
job1-14-h5q8x   1/1     Running     0          28m
job1-15-kkv4j   0/1     Pending     0          28m
job1-16-frs8k   0/1     Pending     0          28m
job1-17-g78g8   0/1     Pending     0          28m
job1-18-2ghmt   0/1     Pending     0          28m
job1-19-4w2j5   0/1     Pending     0          28m
job1-2-9s486    1/1     Running     0          28m
job1-3-s9kh4    1/1     Running     0          28m
job1-4-52mj9    1/1     Running     0          28m
job1-5-bpjv5    1/1     Running     0          28m
job1-6-7f7tj    1/1     Running     0          28m
job1-7-pnq7w    1/1     Running     0          28m
job1-8-7s894    1/1     Running     0          28m
job1-9-kz4gt    1/1     Running     0          28m
job2-0-x6xvg    1/1     Running     0          28m
job2-1-flkpj    1/1     Running     0          28m
job2-10-vf4j9   1/1     Running     0          28m
job2-11-ktbld   0/1     Pending     0          28m
job2-12-sf4xb   1/1     Running     0          28m
job2-13-9j7lp   0/1     Pending     0          28m
job2-14-czc6l   1/1     Running     0          28m
job2-15-m77zt   0/1     Pending     0          28m
job2-16-7p7fs   0/1     Pending     0          28m
job2-17-sfdmj   0/1     Pending     0          28m
job2-18-cs4lg   0/1     Pending     0          28m
job2-19-x66dt   0/1     Pending     0          28m
job2-2-hnqjv    1/1     Running     0          28m
job2-3-pkwhw    1/1     Running     0          28m
job2-4-gdtsh    1/1     Running     0          28m
job2-5-6swdc    1/1     Running     0          28m
job2-6-qb6sp    1/1     Running     0          28m
job2-7-grcg4    0/1     Pending     0          28m
job2-8-kg568    1/1     Running     0          28m
job2-9-hvwj8    0/1     Pending     0          28m
```

These jobs are now deadlocked and will not be able to make progress.
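
To confirm that Kueue admitted both workloads even though their pods cannot all
be scheduled, you can list the Workload objects and check their admission status;
the generated Workload names will differ in your cluster:

```shell
kubectl -n default get workloads
```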

#### Cleanup

Clean up the jobs by:

```shell
kubectl delete -f quick-job.yaml
kubectl delete -f /tmp/job1.yaml
kubectl delete -f /tmp/job2.yaml
```

### 3. Run with waitForPodsReady enabled

#### Enable waitForPodsReady

Update the Kueue configuration following the instructions [here](#enabling-waitforpodsready).
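
After updating, one way to verify that the new configuration is in place is to
look for the `waitForPodsReady` block in the manager's ConfigMap; the ConfigMap
name below assumes a default installation:

```shell
kubectl -n kueue-system get configmap kueue-manager-config -o yaml | grep -A 2 waitForPodsReady
```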

#### Run the jobs

Create the jobs using the same commands as before:

```shell
sed 's/_ID_/1/g' job-template.yaml > /tmp/job1.yaml
sed 's/_ID_/2/g' job-template.yaml > /tmp/job2.yaml
kubectl create -f quick-job.yaml
kubectl create -f /tmp/job1.yaml
kubectl create -f /tmp/job2.yaml
```

#### Monitor the progress

Execute the following command at intervals of a couple of seconds to monitor
the progress:

```shell
kubectl get pods
```

We omit the pods of the completed `quick-job` for brevity.

Output when `job1` is starting up; note that `job2` remains suspended:

```shell
NAME            READY   STATUS              RESTARTS   AGE
job1-0-gc284    0/1     ContainerCreating   0          1s
job1-1-xz555    0/1     ContainerCreating   0          1s
job1-10-2ltws   0/1     Pending             0          1s
job1-11-r4778   0/1     ContainerCreating   0          1s
job1-12-xx8mn   0/1     Pending             0          1s
job1-13-glb8j   0/1     Pending             0          1s
job1-14-gnjpg   0/1     Pending             0          1s
job1-15-dzlqh   0/1     Pending             0          1s
job1-16-ljnj9   0/1     Pending             0          1s
job1-17-78tzv   0/1     Pending             0          1s
job1-18-4lhw2   0/1     Pending             0          1s
job1-19-hx6zv   0/1     Pending             0          1s
job1-2-hqlc6    0/1     ContainerCreating   0          1s
job1-3-zx55w    0/1     ContainerCreating   0          1s
job1-4-k2tb4    0/1     Pending             0          1s
job1-5-2zcw2    0/1     ContainerCreating   0          1s
job1-6-m2qzw    0/1     ContainerCreating   0          1s
job1-7-hgp9n    0/1     ContainerCreating   0          1s
job1-8-ss248    0/1     ContainerCreating   0          1s
job1-9-nwqmj    0/1     ContainerCreating   0          1s
```
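
At this stage you can also confirm that `job2` is still suspended by checking its
`suspend` field directly:

```shell
# prints "true" while Kueue keeps job2 suspended
kubectl -n default get job job2 -o jsonpath='{.spec.suspend}{"\n"}'
```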

Output when `job1` is running and `job2` is now unsuspended, as `job1` has all
the required resources assigned:

```shell
NAME            READY   STATUS      RESTARTS   AGE
job1-0-gc284    1/1     Running     0          9s
job1-1-xz555    1/1     Running     0          9s
job1-10-2ltws   1/1     Running     0          9s
job1-11-r4778   1/1     Running     0          9s
job1-12-xx8mn   1/1     Running     0          9s
job1-13-glb8j   1/1     Running     0          9s
job1-14-gnjpg   1/1     Running     0          9s
job1-15-dzlqh   1/1     Running     0          9s
job1-16-ljnj9   1/1     Running     0          9s
job1-17-78tzv   1/1     Running     0          9s
job1-18-4lhw2   1/1     Running     0          9s
job1-19-hx6zv   1/1     Running     0          9s
job1-2-hqlc6    1/1     Running     0          9s
job1-3-zx55w    1/1     Running     0          9s
job1-4-k2tb4    1/1     Running     0          9s
job1-5-2zcw2    1/1     Running     0          9s
job1-6-m2qzw    1/1     Running     0          9s
job1-7-hgp9n    1/1     Running     0          9s
job1-8-ss248    1/1     Running     0          9s
job1-9-nwqmj    1/1     Running     0          9s
job2-0-djnjd    1/1     Running     0          3s
job2-1-trw7b    0/1     Pending     0          2s
job2-10-228cc   0/1     Pending     0          2s
job2-11-2ct8m   0/1     Pending     0          2s
job2-12-sxkqm   0/1     Pending     0          2s
job2-13-md92n   0/1     Pending     0          2s
job2-14-4v2ww   0/1     Pending     0          2s
job2-15-sph8h   0/1     Pending     0          2s
job2-16-2nvk2   0/1     Pending     0          2s
job2-17-f7g6z   0/1     Pending     0          2s
job2-18-9t9xd   0/1     Pending     0          2s
job2-19-tgf5c   0/1     Pending     0          2s
job2-2-9hcsd    0/1     Pending     0          2s
job2-3-557lt    0/1     Pending     0          2s
job2-4-k2d6b    0/1     Pending     0          2s
job2-5-nkkhx    0/1     Pending     0          2s
job2-6-5r76n    0/1     Pending     0          2s
job2-7-pmzb5    0/1     Pending     0          2s
job2-8-xdqtp    0/1     Pending     0          2s
job2-9-c4rcl    0/1     Pending     0          2s
```

Once `job1` completes, it frees the resources required for `job2`'s pods to run
and make progress. Finally, all jobs complete.
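
If you want to follow the run to the end, you can watch the jobs until both
report all completions (press Ctrl+C to stop watching):

```shell
kubectl -n default get jobs -w
```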

#### Cleanup

Clean up the jobs by:

```shell
kubectl delete -f quick-job.yaml
kubectl delete -f /tmp/job1.yaml
kubectl delete -f /tmp/job2.yaml
```

## Drawbacks

When `waitForPodsReady` is enabled, the admission of Workloads may be
unnecessarily slowed down by the sequencing, even when the cluster has enough
resources to support concurrent Workload startup.