---
title: "Run Jobs Using Python"
date: 2023-07-05
weight: 7
description: >
  Run Kueue jobs programmatically with Python
---

This guide is for [batch users](/docs/tasks#batch-user) who have a basic understanding of interacting with Kubernetes from Python. For more information, see [Kueue's overview](/docs/overview).

## Before you begin

Check [administer cluster quotas](/docs/tasks/administer_cluster_quotas) for details on the initial cluster setup.
You'll also need the Kubernetes Python client installed. We recommend using a virtual environment.

```bash
python -m venv env
source env/bin/activate
pip install kubernetes requests
```

The following versions were used to develop these examples:

 - **Python**: 3.9.12
 - **kubernetes**: 26.1.0
 - **requests**: 2.31.0

You can either follow the [install instructions](https://github.com/kubernetes-sigs/kueue#installation) for Kueue, or use the [install example](#install-kueue) below.

## Kueue in Python

At its core, Kueue is a controller for a [Custom Resource](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/), so interacting with it from Python does not require a custom SDK. Instead, we can use the generic functions provided by the
[Kubernetes Python](https://github.com/kubernetes-client/python) library. This guide provides several examples
of interacting with Kueue in this fashion. If you would like to request a new example or need help with a specific use
case, please [open an issue](https://github.com/kubernetes-sigs/kueue/issues).

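As a minimal sketch of what "generic functions" means here, the snippet below lists LocalQueue objects through the client's `CustomObjectsApi`. The API version `v1beta1`, the `default` namespace, and the helper names `queue_names` and `list_local_queues` are assumptions made for illustration; check them against the Kueue release installed in your cluster.

```python
# Kueue API coordinates. The version "v1beta1" is an assumption and may
# differ for the Kueue release you have installed.
KUEUE_GROUP = "kueue.x-k8s.io"
KUEUE_VERSION = "v1beta1"


def queue_names(listing):
    """Return the queue names found in a custom object listing (a plain dict)."""
    return [item["metadata"]["name"] for item in listing.get("items", [])]


def list_local_queues(namespace="default"):
    """Fetch LocalQueue objects in a namespace with the generic CustomObjectsApi."""
    # Imported lazily so queue_names() can be used without a cluster.
    from kubernetes import client, config

    config.load_kube_config()
    crd_api = client.CustomObjectsApi()
    return crd_api.list_namespaced_custom_object(
        group=KUEUE_GROUP,
        version=KUEUE_VERSION,
        namespace=namespace,
        plural="localqueues",
    )
```

With a reachable cluster, `queue_names(list_local_queues())` returns the names of the local queues in the `default` namespace. No Kueue-specific SDK is involved; the same call pattern works for any custom resource once you know its group, version, and plural name.
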
## Examples

The following examples demonstrate different use cases for using Kueue in Python.

### Install Kueue

This example demonstrates installing Kueue on an existing cluster. You can save this
script to your local machine as `install-kueue-queues.py`.

{{< include file="examples/python/install-kueue-queues.py" lang="python" >}}

Then run it as follows:

```bash
python install-kueue-queues.py
```

```console
⭐️ Installing Kueue...
⭐️ Applying queues from single-clusterqueue-setup.yaml...
```

You can also target a specific version:

```bash
python install-kueue-queues.py --version {{< param "version" >}}
```

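To make the effect of a version argument concrete, here is a hedged sketch of how a release tag could be mapped to a manifest URL and a `kubectl apply` command. It is not the install script itself: the helper names `manifest_url` and `install_command` are hypothetical, and the URL pattern is the one used for the Kueue release manifests elsewhere in this guide.

```python
import subprocess

# URL pattern for Kueue release manifests, as used elsewhere in this guide.
RELEASES = "https://github.com/kubernetes-sigs/kueue/releases/download"


def manifest_url(version):
    """Build the manifest URL for a release tag such as "v0.4.0"."""
    return f"{RELEASES}/{version}/manifests.yaml"


def install_command(version):
    """Return the kubectl command (as an argument list) that applies the manifest."""
    return ["kubectl", "apply", "-f", manifest_url(version)]


def install(version):
    """Run the command; requires kubectl and access to a cluster."""
    subprocess.run(install_command(version), check=True)
```

Building the command as a list instead of a shell string avoids quoting problems, and `check=True` raises an error if `kubectl` exits with a non-zero status.
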
### Sample Job

For the next example, let's start with a cluster that has Kueue installed and the queues already created (for example, with the install script above), and then submit a sample job:

{{< include file="examples/python/sample-job.py" code="true" lang="python" >}}

And run as follows:

```bash
python sample-job.py
```
```console
📦️ Container image selected is gcr.io/k8s-staging-perf-tests/sleep:v0.1.0...
⭐️ Creating sample job with prefix sample-job-...
Use:
"kubectl get queue" to see queue assignment
"kubectl get jobs" to see jobs
```

Or try changing the name (`generateName`) of the job:

```bash
python sample-job.py --job-name sleep-job-
```

```console
📦️ Container image selected is gcr.io/k8s-staging-perf-tests/sleep:v0.1.0...
⭐️ Creating sample job with prefix sleep-job-...
Use:
"kubectl get queue" to see queue assignment
"kubectl get jobs" to see jobs
```

You can also change the container image with `--image` and args with `--args`.
For more customization, you can edit the example script.

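If you prefer to build the job yourself, the following sketch shows the pieces Kueue cares about: a `generateName` prefix and the `kueue.x-k8s.io/queue-name` label that assigns the job to a local queue. It is an illustration, not a copy of `sample-job.py`; the helper names, the `30s` argument, the resource requests, and creating the job with `suspend: True` are all assumptions chosen for the example.

```python
def build_job(
    generate_name="sample-job-",
    queue="user-queue",
    image="gcr.io/k8s-staging-perf-tests/sleep:v0.1.0",
    args=("30s",),
):
    """Return a batch/v1 Job manifest (plain dict) assigned to a Kueue local queue."""
    container = {
        "name": "main",
        "image": image,
        "args": list(args),
        # Illustrative requests; Kueue uses requests to account for quota.
        "resources": {"requests": {"cpu": "1", "memory": "200Mi"}},
    }
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "generateName": generate_name,
            # This label is how the job is assigned to a local queue.
            "labels": {"kueue.x-k8s.io/queue-name": queue},
        },
        "spec": {
            # Assumption: start suspended and let Kueue unsuspend on admission.
            "suspend": True,
            "template": {
                "spec": {"restartPolicy": "Never", "containers": [container]}
            },
        },
    }


def submit(job, namespace="default"):
    """Create the job in the cluster; requires the kubernetes package and a kubeconfig."""
    from kubernetes import client, config

    config.load_kube_config()
    return client.BatchV1Api().create_namespaced_job(namespace, job)
```

For example, `submit(build_job(generate_name="sleep-job-"))` would create a job whose name starts with `sleep-job-` in the `default` namespace.
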
### Interact with Queues and Jobs

If you are developing an application that submits jobs and needs to interact
with and check on them, you likely want to interact with queues or jobs directly.
After running the example above, you can test the following example to interact
with the results. Write the following to a script called `sample-queue-control.py`.

{{< include file="examples/python/sample-queue-control.py" lang="python" >}}

To make the output more interesting, we can run a few random jobs first:

```bash
python sample-job.py
python sample-job.py
python sample-job.py --job-name tacos
```

Then run the script to see your queue and the sample jobs that you submitted previously.

```bash
python sample-queue-control.py
```
```console
⛑️  Local Queues
Found queue user-queue
  Admitted workloads: 3
  Pending workloads: 0
  Flavor default-flavor has resources [{'name': 'cpu', 'total': '3'}, {'name': 'memory', 'total': '600Mi'}]

💼️ Jobs
Found job sample-job-8n5sb
  Succeeded: 3
  Ready: 0
Found job sample-job-gnxtl
  Succeeded: 1
  Ready: 0
Found job tacos46bqw
  Succeeded: 1
  Ready: 1
```

If you want to filter jobs to a specific queue, you can do this via the job labels
under `job["metadata"]["labels"]["kueue.x-k8s.io/queue-name"]`. To read a specific job by
name, you can do:

```python
from kubernetes import client, config

# Interact with batch
config.load_kube_config()
batch_api = client.BatchV1Api()

# The arguments are the job name and the namespace
job = batch_api.read_namespaced_job("tacos46bqw", "default")
print(job)
```

See the [BatchV1](https://github.com/kubernetes-client/python/blob/master/kubernetes/docs/BatchV1Api.md)
API documentation for more calls.

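The queue filtering mentioned above can be sketched in two ways: on the server with a label selector, or on the client over job dictionaries you already hold. The helper names below are hypothetical, and the `default` namespace is an assumption.

```python
QUEUE_LABEL = "kueue.x-k8s.io/queue-name"


def queue_selector(queue_name):
    """Build a label selector string that matches jobs assigned to a queue."""
    return f"{QUEUE_LABEL}={queue_name}"


def jobs_in_queue(jobs, queue_name):
    """Client-side filter over job dicts, keeping those labeled with the queue."""
    selected = []
    for job in jobs:
        labels = job.get("metadata", {}).get("labels") or {}
        if labels.get(QUEUE_LABEL) == queue_name:
            selected.append(job)
    return selected


def list_jobs_in_queue(queue_name, namespace="default"):
    """Server-side filter; requires the kubernetes package and a kubeconfig."""
    from kubernetes import client, config

    config.load_kube_config()
    batch_api = client.BatchV1Api()
    return batch_api.list_namespaced_job(
        namespace, label_selector=queue_selector(queue_name)
    )
```

The server-side variant returns a job list object whose `items` attribute holds the matching jobs, and it avoids transferring jobs you would discard anyway.
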
### Flux Operator Job

For this example, we will use the [Flux Operator](https://github.com/flux-framework/flux-operator)
to submit a job, using its [Python SDK](https://github.com/flux-framework/flux-operator/tree/main/sdk/python/v1alpha1) to make this easy. With the Python environment created in the [setup](#before-you-begin) active, install the SDK as follows:

```bash
pip install fluxoperator
```

We will also need to [install the Flux operator](https://flux-framework.org/flux-operator/getting_started/user-guide.html#quick-install).

```bash
kubectl apply -f https://raw.githubusercontent.com/flux-framework/flux-operator/main/examples/dist/flux-operator.yaml
```

Write the following script to `sample-flux-operator-job.py`:

{{< include file="examples/python/sample-flux-operator-job.py" lang="python" >}}

Now try running the example:

```bash
python sample-flux-operator-job.py
```
```console
📦️ Container image selected is ghcr.io/flux-framework/flux-restful-api...
⭐️ Creating sample job with prefix hello-world...
Use:
"kubectl get queue" to see queue assignment
"kubectl get pods" to see pods
```

Almost immediately, you'll see the MiniCluster job admitted to the local queue:

```bash
kubectl get queue
```
```console
NAME         CLUSTERQUEUE    PENDING WORKLOADS   ADMITTED WORKLOADS
user-queue   cluster-queue   0                   1
```

You'll also see the four pods running (we are creating a networked cluster with four nodes):

```bash
kubectl get pods
```
```console
NAME                       READY   STATUS      RESTARTS   AGE
hello-world7qgqd-0-wp596   1/1     Running     0          7s
hello-world7qgqd-1-d7r87   1/1     Running     0          7s
hello-world7qgqd-2-rfn4t   1/1     Running     0          7s
hello-world7qgqd-3-blvtn   1/1     Running     0          7s
```

If you look at the logs of the main broker pod (index 0 of the job above), there is a lot of
debugging output, and you can see "hello world" running near the end:

```bash
kubectl logs hello-world7qgqd-0-wp596
```

<details>

<summary>Flux Operator Lead Broker Output</summary>

```console
🌀 Submit Mode: flux start -o --config /etc/flux/config -Scron.directory=/etc/flux/system/cron.d   -Stbon.fanout=256   -Srundir=/run/flux    -Sstatedir=/var/lib/flux   -Slocal-uri=local:///run/flux/local     -Slog-stderr-level=6    -Slog-stderr-mode=local  flux submit  -n 1 --quiet  --watch echo hello world
broker.info[0]: start: none->join 0.399725ms
broker.info[0]: parent-none: join->init 0.030894ms
cron.info[0]: synchronizing cron tasks to event heartbeat.pulse
job-manager.info[0]: restart: 0 jobs
job-manager.info[0]: restart: 0 running jobs
job-manager.info[0]: restart: checkpoint.job-manager not found
broker.info[0]: rc1.0: running /etc/flux/rc1.d/01-sched-fluxion
sched-fluxion-resource.info[0]: version 0.27.0-15-gc90fbcc2
sched-fluxion-resource.warning[0]: create_reader: allowlist unsupported
sched-fluxion-resource.info[0]: populate_resource_db: loaded resources from core's resource.acquire
sched-fluxion-qmanager.info[0]: version 0.27.0-15-gc90fbcc2
broker.info[0]: rc1.0: running /etc/flux/rc1.d/02-cron
broker.info[0]: rc1.0: /etc/flux/rc1 Exited (rc=0) 0.5s
broker.info[0]: rc1-success: init->quorum 0.485239s
broker.info[0]: online: hello-world7qgqd-0 (ranks 0)
broker.info[0]: online: hello-world7qgqd-[0-3] (ranks 0-3)
broker.info[0]: quorum-full: quorum->run 0.354587s
hello world
broker.info[0]: rc2.0: flux submit -n 1 --quiet --watch echo hello world Exited (rc=0) 0.3s
broker.info[0]: rc2-success: run->cleanup 0.308392s
broker.info[0]: cleanup.0: flux queue stop --quiet --all --nocheckpoint Exited (rc=0) 0.1s
broker.info[0]: cleanup.1: flux cancel --user=all --quiet --states RUN Exited (rc=0) 0.1s
broker.info[0]: cleanup.2: flux queue idle --quiet Exited (rc=0) 0.1s
broker.info[0]: cleanup-success: cleanup->shutdown 0.252899s
broker.info[0]: children-complete: shutdown->finalize 47.6699ms
broker.info[0]: rc3.0: running /etc/flux/rc3.d/01-sched-fluxion
broker.info[0]: rc3.0: /etc/flux/rc3 Exited (rc=0) 0.2s
broker.info[0]: rc3-success: finalize->goodbye 0.212425s
broker.info[0]: goodbye: goodbye->exit 0.06917ms
```

</details>

If you submit and ask for four tasks, you'll see "hello world" four times:

```bash
python sample-flux-operator-job.py --tasks 4
```
```console
...
broker.info[0]: quorum-full: quorum->run 23.5812s
hello world
hello world
hello world
hello world
```

You can further customize the job, and can ask questions on the [Flux Operator issues board](https://github.com/flux-framework/flux-operator/issues).
Finally, for instructions on how to do this with YAML outside of Python, see [Run A Flux MiniCluster](/docs/tasks/run_flux_minicluster/).

### MPI Operator Job

For this example, we will use the [MPI Operator](https://www.kubeflow.org/docs/components/training/mpi/)
to submit a job, using its [Python SDK](https://github.com/kubeflow/mpi-operator/tree/master/sdk/python/v2beta1) to make this easy. With the Python environment created in the [setup](#before-you-begin) active, install the SDK as follows:

```bash
git clone --depth 1 https://github.com/kubeflow/mpi-operator /tmp/mpijob
cd /tmp/mpijob/sdk/python/v2beta1
python setup.py install
cd -
```

Importantly, the MPI Operator *must be installed before Kueue* for this to work! Let's start from scratch with a new Kind cluster,
and then [install the MPI operator](https://github.com/kubeflow/mpi-operator/tree/master#installation) and Kueue, in that order. Here we install
the exact versions tested with this example:

```bash
kubectl apply -f https://github.com/kubeflow/mpi-operator/releases/download/v0.4.0/mpi-operator.yaml
kubectl apply -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.4.0/manifests.yaml
```

Check the [mpi-operator release page](https://github.com/kubeflow/mpi-operator/releases) and [Kueue release page](https://github.com/kubernetes-sigs/kueue/releases) for alternate versions.
You need to wait until Kueue is ready. You can determine this as follows:

```bash
# Wait until all pods in the kueue-system namespace are Running
kubectl get pods -n kueue-system
```

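Since this guide drives everything from Python, the same readiness check can be sketched with the Kubernetes client. This is a minimal illustration under stated assumptions: the helper names, the 300 second timeout, and the 5 second polling interval are arbitrary, and a pod in the `Running` phase is not necessarily passing its readiness probes.

```python
import time


def all_running(phases):
    """True when there is at least one pod and every pod phase is "Running"."""
    phases = list(phases)
    return bool(phases) and all(phase == "Running" for phase in phases)


def wait_for_kueue(namespace="kueue-system", timeout=300, interval=5):
    """Poll the namespace until all pods are Running or the timeout expires."""
    # Imported lazily so all_running() can be used without a cluster.
    from kubernetes import client, config

    config.load_kube_config()
    core_api = client.CoreV1Api()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        pods = core_api.list_namespaced_pod(namespace)
        if all_running(pod.status.phase for pod in pods.items):
            return True
        time.sleep(interval)
    return False
```

An empty namespace is treated as "not ready" so that the check does not pass before the Kueue pods have been created.
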
When Kueue is ready, create the queues:

```bash
kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/kueue/main/site/static/examples/admin/single-clusterqueue-setup.yaml
```

Write the following script to `sample-mpijob.py`:

{{< include file="examples/python/sample-mpijob.py" lang="python" >}}

Now try running the example MPI job:

```bash
python sample-mpijob.py
```
```console
📦️ Container image selected is mpioperator/mpi-pi:openmpi...
⭐️ Creating sample job with prefix pi...
Use:
"kubectl get queue" to see queue assignment
"kubectl get jobs" to see jobs
```

After submitting, you can see that the queue has an admitted workload!

```bash
kubectl get queue
```
```console
NAME         CLUSTERQUEUE    PENDING WORKLOADS   ADMITTED WORKLOADS
user-queue   cluster-queue   0                   1
```

And that the job "pi-launcher" has started:

```bash
kubectl get jobs
```
```console
NAME          COMPLETIONS   DURATION   AGE
pi-launcher   0/1           9s         9s
```

The MPI Operator works by way of a central launcher interacting with the worker pods via SSH. We can inspect
a worker and the launcher to get a glimpse of how both work:

```bash
kubectl logs pods/pi-worker-1
```
```console
Server listening on 0.0.0.0 port 22.
Server listening on :: port 22.
Accepted publickey for mpiuser from 10.244.0.8 port 51694 ssh2: ECDSA SHA256:rgZdwufXolOkUPA1w0bf780BNJC8e4/FivJb1/F7OOI
Received disconnect from 10.244.0.8 port 51694:11: disconnected by user
Disconnected from user mpiuser 10.244.0.8 port 51694
Received signal 15; terminating.
```

The job is fairly quick, and we can see the output of pi in the launcher:

```bash
kubectl logs pods/pi-launcher-f4gqv
```
```console
Warning: Permanently added 'pi-worker-0.pi-worker.default.svc,10.244.0.7' (ECDSA) to the list of known hosts.
Warning: Permanently added 'pi-worker-1.pi-worker.default.svc,10.244.0.9' (ECDSA) to the list of known hosts.
Rank 1 on host pi-worker-1
Workers: 2
Rank 0 on host pi-worker-0
pi is approximately 3.1410376000000002
```

That looks like pi! 🎉️🥧️
If you are interested in running this same example with YAML outside of Python, see [Run an MPIJob](/docs/tasks/run_mpi_jobs/).