---
title: "Run Jobs Using Python"
date: 2023-07-05
weight: 7
description: >
  Run Kueue jobs programmatically with Python
---

This guide is for [batch users](/docs/tasks#batch-user) who have a basic understanding of interacting with Kubernetes from Python. For more information, see [Kueue's overview](/docs/overview).

## Before you begin

Check [administer cluster quotas](/docs/tasks/administer_cluster_quotas) for details on the initial cluster setup.
You'll also need the Kubernetes Python client installed. We recommend using a virtual environment.

```bash
python -m venv env
source env/bin/activate
pip install kubernetes requests
```

The following versions were used to develop these examples:

- **Python**: 3.9.12
- **kubernetes**: 26.1.0
- **requests**: 2.31.0

You can either follow the [install instructions](https://github.com/kubernetes-sigs/kueue#installation) for Kueue, or use the install example below.

## Kueue in Python

At its core, Kueue is a controller for a [Custom Resource](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/), so interacting with it from Python does not require a custom SDK. Instead, we can use the generic functions provided by the
[Kubernetes Python](https://github.com/kubernetes-client/python) library. In this guide, we provide several examples
for interacting with Kueue in this fashion. If you would like to request a new example or would like help with a specific use
case, please [open an issue](https://github.com/kubernetes-sigs/kueue/issues).

## Examples

The following examples demonstrate different use cases for Kueue in Python.

### Install Kueue

This example demonstrates installing Kueue to an existing cluster. You can save this
script to your local machine as `install-kueue-queues.py`.

{{< include file="examples/python/install-kueue-queues.py" lang="python" >}}

Then run it as follows:

```bash
python install-kueue-queues.py
```

```console
⭐️ Installing Kueue...
⭐️ Applying queues from single-clusterqueue-setup.yaml...
```

You can also target a specific version:

```bash
python install-kueue-queues.py --version {{< param "version" >}}
```

### Sample Job

For the next example, start with a cluster that has Kueue installed and the queues created (for example, with the install script above), then submit a sample job:

{{< include file="examples/python/sample-job.py" code="true" lang="python" >}}

Run it as follows:

```bash
python sample-job.py
```
```console
📦️ Container image selected is gcr.io/k8s-staging-perf-tests/sleep:v0.1.0...
⭐️ Creating sample job with prefix sample-job-...
Use:
"kubectl get queue" to see queue assignment
"kubectl get jobs" to see jobs
```

Or try changing the name (`generateName`) of the job:

```bash
python sample-job.py --job-name sleep-job-
```

```console
📦️ Container image selected is gcr.io/k8s-staging-perf-tests/sleep:v0.1.0...
⭐️ Creating sample job with prefix sleep-job-...
Use:
"kubectl get queue" to see queue assignment
"kubectl get jobs" to see jobs
```

You can also change the container image with `--image` and the arguments with `--args`.
For more customization, you can edit the example script.

### Interact with Queues and Jobs

If you are developing an application that submits jobs and needs to check on them,
you will likely want to interact with queues and jobs directly.
After running the example above, you can use the following example to inspect
the results. Write the following to a script called `sample-queue-control.py`.

{{< include file="examples/python/sample-queue-control.py" lang="python" >}}

To make the output more interesting, we can run a few random jobs first:

```bash
python sample-job.py
python sample-job.py
python sample-job.py --job-name tacos
```

Then run the script to see your queue and the sample jobs that you submitted previously.

```bash
python sample-queue-control.py
```
```console
⛑️ Local Queues
Found queue user-queue
Admitted workloads: 3
Pending workloads: 0
Flavor default-flavor has resources [{'name': 'cpu', 'total': '3'}, {'name': 'memory', 'total': '600Mi'}]

💼️ Jobs
Found job sample-job-8n5sb
Succeeded: 3
Ready: 0
Found job sample-job-gnxtl
Succeeded: 1
Ready: 0
Found job tacos46bqw
Succeeded: 1
Ready: 1
```

If you want to filter jobs to a specific queue, you can do this via the job labels
under `job["metadata"]["labels"]["kueue.x-k8s.io/queue-name"]`. To retrieve a specific job by
name, you can do:

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig and create a batch API client
config.load_kube_config()
batch_api = client.BatchV1Api()

# Provide the job name and the namespace
job = batch_api.read_namespaced_job("tacos46bqw", "default")
print(job)
```

See the [BatchV1](https://github.com/kubernetes-client/python/blob/master/kubernetes/docs/BatchV1Api.md)
API documentation for more calls.

### Flux Operator Job

For this example, we will be using the [Flux Operator](https://github.com/flux-framework/flux-operator)
to submit a job, and specifically using the [Python SDK](https://github.com/flux-framework/flux-operator/tree/main/sdk/python/v1alpha1) to do this easily.
Given our Python environment created in the [setup](#before-you-begin), we can install this Python SDK directly to it as follows:

```bash
pip install fluxoperator
```

We will also need to [install the Flux operator](https://flux-framework.org/flux-operator/getting_started/user-guide.html#quick-install).

```bash
kubectl apply -f https://raw.githubusercontent.com/flux-framework/flux-operator/main/examples/dist/flux-operator.yaml
```

Write the following script to `sample-flux-operator-job.py`:

{{< include file="examples/python/sample-flux-operator-job.py" lang="python" >}}

Now try running the example:

```bash
python sample-flux-operator-job.py
```
```console
📦️ Container image selected is ghcr.io/flux-framework/flux-restful-api...
⭐️ Creating sample job with prefix hello-world...
Use:
"kubectl get queue" to see queue assignment
"kubectl get pods" to see pods
```

Almost immediately, you'll be able to see the MiniCluster job admitted to the local queue:

```bash
kubectl get queue
```
```console
NAME         CLUSTERQUEUE    PENDING WORKLOADS   ADMITTED WORKLOADS
user-queue   cluster-queue   0                   1
```

And the 4 pods running (we are creating a networked cluster with 4 nodes):

```bash
kubectl get pods
```
```console
NAME                       READY   STATUS    RESTARTS   AGE
hello-world7qgqd-0-wp596   1/1     Running   0          7s
hello-world7qgqd-1-d7r87   1/1     Running   0          7s
hello-world7qgqd-2-rfn4t   1/1     Running   0          7s
hello-world7qgqd-3-blvtn   1/1     Running   0          7s
```

If you look at the logs of the main broker pod (index 0 of the job above), there is a lot of
debugging output, and you can see "hello world" running at the end:

```bash
kubectl logs hello-world7qgqd-0-wp596
```

<details>

<summary>Flux Operator Lead Broker Output</summary>

```console
🌀 Submit Mode: flux start -o --config /etc/flux/config -Scron.directory=/etc/flux/system/cron.d -Stbon.fanout=256 -Srundir=/run/flux -Sstatedir=/var/lib/flux -Slocal-uri=local:///run/flux/local -Slog-stderr-level=6 -Slog-stderr-mode=local flux submit -n 1 --quiet --watch echo hello world
broker.info[0]: start: none->join 0.399725ms
broker.info[0]: parent-none: join->init 0.030894ms
cron.info[0]: synchronizing cron tasks to event heartbeat.pulse
job-manager.info[0]: restart: 0 jobs
job-manager.info[0]: restart: 0 running jobs
job-manager.info[0]: restart: checkpoint.job-manager not found
broker.info[0]: rc1.0: running /etc/flux/rc1.d/01-sched-fluxion
sched-fluxion-resource.info[0]: version 0.27.0-15-gc90fbcc2
sched-fluxion-resource.warning[0]: create_reader: allowlist unsupported
sched-fluxion-resource.info[0]: populate_resource_db: loaded resources from core's resource.acquire
sched-fluxion-qmanager.info[0]: version 0.27.0-15-gc90fbcc2
broker.info[0]: rc1.0: running /etc/flux/rc1.d/02-cron
broker.info[0]: rc1.0: /etc/flux/rc1 Exited (rc=0) 0.5s
broker.info[0]: rc1-success: init->quorum 0.485239s
broker.info[0]: online: hello-world7qgqd-0 (ranks 0)
broker.info[0]: online: hello-world7qgqd-[0-3] (ranks 0-3)
broker.info[0]: quorum-full: quorum->run 0.354587s
hello world
broker.info[0]: rc2.0: flux submit -n 1 --quiet --watch echo hello world Exited (rc=0) 0.3s
broker.info[0]: rc2-success: run->cleanup 0.308392s
broker.info[0]: cleanup.0: flux queue stop --quiet --all --nocheckpoint Exited (rc=0) 0.1s
broker.info[0]: cleanup.1: flux cancel --user=all --quiet --states RUN Exited (rc=0) 0.1s
broker.info[0]: cleanup.2: flux queue idle --quiet Exited (rc=0) 0.1s
broker.info[0]: cleanup-success: cleanup->shutdown 0.252899s
broker.info[0]: children-complete: shutdown->finalize 47.6699ms
broker.info[0]: rc3.0: running /etc/flux/rc3.d/01-sched-fluxion
broker.info[0]: rc3.0: /etc/flux/rc3 Exited (rc=0) 0.2s
broker.info[0]: rc3-success: finalize->goodbye 0.212425s
broker.info[0]: goodbye: goodbye->exit 0.06917ms
```

</details>

If you submit and ask for four tasks, you'll see "hello world" four times:

```bash
python sample-flux-operator-job.py --tasks 4
```
```console
...
broker.info[0]: quorum-full: quorum->run 23.5812s
hello world
hello world
hello world
hello world
```

You can further customize the job, and can ask questions on the [Flux Operator issues board](https://github.com/flux-framework/flux-operator/issues).
Finally, for instructions on how to do this with YAML outside of Python, see [Run A Flux MiniCluster](/docs/tasks/run_flux_minicluster/).

### MPI Operator Job

For this example, we will be using the [MPI Operator](https://www.kubeflow.org/docs/components/training/mpi/)
to submit a job, and specifically using the [Python SDK](https://github.com/kubeflow/mpi-operator/tree/master/sdk/python/v2beta1) to do this easily. Given our Python environment created in the [setup](#before-you-begin), we can install this Python SDK directly to it as follows:

```bash
git clone --depth 1 https://github.com/kubeflow/mpi-operator /tmp/mpijob
cd /tmp/mpijob/sdk/python/v2beta1
python setup.py install
cd -
```

Importantly, the MPI Operator *must be installed before Kueue* for this to work! Let's start from scratch with a new Kind cluster.
We will also need to [install the MPI operator](https://github.com/kubeflow/mpi-operator/tree/master#installation) and Kueue.
Here we install the exact versions tested with this example:

```bash
kubectl apply -f https://github.com/kubeflow/mpi-operator/releases/download/v0.4.0/mpi-operator.yaml
kubectl apply -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.4.0/manifests.yaml
```

Check the [mpi-operator release page](https://github.com/kubeflow/mpi-operator/releases) and [Kueue release page](https://github.com/kubernetes-sigs/kueue/releases) for alternate versions.
You need to wait until Kueue is ready. You can determine this as follows:

```bash
# Wait until all pods in the kueue-system namespace are Running
kubectl get pods -n kueue-system
```

When Kueue is ready:

```bash
kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/kueue/main/site/static/examples/admin/single-clusterqueue-setup.yaml
```

Now try running the example MPI job.

```bash
python sample-mpijob.py
```
```console
📦️ Container image selected is mpioperator/mpi-pi:openmpi...
⭐️ Creating sample job with prefix pi...
Use:
"kubectl get queue" to see queue assignment
"kubectl get jobs" to see jobs
```

{{< include "examples/python/sample-mpijob.py" "python" >}}

After submitting, you can see that the queue has an admitted workload!

```bash
$ kubectl get queue
```
```console
NAME         CLUSTERQUEUE    PENDING WORKLOADS   ADMITTED WORKLOADS
user-queue   cluster-queue   0                   1
```

And that the job "pi-launcher" has started:

```bash
$ kubectl get jobs
NAME          COMPLETIONS   DURATION   AGE
pi-launcher   0/1           9s         9s
```

The MPI Operator works by way of a central launcher interacting with nodes via ssh. We can inspect
a worker and the launcher to get a glimpse of how both work:

```bash
$ kubectl logs pods/pi-worker-1
```
```console
Server listening on 0.0.0.0 port 22.
Server listening on :: port 22.
Accepted publickey for mpiuser from 10.244.0.8 port 51694 ssh2: ECDSA SHA256:rgZdwufXolOkUPA1w0bf780BNJC8e4/FivJb1/F7OOI
Received disconnect from 10.244.0.8 port 51694:11: disconnected by user
Disconnected from user mpiuser 10.244.0.8 port 51694
Received signal 15; terminating.
```

The job is fairly quick, and we can see the output of pi in the launcher:

```bash
$ kubectl logs pods/pi-launcher-f4gqv
```
```console
Warning: Permanently added 'pi-worker-0.pi-worker.default.svc,10.244.0.7' (ECDSA) to the list of known hosts.
Warning: Permanently added 'pi-worker-1.pi-worker.default.svc,10.244.0.9' (ECDSA) to the list of known hosts.
Rank 1 on host pi-worker-1
Workers: 2
Rank 0 on host pi-worker-0
pi is approximately 3.1410376000000002
```

That looks like pi! 🎉️🥧️
If you are interested in running this same example with YAML outside of Python, see [Run an MPIJob](/docs/tasks/run_mpi_jobs/).
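
If you would rather check on the launcher from Python than with `kubectl get jobs`, the following is a minimal sketch of one way to do it. It assumes the launcher Job is named `pi-launcher` in the `default` namespace, as in the output above. The helper names `summarize_job` and `summarize_launcher` are hypothetical and are not part of Kueue or the MPI Operator. The queue is read from the `kueue.x-k8s.io/queue-name` label and is reported as `None` when the Job does not carry that label.

```python
QUEUE_LABEL = "kueue.x-k8s.io/queue-name"


def summarize_job(job):
    """Summarize a Job given as a plain dict (for example, from V1Job.to_dict())."""
    metadata = job.get("metadata") or {}
    status = job.get("status") or {}
    labels = metadata.get("labels") or {}
    return {
        "name": metadata.get("name"),
        # None when the Job has no Kueue queue label
        "queue": labels.get(QUEUE_LABEL),
        # The API leaves this unset until at least one pod has succeeded
        "succeeded": status.get("succeeded") or 0,
    }


def summarize_launcher(name="pi-launcher", namespace="default"):
    """Read the launcher Job from the cluster and summarize it.

    The name and namespace defaults are assumptions taken from the
    example output in this guide.
    """
    # Imported here so summarize_job can be used without a cluster
    from kubernetes import client, config

    config.load_kube_config()
    batch_api = client.BatchV1Api()
    job = batch_api.read_namespaced_job(name, namespace)
    return summarize_job(job.to_dict())


if __name__ == "__main__":
    # Offline example with a hand-written Job dict
    example = {
        "metadata": {"name": "pi-launcher", "labels": {QUEUE_LABEL: "user-queue"}},
        "status": {"succeeded": 1},
    }
    print(summarize_job(example))
```

With a cluster available, calling `summarize_launcher()` returns the same kind of dictionary for the live Job. Keeping the summary logic separate from the API call makes it easy to reuse on the results of other `BatchV1Api` calls.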