# KEP-973: Workload priority

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories](#user-stories)
    - [Story 1](#story-1)
    - [Story 2](#story-2)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Kueue WorkloadPriorityClass API](#kueue-workloadpriorityclass-api)
  - [How to use WorkloadPriorityClass on Job](#how-to-use-workloadpriorityclass-on-job)
  - [How to use WorkloadPriorityClass on MPIJob](#how-to-use-workloadpriorityclass-on-mpijob)
  - [How workloads are created from Jobs](#how-workloads-are-created-from-jobs)
    - [1. A job specifies both <code>workload's priority</code> and <code>pod's priority</code>](#1-a-job-specifies-both--and-)
    - [2. A job specifies only <code>workload's priority</code>](#2-a-job-specifies-only-)
    - [3. A job specifies only <code>pod's priority</code>](#3-a-job-specifies-only-)
    - [4. A jobFramework specifies both <code>workload's priority</code> and <code>priorityClass</code>](#4-a-jobframework-specifies-both--and-)
    - [5. A jobFramework specifies only <code>workload's priority</code>](#5-a-jobframework-specifies-only-)
    - [6. A jobFramework specifies only <code>priorityClass</code>](#6-a-jobframework-specifies-only-)
  - [Where workload's Priority is used](#where-workloads-priority-is-used)
  - [Workload's priority values are always mutable](#workloads-priority-values-are-always-mutable)
  - [What happens when a user changes the priority of <code>workloadPriorityClass</code>?](#what-happens-when-a-user-changes-the-priority-of-)
  - [Validation webhook](#validation-webhook)
  - [Future works](#future-works)
  - [Test Plan](#test-plan)
    - [Unit Tests](#unit-tests)
    - [Integration tests](#integration-tests)
  - [Graduation Criteria](#graduation-criteria)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
<!-- /toc -->

## Summary

This proposal introduces `WorkloadPriorityClass`.
A `Workload` is able to utilize a `WorkloadPriorityClass`.
`WorkloadPriorityClass` is independent of the pod's priority.
The priority value is part of the workload spec, and the priority field of the workload is mutable.
In this document, the term `workload Priority` refers
to the priority used by the Kueue controller for managing the queueing
and preemption of workloads.
The term `pod Priority` denotes the priority used by the
kube-scheduler for preempting pods.

## Motivation

Several proposals have been submitted regarding the Kueue scheduling order.
However, under the current implementation, the priority of the `Workload` is tied to the priority of the pod. Therefore, it's not possible to change only the priority of the `Workload`. We don't want to change the pod priority because the pod's priority is tied to pod preemption. We need a mechanism that lets users freely modify the priority of the `Workload` alone, without affecting pod priority.

### Goals

Implement `WorkloadPriorityClass`. `Workload` can utilize `WorkloadPriorityClass`.
JobFrameworks such as Job and MPIJob specify the `WorkloadPriorityClass` through labels.
Users can modify the priority of a `Workload` by changing the `Workload`'s priority directly.

### Non-Goals

Using the existing k8s Pod `PriorityClass` for a Workload's priority is not recommended.
`WorkloadPriorityClass` doesn't implement all the features of the k8s Pod's `PriorityClass`
because some fields on the k8s `PriorityClass` are not relevant to Kueue.
When creating a new `WorkloadPriorityClass`, there is no need to create other CRDs owned by `WorkloadPriorityClass`. Therefore, the reconcile functionality is unnecessary. The `WorkloadPriorityClass` controller will not be implemented for now.

## Proposal

In this proposal, `WorkloadPriorityClass` is defined.
A `Workload` is able to utilize this `WorkloadPriorityClass`.
`WorkloadPriorityClass` is independent of the pod's priority.
The `Priority`, `PriorityClassName` and `PriorityClassSource` fields will be part of the workload spec.
The `Priority` field of the `workload` is always mutable because changing it can be useful for preemption.
The workload's `PriorityClassSource` and `PriorityClassName` fields are immutable for simplicity.
JobFrameworks such as Job and MPIJob specify the `WorkloadPriorityClass` through labels.

<!--
This is where we get down to the specifics of what the proposal actually is.
This should have enough detail that reviewers can understand exactly what
you're proposing, but should not include things like API designs or
implementation. What is the desired outcome and how do we measure success?.
The "Design Details" section below is for the real
nitty-gritty.
-->

### User Stories

Kueue issue [973](https://github.com/kubernetes-sigs/kueue/issues/973) provides details on the initial feature implementation.

#### Story 1

In an organization, admins want to set a lower priority for development workloads and a higher priority for production workloads.
In such cases, they create two `WorkloadPriorityClass` objects and apply each one to the respective workloads.

#### Story 2

An organization wants to modify the priority of workloads that remain inactive for a specific duration.
This can be achieved by developing a custom controller that manages the `Priority` value of the `Workload` spec.

### Risks and Mitigations

It's possible that the pod's priority conflicts with the workload's priority.
For example, a high-priority job with low-priority pods may never run to completion because it may always be preempted by kube-scheduler.
We should document the risks of pod preemption for users.
We can also point users to create a `PriorityClass` for their pods that is [non-preempting](https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/#non-preempting-priority-class).
If a workload's priority is high, its pod's priority is low, and the kube-scheduler initiates preemption, the pod's priority takes precedence. To prevent this behavior, the non-preempting setting is needed.


## Design Details

### Kueue WorkloadPriorityClass API

We introduce the `WorkloadPriorityClass` API.

```golang
type WorkloadPriorityClass struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	// Value is the priority that workloads using this class receive.
	Value int32 `json:"value"`
	// Description is free-form text describing when to use this class.
	Description string `json:"description,omitempty"`
}
```

A `PriorityClassSource` field is also added to `WorkloadSpec`.
The `PriorityClassName` field can accept both Pod `PriorityClass` and `WorkloadPriorityClass` names as values.
To distinguish the two, the `PriorityClassSource` field is set to `kueue.x-k8s.io/workloadpriorityclass` when a `WorkloadPriorityClass` is used.
When a k8s Pod `PriorityClass` is used, the `priorityClassSource` field is set to `scheduling.k8s.io/priorityclass`.

```golang
type WorkloadSpec struct {
	// ... (other fields elided)
	PriorityClassSource string `json:"priorityClassSource,omitempty"`
	// ... (other fields elided)
}
```

### How to use WorkloadPriorityClass on Job

The `workloadPriorityClass` is specified through the label `kueue.x-k8s.io/priority-class`.
This label is always mutable because it might be useful for the preemption.

```yaml
# sample-priority-class.yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: sample-priority
value: 10000
description: "Sample priority"
---
# sample-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: sample-job
  labels:
    kueue.x-k8s.io/queue-name: user-queue
    kueue.x-k8s.io/priority-class: sample-priority
spec:
  parallelism: 3
  completions: 3
  suspend: true
  template:
    spec:
      containers:
      - name: dummy-job
        image: gcr.io/k8s-staging-perf-tests/sleep:latest
      restartPolicy: Never
```

The following workload is generated by the YAML above.
The `PriorityClassName` field can accept either a `PriorityClass` or a `workloadPriorityClass` name as a value.
To distinguish the two, the `priorityClassSource` field is set to `kueue.x-k8s.io/workloadpriorityclass` when a `WorkloadPriorityClass` is used.
When a `PriorityClass` is used, the `priorityClassSource` field is set to `scheduling.k8s.io/priorityclass`.

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: Workload
metadata:
  name: job-sample-job-7f173
spec:
  priorityClassSource: kueue.x-k8s.io/workloadpriorityclass
  priorityClassName: sample-priority
  priority: 10000
  queueName: user-queue
  podSets:
  - count: 3
    name: dummy-job
    template:
      spec:
        containers:
        - image: gcr.io/k8s-staging-perf-tests/sleep:latest
          name: dummy-job
```

In this example, since the `kueue.x-k8s.io/priority-class` label of `sample-job` is set to `sample-priority`, the `priority` of the generated workload is set to 10,000.
During queuing and preemption of the workload, this priority value is used in the calculations.

### How to use WorkloadPriorityClass on MPIJob

The `workloadPriorityClass` is specified through the label `kueue.x-k8s.io/priority-class`.
The same applies to other CRDs such as `RayJob`.

```yaml
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: pi
  labels:
    kueue.x-k8s.io/queue-name: user-queue
    kueue.x-k8s.io/priority-class: sample-priority
spec:
  .....
```

### How workloads are created from Jobs

There are three scenarios for creating a workload from a job.

1. A job specifies both `workload's priority` and `pod's priority`
2. A job specifies only `workload's priority`
3. A job specifies only `pod's priority`

In the case of jobFrameworks, the following scenarios are considered. For jobFrameworks, the `priorityClass` is intended to reflect the `pod's priority`. Therefore, `workloadPriorityClass` is used for the `workload's priority` if the jobFramework has both `workloadPriorityClass` and `priorityClass`.

4. A jobFramework specifies both `workload's priority` and `priorityClass`
5. A jobFramework specifies only `workload's priority`
6. A jobFramework specifies only `priorityClass`

#### 1. A job specifies both `workload's priority` and `pod's priority`

When this YAML is applied, the `workloadPriorityClass` `sample-priority` is used for the `workload's priority`.
The `priorityClass` `high-priority`, on the other hand, is used for the `pod's priority`.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  generateName: sample-job-
  labels:
    kueue.x-k8s.io/queue-name: user-queue
    kueue.x-k8s.io/priority-class: sample-priority
spec:
  parallelism: 3
  completions: 3
  suspend: true
  template:
    spec:
      # priorityClassName is a Pod spec field, so it is set on the Pod template.
      priorityClassName: high-priority
      containers:
      - name: dummy-job
        image: gcr.io/k8s-staging-perf-tests/sleep:latest
      restartPolicy: Never
```

#### 2. A job specifies only `workload's priority`

When this YAML is applied, the `workloadPriorityClass` `sample-priority` is used for the `workload's priority`.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  generateName: sample-job-
  labels:
    kueue.x-k8s.io/queue-name: user-queue
    kueue.x-k8s.io/priority-class: sample-priority
spec:
  parallelism: 3
  completions: 3
  suspend: true
  template:
    spec:
      containers:
      - name: dummy-job
        image: gcr.io/k8s-staging-perf-tests/sleep:latest
      restartPolicy: Never
```

#### 3. A job specifies only `pod's priority`

When this YAML is applied, the `PriorityClass` `high-priority` is used for the `workload's priority`.
This is essentially the same as the current implementation of workload.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  generateName: sample-job-
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  parallelism: 3
  completions: 3
  suspend: true
  template:
    spec:
      # priorityClassName is a Pod spec field, so it is set on the Pod template.
      priorityClassName: high-priority
      containers:
      - name: dummy-job
        image: gcr.io/k8s-staging-perf-tests/sleep:latest
      restartPolicy: Never
```

#### 4. A jobFramework specifies both `workload's priority` and `priorityClass`

When this YAML is applied, the `workloadPriorityClass` `sample-priority` is used for the `workload's priority`.

```yaml
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: pi
  labels:
    kueue.x-k8s.io/queue-name: user-queue
    kueue.x-k8s.io/priority-class: sample-priority
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
    ttlSecondsAfterFinished: 60
    schedulingPolicy:
      priorityClass: high-priority
  sshAuthMountPath: /home/mpiuser/.ssh
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: mpioperator/mpi-pi:openmpi
            name: mpi-launcher
            securityContext:
              runAsUser: 1000
            command:
            - mpirun
            args:
            - -n
            - "2"
            - /home/mpiuser/pi
            resources:
              limits:
                cpu: 1
                memory: 1Gi
```

#### 5. A jobFramework specifies only `workload's priority`

When this YAML is applied, the `workloadPriorityClass` `sample-priority` is used for the `workload's priority`.

```yaml
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: pi
  labels:
    kueue.x-k8s.io/queue-name: user-queue
    kueue.x-k8s.io/priority-class: sample-priority
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
    ttlSecondsAfterFinished: 60
  sshAuthMountPath: /home/mpiuser/.ssh
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: mpioperator/mpi-pi:openmpi
            name: mpi-launcher
            securityContext:
              runAsUser: 1000
            command:
            - mpirun
            args:
            - -n
            - "2"
            - /home/mpiuser/pi
            resources:
              limits:
                cpu: 1
                memory: 1Gi
```

#### 6. A jobFramework specifies only `priorityClass`

When this YAML is applied, the `PriorityClass` `high-priority` is used for the `workload's priority`.
This is essentially the same as the current implementation of workload.

```yaml
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: pi
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
    ttlSecondsAfterFinished: 60
    schedulingPolicy:
      priorityClass: high-priority
  sshAuthMountPath: /home/mpiuser/.ssh
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: mpioperator/mpi-pi:openmpi
            name: mpi-launcher
            securityContext:
              runAsUser: 1000
            command:
            - mpirun
            args:
            - -n
            - "2"
            - /home/mpiuser/pi
            resources:
              limits:
                cpu: 1
                memory: 1Gi
```

### Where workload's Priority is used

The priority of workloads is used in queuing, preemption, and other scheduling processes in Kueue.
Introducing `workloadPriorityClass` does not change the places where priority is used in Kueue.
It only makes it possible to use a `workloadPriorityClass` as the source of that priority.

### Workload's priority values are always mutable

The workload's `Priority` field is always mutable because changing it can be useful for preemption.
The workload's `PriorityClassSource` and `PriorityClassName` fields are immutable for simplicity.
Note that there is an [open KEP](https://github.com/kubernetes/enhancements/pull/4129) to make `PriorityClass` mutable in k8s. This `workload` design aligns with the direction of the k8s `PriorityClass`.

### What happens when a user changes the priority of `workloadPriorityClass`?

The priority of existing workloads isn't altered even if the priority of a `workloadPriorityClass` has been updated. This is because users would like to modify priorities for individual workloads, as mentioned in [Story 2](#story-2).
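
The snapshot behavior this section describes can be sketched in Go as follows. The types and the `newWorkload` helper are simplified, hypothetical stand-ins chosen for illustration; they are not Kueue's actual API types or controller code.

```go
package main

import "fmt"

// WorkloadPriorityClass is a simplified stand-in that keeps only a name and a value.
type WorkloadPriorityClass struct {
	Name  string
	Value int32
}

// Workload is a simplified stand-in for the priority-related Workload spec fields.
type Workload struct {
	Name              string
	PriorityClassName string
	Priority          int32
}

// newWorkload is a hypothetical helper. It copies the class value into the
// workload at creation time, so the workload holds a snapshot of the value
// rather than a live reference to the class.
func newWorkload(name string, pc *WorkloadPriorityClass) Workload {
	return Workload{Name: name, PriorityClassName: pc.Name, Priority: pc.Value}
}

func main() {
	pc := &WorkloadPriorityClass{Name: "sample-priority", Value: 10000}
	existing := newWorkload("existing", pc)

	// An admin updates the class value after the first workload was created.
	pc.Value = 20000
	created := newWorkload("new", pc)

	// The existing workload keeps its original priority; the new one uses the latest value.
	fmt.Println(existing.Priority, created.Priority) // prints "10000 20000"
}
```

Because the value is copied rather than referenced, a custom controller can still adjust `Priority` on an individual workload without a later class update overwriting it.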
For newly created workloads, the priority is based on the latest priority value of the `workloadPriorityClass`.
As a result, even if the value of a `workloadPriorityClass` changes, the reconciliation process of the workload controller doesn't change the priority of existing workloads.

### Validation webhook

A workload webhook makes the `priorityClassName` and `priorityClassSource` fields of the workload CRD immutable.
Likewise, a job webhook makes the `workloadPriorityClass` label of jobs immutable.

### Future works

In the future, we plan to enable each organization using Kueue to customize the priority values according to its specific requirements through CRDs defined by that organization.

### Test Plan

No regressions in the current tests should be observed.

[X] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

#### Unit Tests

This change should be covered by unit tests.

#### Integration tests

The following scenarios will be covered with integration tests where `WorkloadPriorityClass` is used:
- Controller and webhook tests related to `Workload`
- Integration tests for job controller where the existing integration tests already cover `PriorityClass`
- e2e tests where the existing tests already cover `PriorityClass`

### Graduation Criteria


## Implementation History


## Drawbacks


## Alternatives