sigs.k8s.io/kueue@v0.6.2/keps/973-workload-priority/README.md

     1  # KEP-973: Workload priority
     2  
     3  <!-- toc -->
     4  - [Summary](#summary)
     5  - [Motivation](#motivation)
     6    - [Goals](#goals)
     7    - [Non-Goals](#non-goals)
     8  - [Proposal](#proposal)
     9    - [User Stories](#user-stories)
    10      - [Story 1](#story-1)
    11      - [Story 2](#story-2)
    12    - [Risks and Mitigations](#risks-and-mitigations)
    13  - [Design Details](#design-details)
    14    - [Kueue WorkloadPriorityClass API](#kueue-workloadpriorityclass-api)
    15    - [How to use WorkloadPriorityClass on Job](#how-to-use-workloadpriorityclass-on-job)
    16    - [How to use WorkloadPriorityClass on MPIJob](#how-to-use-workloadpriorityclass-on-mpijob)
    17    - [How workloads are created from Jobs](#how-workloads-are-created-from-jobs)
    18      - [1. A job specifies both <code>workload's priority</code> and <code>pod's priority</code>](#1-a-job-specifies-both--and-)
    19      - [2. A job specifies only <code>workload's priority</code>](#2-a-job-specifies-only-)
    20      - [3. A job specifies only <code>pod's priority</code>](#3-a-job-specifies-only-)
    21      - [4. A jobFramework specifies both <code>workload's priority</code> and <code>priorityClass</code>](#4-a-jobframework-specifies-both--and-)
    22      - [5. A jobFramework specifies only <code>workload's priority</code>](#5-a-jobframework-specifies-only-)
    23      - [6. A jobFramework specifies only <code>priorityClass</code>](#6-a-jobframework-specifies-only-)
    24    - [Where workload's Priority is used](#where-workloads-priority-is-used)
    25    - [Workload's priority values are always mutable](#workloads-priority-values-are-always-mutable)
    26    - [What happens when a user changes the priority of <code>workloadPriorityClass</code>?](#what-happens-when-a-user-changes-the-priority-of-)
    27    - [Validation webhook](#validation-webhook)
    28    - [Future works](#future-works)
    29    - [Test Plan](#test-plan)
    30      - [Unit Tests](#unit-tests)
    31      - [Integration tests](#integration-tests)
    32    - [Graduation Criteria](#graduation-criteria)
    33  - [Implementation History](#implementation-history)
    34  - [Drawbacks](#drawbacks)
    35  - [Alternatives](#alternatives)
    36  <!-- /toc -->
    37  
    38  ## Summary
    39  
    40  This proposal introduces the `WorkloadPriorityClass` API.
    41  A `Workload` can reference a `WorkloadPriorityClass` to obtain its priority.
    42  `WorkloadPriorityClass` is independent of the pod's priority.
    43  The priority value is part of the workload spec, and the workload's priority field is mutable.  
    44  In this document, the term `workload Priority` is used to refer
    45  to the priority utilized by Kueue controller for managing the queueing
    46  and preemption of workloads.  
    47  The term `pod Priority` is used to denote the priority utilized by the
    48  kube-scheduler for preempting pods.
    49  
    50  ## Motivation
    51  
    52  Several proposals concerning the Kueue scheduling order have been submitted.
    53  However, under the current implementation, the priority of the `Workload` is tied to the priority of the pod. Therefore, it's not possible to change only the priority of the `Workload`. We don't want to change the pod priority because pod's priority is tied to pod preemption. We need a mechanism where users can freely modify the priority of the `Workload` alone, not affecting pod priority.
    54  
    55  ### Goals
    56  
    57  Implement `WorkloadPriorityClass`. `Workload` can utilize `WorkloadPriorityClass`.  
    58  JobFrameworks such as Job and MPIJob specify the `WorkloadPriorityClass` through labels.  
    59  Users can modify the priority of a `Workload` by changing `Workload`'s priority directly.
    60  
    61  ### Non-Goals
    62  
    63  Using existing k8s Pod's `PriorityClass` for Workload's priority is not recommended.  
    64  `WorkloadPriorityClass` doesn't implement all the features of the k8s Pod's `PriorityClass`
    65  because some fields on the k8s `PriorityClass` are not relevant to Kueue.  
    66  When creating a new `WorkloadPriorityClass`, there is no need to create other CRDs owned by `WorkloadPriorityClass`. Therefore, the reconcile functionality is unnecessary. The `WorkloadPriorityClass` controller will not be implemented for now.
    67  
    68  ## Proposal
    69  
    70  In this proposal, `WorkloadPriorityClass` is defined.  
    71  The `Workload` is able to utilize this `WorkloadPriorityClass`.  
    72  `WorkloadPriorityClass` is independent from pod's priority.  
    73  `Priority`, `PriorityClassName` and `PriorityClassSource` fields will be part of the workload spec.
    74  The `Priority` field of a `Workload` is always mutable, because changing it can be useful for preemption.  
    75  The `PriorityClassSource` and `PriorityClassName` fields of a `Workload` are immutable for simplicity.  
    76  JobFrameworks such as Job and MPIJob specify the `WorkloadPriorityClass` through labels.
    77  
    87  ### User Stories
    88  
    89  Kueue issue [973](https://github.com/kubernetes-sigs/kueue/issues/973) provides details on the initial feature implementation.
    90  
    91  #### Story 1
    92  
    93  In an organization, admins want to set a lower priority for development workloads and a higher priority for production workloads.
    94  In such cases, they create two `WorkloadPriorityClass` objects and apply each one to the respective workloads.
    95  
    96  #### Story 2
    97  
    98  An organization wants to change the priority of workloads that remain inactive for a specific duration.
    99  A custom controller that manages the `Priority` value in the `Workload` spec can meet this requirement.
   100  
   101  ### Risks and Mitigations
   102  
   103  The pod's priority can conflict with the workload's priority.
   104  For example, a high-priority job with low-priority pods may never run to completion because its pods may always be preempted by kube-scheduler.
   105  We should document the risks of pod preemption for users.  
   106  We can also point users to create a `PriorityClass` for their pods that is [non-preempting](https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/#non-preempting-priority-class).  
   107  If a workload's priority is high, its pods' priority is low, and the kube-scheduler initiates preemption, the pod's priority takes precedence. To prevent this behavior, a non-preempting setting is needed.
   108  
   109  
   110  ## Design Details
   111  
   112  ### Kueue WorkloadPriorityClass API
   113  
   114  We introduce the `WorkloadPriorityClass` API.
   115  
   116  ```golang
   117  type WorkloadPriorityClass struct {
   118  	metav1.TypeMeta   `json:",inline"`
   119  	metav1.ObjectMeta `json:"metadata,omitempty"`
   120  
   121  	Value       int32  `json:"value"`
   122  	Description string `json:"description,omitempty"`
   123  }
   124  ```
   125  
   126  A `PriorityClassSource` field is also added to `WorkloadSpec`.  
   127  The `PriorityClassName` field can accept either a Pod's `PriorityClass` name or a `WorkloadPriorityClass` name as its value.
   128  To distinguish between them, when a `WorkloadPriorityClass` is used, the `PriorityClassSource` field is set to `kueue.x-k8s.io/workloadpriorityclass`.
   129  When a k8s Pod's `PriorityClass` is used, the `PriorityClassSource` field is set to `scheduling.k8s.io/priorityclass`.
   130  
   131  ```golang
   132  type WorkloadSpec struct {
   133    ...
   134    PriorityClassSource string `json:"priorityClassSource,omitempty"`
   135    ...
   136  }
   137  ```
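
As a minimal sketch of how the two source values could be told apart in code, the following uses a hypothetical `PrioritySpec` type that holds only the three priority-related fields and a hypothetical `DescribeClass` helper. Neither name comes from Kueue; only the two source strings are taken from the text above.

```go
package main

import "fmt"

// Values of priorityClassSource, as named in the text above.
const (
	WorkloadPriorityClassSource = "kueue.x-k8s.io/workloadpriorityclass"
	PodPriorityClassSource      = "scheduling.k8s.io/priorityclass"
)

// PrioritySpec is a hypothetical stand-in for the priority-related
// fields of WorkloadSpec; the real type has many more fields.
type PrioritySpec struct {
	PriorityClassSource string
	PriorityClassName   string
	Priority            int32
}

// DescribeClass is a hypothetical helper that reports which kind of object
// PriorityClassName refers to, based on PriorityClassSource.
func DescribeClass(s PrioritySpec) string {
	switch s.PriorityClassSource {
	case WorkloadPriorityClassSource:
		return "WorkloadPriorityClass/" + s.PriorityClassName
	case PodPriorityClassSource:
		return "PriorityClass/" + s.PriorityClassName
	default:
		return "none"
	}
}

func main() {
	s := PrioritySpec{
		PriorityClassSource: WorkloadPriorityClassSource,
		PriorityClassName:   "sample-priority",
		Priority:            10000,
	}
	fmt.Println(DescribeClass(s)) // WorkloadPriorityClass/sample-priority
}
```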
   138  
   139  ### How to use WorkloadPriorityClass on Job
   140  
   141  The `workloadPriorityClass` is specified through a label `kueue.x-k8s.io/priority-class`.
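   142  This label is immutable once the job is created (see [Validation webhook](#validation-webhook)); to change the priority afterwards, update the `priority` field of the generated workload.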
   143  
   144  ```yaml
   145  # sample-priority-class.yaml
   146  apiVersion: kueue.x-k8s.io/v1beta1
   147  kind: WorkloadPriorityClass
   148  metadata:
   149    name: sample-priority
   150  value: 10000
   151  description: "Sample priority"
   152  ---
   153  # sample-job.yaml
   154  apiVersion: batch/v1
   155  kind: Job
   156  metadata:
   157    name: sample-job
   158    labels:
   159      kueue.x-k8s.io/queue-name: user-queue
   160      kueue.x-k8s.io/priority-class: sample-priority
   161  spec:
   162    parallelism: 3
   163    completions: 3
   164    suspend: true
   165    template:
   166      spec:
   167        containers:
   168        - name: dummy-job
   169          image: gcr.io/k8s-staging-perf-tests/sleep:latest
   170        restartPolicy: Never
   171  ```
   172  
   173  The following workload is generated from the YAML above.
   174  The `priorityClassName` field can accept either a `PriorityClass` name or a `WorkloadPriorityClass` name as its value.
   175  To distinguish between them, when a `WorkloadPriorityClass` is used, the `priorityClassSource` field is set to `kueue.x-k8s.io/workloadpriorityclass`.
   176  When a `PriorityClass` is used, the `priorityClassSource` field is set to `scheduling.k8s.io/priorityclass`.
   177  
   178  ```yaml
   179  apiVersion: kueue.x-k8s.io/v1beta1
   180  kind: Workload
   181  metadata:
   182    name: job-sample-job-7f173
   183  spec:
   184    priorityClassSource: kueue.x-k8s.io/workloadpriorityclass
   185    priorityClassName: sample-priority
   186    priority: 10000
   187    queueName: user-queue
   188    podSets:
   189    - count: 3
   190      name: dummy-job
   191      template:
   192        spec:
   193          containers:
   194          - image: gcr.io/k8s-staging-perf-tests/sleep:latest
   195            name: dummy-job
   196  ```
   197  
   198  In this example, since the `kueue.x-k8s.io/priority-class` label of `sample-job` is set to `sample-priority`, the `priority` of the generated workload is set to 10000.
   199  This priority value is used when queuing and preempting the workload.
   200  
   201  ### How to use WorkloadPriorityClass on MPIJob
   202  
   203  The `workloadPriorityClass` is specified through a label `kueue.x-k8s.io/priority-class`.
   204  The same applies to other CRDs such as `RayJob`.
   205  
   206  ```yaml
   207  apiVersion: kubeflow.org/v2beta1
   208  kind: MPIJob
   209  metadata:
   210    name: pi
   211    labels:
   212      kueue.x-k8s.io/queue-name: user-queue
   213      kueue.x-k8s.io/priority-class: sample-priority
   214  spec:
   215  .....
   216  ```
   217  
   218  ### How workloads are created from Jobs
   219  
   220  There are three scenarios for creating a workload from a job.
   221  
   222  1. A job specifies both `workload's priority` and `pod's priority`
   223  2. A job specifies only `workload's priority`
   224  3. A job specifies only `pod's priority`
   225  
   226  In the case of jobFrameworks, the following scenarios are considered. For jobFrameworks, the `priorityClass` is intended to reflect the `pod's priority`. Therefore, `workloadPriorityClass` is used for the `workload's priority` if jobFramework has both `workloadPriorityClass` and `priorityClass`.
   227  
   228  4. A jobFramework specifies both `workload's priority` and `priorityClass`
   229  5. A jobFramework specifies only `workload's priority`
   230  6. A jobFramework specifies only `priorityClass`
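
The six scenarios reduce to one precedence rule, which can be sketched as follows. This is an illustrative sketch, not Kueue's implementation: `resolvePriority` and `resolved` are hypothetical names, the two maps stand in for API lookups, and both the "class not found" error and the zero-value result when neither class is set are assumptions that the text does not specify.

```go
package main

import "fmt"

const (
	workloadPriorityClassSource = "kueue.x-k8s.io/workloadpriorityclass"
	podPriorityClassSource      = "scheduling.k8s.io/priorityclass"
)

// resolved holds the three priority fields written to the Workload spec.
type resolved struct {
	Source   string
	Name     string
	Priority int32
}

// resolvePriority picks the workload's priority.
// wpcName is the value of the kueue.x-k8s.io/priority-class label ("" if unset);
// pcName is the pod-level priority class name ("" if unset).
// wpcs and pcs are hypothetical lookup tables standing in for API reads.
func resolvePriority(wpcName, pcName string, wpcs, pcs map[string]int32) (resolved, error) {
	// Scenarios 1, 2, 4, 5: the WorkloadPriorityClass wins whenever it is set.
	if wpcName != "" {
		v, ok := wpcs[wpcName]
		if !ok {
			return resolved{}, fmt.Errorf("WorkloadPriorityClass %q not found", wpcName)
		}
		return resolved{Source: workloadPriorityClassSource, Name: wpcName, Priority: v}, nil
	}
	// Scenarios 3, 6: fall back to the pod's PriorityClass.
	if pcName != "" {
		v, ok := pcs[pcName]
		if !ok {
			return resolved{}, fmt.Errorf("PriorityClass %q not found", pcName)
		}
		return resolved{Source: podPriorityClassSource, Name: pcName, Priority: v}, nil
	}
	// Assumption: with neither class set, all fields keep their zero values.
	return resolved{}, nil
}

func main() {
	wpcs := map[string]int32{"sample-priority": 10000}
	pcs := map[string]int32{"high-priority": 1000000}

	both, _ := resolvePriority("sample-priority", "high-priority", wpcs, pcs)
	fmt.Println(both.Source, both.Priority)

	podOnly, _ := resolvePriority("", "high-priority", wpcs, pcs)
	fmt.Println(podOnly.Source, podOnly.Priority)
}
```

In scenarios 1 and 4 the pod's `PriorityClass` still applies to the pods themselves; it is only ignored when computing the workload's priority.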
   231  
   232  #### 1. A job specifies both `workload's priority` and `pod's priority`
   233  
   234  When this YAML is applied, the `workloadPriorityClass` `sample-priority` is used for the `workload's priority`.
   235  The `priorityClass` `high-priority`, on the other hand, is used for the `pod's priority`.
   236  
   237  ```yaml
   238  apiVersion: batch/v1
   239  kind: Job
   240  metadata:
   241    generateName: sample-job-
   242    labels:
   243      kueue.x-k8s.io/queue-name: user-queue
   244      kueue.x-k8s.io/priority-class: sample-priority
   245  spec:
   246    parallelism: 3
   247    completions: 3
   248    suspend: true
   249    template:
   250      spec:
   251        priorityClassName: high-priority
   252        containers:
   253        - name: dummy-job
   254          image: gcr.io/k8s-staging-perf-tests/sleep:latest
   255        restartPolicy: Never
   256  ```
   257  
   258  #### 2. A job specifies only `workload's priority`
   259  
   260  When this YAML is applied, the `workloadPriorityClass` `sample-priority` is used for the `workload's priority`.
   261  
   262  ```yaml
   263  apiVersion: batch/v1
   264  kind: Job
   265  metadata:
   266    generateName: sample-job-
   267    labels:
   268      kueue.x-k8s.io/queue-name: user-queue
   269      kueue.x-k8s.io/priority-class: sample-priority
   270  spec:
   271    parallelism: 3
   272    completions: 3
   273    suspend: true
   274    template:
   275      spec:
   276        containers:
   277        - name: dummy-job
   278          image: gcr.io/k8s-staging-perf-tests/sleep:latest
   279        restartPolicy: Never
   280  ```
   281  
   282  #### 3. A job specifies only `pod's priority`
   283  
   284  When this YAML is applied, the `PriorityClass` `high-priority` is used for the `workload's priority`.
   285  This is essentially the same as the current implementation of workload.
   286  
   287  ```yaml
   288  apiVersion: batch/v1
   289  kind: Job
   290  metadata:
   291    generateName: sample-job-
   292    labels:
   293      kueue.x-k8s.io/queue-name: user-queue
   294  spec:
   295    parallelism: 3
   296    completions: 3
   297    suspend: true
   298    template:
   299      spec:
   300        priorityClassName: high-priority
   301        containers:
   302        - name: dummy-job
   303          image: gcr.io/k8s-staging-perf-tests/sleep:latest
   304        restartPolicy: Never
   305  ```
   306  
   307  #### 4. A jobFramework specifies both `workload's priority` and `priorityClass`
   308  
   309  When this YAML is applied, the `workloadPriorityClass` `sample-priority` is used for the `workload's priority`.
   310  
   311  ```yaml
   312  apiVersion: kubeflow.org/v2beta1
   313  kind: MPIJob
   314  metadata:
   315    name: pi
   316    labels:
   317      kueue.x-k8s.io/queue-name: user-queue
   318      kueue.x-k8s.io/priority-class: sample-priority
   319  spec:
   320    slotsPerWorker: 1
   321    runPolicy:
   322      cleanPodPolicy: Running
   323      ttlSecondsAfterFinished: 60
   324      schedulingPolicy:
   325        priorityClass: high-priority
   326    sshAuthMountPath: /home/mpiuser/.ssh
   327    mpiReplicaSpecs:
   328      Launcher:
   329        replicas: 1
   330        template:
   331          spec:
   332            containers:
   333            - image: mpioperator/mpi-pi:openmpi
   334              name: mpi-launcher
   335              securityContext:
   336                runAsUser: 1000
   337              command:
   338              - mpirun
   339              args:
   340              - -n
   341              - "2"
   342              - /home/mpiuser/pi
   343              resources:
   344                limits:
   345                  cpu: 1
   346                  memory: 1Gi
   347  ```
   348  
   349  #### 5. A jobFramework specifies only `workload's priority`
   350  
   351  When this YAML is applied, the `workloadPriorityClass` `sample-priority` is used for the `workload's priority`.
   352  
   353  ```yaml
   354  apiVersion: kubeflow.org/v2beta1
   355  kind: MPIJob
   356  metadata:
   357    name: pi
   358    labels:
   359      kueue.x-k8s.io/queue-name: user-queue
   360      kueue.x-k8s.io/priority-class: sample-priority
   361  spec:
   362    slotsPerWorker: 1
   363    runPolicy:
   364      cleanPodPolicy: Running
   365      ttlSecondsAfterFinished: 60
   366    sshAuthMountPath: /home/mpiuser/.ssh
   367    mpiReplicaSpecs:
   368      Launcher:
   369        replicas: 1
   370        template:
   371          spec:
   372            containers:
   373            - image: mpioperator/mpi-pi:openmpi
   374              name: mpi-launcher
   375              securityContext:
   376                runAsUser: 1000
   377              command:
   378              - mpirun
   379              args:
   380              - -n
   381              - "2"
   382              - /home/mpiuser/pi
   383              resources:
   384                limits:
   385                  cpu: 1
   386                  memory: 1Gi
   387  ```
   388  
   389  #### 6. A jobFramework specifies only `priorityClass`
   390  
   391  When this YAML is applied, the `PriorityClass` `high-priority` is used for the `workload's priority`.
   392  This is essentially the same as the current implementation of workload.
   393  
   394  ```yaml
   395  apiVersion: kubeflow.org/v2beta1
   396  kind: MPIJob
   397  metadata:
   398    name: pi
   399    labels:
   400      kueue.x-k8s.io/queue-name: user-queue
   401  spec:
   402    slotsPerWorker: 1
   403    runPolicy:
   404      cleanPodPolicy: Running
   405      ttlSecondsAfterFinished: 60
   406      schedulingPolicy:
   407        priorityClass: high-priority
   408    sshAuthMountPath: /home/mpiuser/.ssh
   409    mpiReplicaSpecs:
   410      Launcher:
   411        replicas: 1
   412        template:
   413          spec:
   414            containers:
   415            - image: mpioperator/mpi-pi:openmpi
   416              name: mpi-launcher
   417              securityContext:
   418                runAsUser: 1000
   419              command:
   420              - mpirun
   421              args:
   422              - -n
   423              - "2"
   424              - /home/mpiuser/pi
   425              resources:
   426                limits:
   427                  cpu: 1
   428                  memory: 1Gi
   429  ```
   430  
   431  ### Where workload's Priority is used
   432  
   433  The priority of workloads is utilized in queuing, preemption, and other scheduling processes in Kueue.
   434  With the introduction of `workloadPriorityClass`, there is no change in the places where priority is used in Kueue.
   435  It just enables the usage of `workloadPriorityClass` as the priority.
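
To illustrate how a priority value can drive queue ordering, here is a small sketch. It is not Kueue's queue implementation: the `pending` type and `orderQueue` function are hypothetical, and breaking ties by creation time is an assumption made for this example.

```go
package main

import (
	"fmt"
	"sort"
)

// pending is a hypothetical, trimmed-down view of a queued workload.
type pending struct {
	Name        string
	Priority    int32
	CreatedUnix int64 // creation time in seconds since the epoch
}

// orderQueue returns a copy sorted by priority (highest first).
// Assumption for this sketch: ties are broken by creation time, oldest first.
func orderQueue(ws []pending) []pending {
	out := append([]pending(nil), ws...)
	sort.SliceStable(out, func(i, j int) bool {
		if out[i].Priority != out[j].Priority {
			return out[i].Priority > out[j].Priority
		}
		return out[i].CreatedUnix < out[j].CreatedUnix
	})
	return out
}

func main() {
	queue := []pending{
		{Name: "dev-a", Priority: 100, CreatedUnix: 1},
		{Name: "prod-a", Priority: 10000, CreatedUnix: 3},
		{Name: "dev-b", Priority: 100, CreatedUnix: 2},
	}
	for _, w := range orderQueue(queue) {
		fmt.Println(w.Name, w.Priority)
	}
}
```

The ordering is the same regardless of whether the `Priority` value came from a `WorkloadPriorityClass` or from a pod's `PriorityClass`.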
   436  
   437  ### Workload's priority values are always mutable
   438  
   439  The `Priority` field of a workload is always mutable, because changing it can be useful for preemption.  
   440  The `PriorityClassSource` and `PriorityClassName` fields of a workload are immutable for simplicity.  
   441  Note that there is an [open KEP](https://github.com/kubernetes/enhancements/pull/4129) to make `PriorityClass` mutable in k8s. This design of `workload` aligns with the direction of k8s `PriorityClass`.
   442  
   443  ### What happens when a user changes the priority of `workloadPriorityClass`?
   444  
   445  The priority of existing workloads isn't altered even if the priority of a `workloadPriorityClass` is updated. This is because users may want to modify priorities for individual workloads, as mentioned in [Story 2](#story-2).
   446  For newly created workloads, the priority is based on the latest priority value of the `workloadPriorityClass`.
   447  As a result, even if the value of a `workloadPriorityClass` changes, the reconciliation process of the workload controller doesn't change the priority of existing workloads.
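
The behavior above amounts to copying the class value into the workload at creation time. A minimal sketch, using hypothetical stand-in types rather than the real Kueue API types:

```go
package main

import "fmt"

// Hypothetical, trimmed-down stand-ins for the API objects.
type WorkloadPriorityClass struct {
	Name  string
	Value int32
}

type Workload struct {
	Name              string
	PriorityClassName string
	Priority          int32
}

// newWorkload copies the class value into the workload at creation time.
// Later changes to the class are not propagated to the workload.
func newWorkload(name string, wpc *WorkloadPriorityClass) Workload {
	return Workload{Name: name, PriorityClassName: wpc.Name, Priority: wpc.Value}
}

func main() {
	wpc := &WorkloadPriorityClass{Name: "sample-priority", Value: 10000}

	existing := newWorkload("job-a", wpc)
	wpc.Value = 20000 // an admin updates the class
	created := newWorkload("job-b", wpc)

	fmt.Println(existing.Priority, created.Priority) // 10000 20000
}
```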
   448  
   449  ### Validation webhook
   450  
   451  The workload webhook makes the `PriorityClassName` and `PriorityClassSource` fields of the workload CRD immutable.  
   452  Likewise, the job webhook makes the `kueue.x-k8s.io/priority-class` label of jobs immutable.
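
The update rule for workloads can be sketched as a plain function. This is an illustration only: `prioritySpec` and `validatePriorityUpdate` are hypothetical names, and a real admission webhook would report field-level validation errors rather than a single `error`.

```go
package main

import (
	"errors"
	"fmt"
)

// prioritySpec is a hypothetical stand-in for the priority fields of a workload spec.
type prioritySpec struct {
	PriorityClassSource string
	PriorityClassName   string
	Priority            int32
}

// validatePriorityUpdate rejects changes to the immutable fields.
// Priority is deliberately not compared, so it stays mutable.
func validatePriorityUpdate(oldSpec, newSpec prioritySpec) error {
	if oldSpec.PriorityClassName != newSpec.PriorityClassName {
		return errors.New("spec.priorityClassName is immutable")
	}
	if oldSpec.PriorityClassSource != newSpec.PriorityClassSource {
		return errors.New("spec.priorityClassSource is immutable")
	}
	return nil
}

func main() {
	oldSpec := prioritySpec{
		PriorityClassSource: "kueue.x-k8s.io/workloadpriorityclass",
		PriorityClassName:   "sample-priority",
		Priority:            10000,
	}

	bumped := oldSpec
	bumped.Priority = 20000 // allowed: only the priority value changes
	fmt.Println(validatePriorityUpdate(oldSpec, bumped)) // <nil>

	renamed := oldSpec
	renamed.PriorityClassName = "other-priority" // rejected
	fmt.Println(validatePriorityUpdate(oldSpec, renamed))
}
```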
   453  
   454  ### Future works
   455  
   456  In the future, we plan to enable each organization using Kueue to customize the priority values according to their specific requirements through CRDs defined by each organization.
   457  
   458  ### Test Plan
   459  
   460  No regressions in the current tests should be observed.
   461  
   462  [X] I/we understand the owners of the involved components may require updates to
   463  existing tests to make this code solid enough prior to committing the changes necessary
   464  to implement this enhancement.
   465  
   466  #### Unit Tests
   467  
   468  This change should be covered by unit tests.
   469  
   470  #### Integration tests
   471  
   472  The following scenarios will be covered with integration tests where `WorkloadPriorityClass` is used:
   473  - Controller and webhook tests related to `Workload`
   474  - Integration tests for job controller where the existing integration tests already cover `PriorityClass`
   475  - e2e tests where the existing tests already cover `PriorityClass`
   476  
   477  ### Graduation Criteria
   478  
   479  
   480  ## Implementation History
   481  
   482  
   483  ## Drawbacks
   484  
   485  
   486  ## Alternatives