# KEP-83: Workload preemption

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories (Optional)](#user-stories-optional)
    - [Story 1](#story-1)
  - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
    - [Why no control to opt-out a ClusterQueue from preemption](#why-no-control-to-opt-out-a-clusterqueue-from-preemption)
  - [Reassigning flavors after preemption](#reassigning-flavors-after-preemption)
  - [Risks and Mitigations](#risks-and-mitigations)
    - [Workload preemption doesn't imply immediate Pod termination](#workload-preemption-doesnt-imply-immediate-pod-termination)
  - [Increased admission latency](#increased-admission-latency)
- [Design Details](#design-details)
  - [ClusterQueue API changes](#clusterqueue-api-changes)
  - [Changes in scheduling algorithm](#changes-in-scheduling-algorithm)
    - [Detecting Workloads that might benefit from preemption](#detecting-workloads-that-might-benefit-from-preemption)
    - [Sorting Workloads that are heads of ClusterQueues](#sorting-workloads-that-are-heads-of-clusterqueues)
    - [Admission](#admission)
    - [Preemption](#preemption)
  - [Test Plan](#test-plan)
      - [Prerequisite testing updates](#prerequisite-testing-updates)
    - [Unit Tests](#unit-tests)
    - [Integration tests](#integration-tests)
    - [E2E tests](#e2e-tests)
  - [Graduation Criteria](#graduation-criteria)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
  - [Allow high priority jobs to borrow quota while preempting](#allow-high-priority-jobs-to-borrow-quota-while-preempting)
  - [Inform how costly it is to interrupt a Workload](#inform-how-costly-it-is-to-interrupt-a-workload)
  - [Penalizing long running workloads](#penalizing-long-running-workloads)
  - [Terminating Workloads on preemption](#terminating-workloads-on-preemption)
  - [Extra knobs in ClusterQueue preemption policy](#extra-knobs-in-clusterqueue-preemption-policy)
<!-- /toc -->

## Summary

This enhancement introduces workload preemption, a mechanism to suspend
workloads when:
- ClusterQueues under their minimum quota need the resources that are currently
  borrowed by other ClusterQueues in the cohort. Alternatively, we say that the
  ClusterQueue needs to _reclaim_ its quota.
- Within a ClusterQueue, there are running Workloads with lower priority than
  a pending Workload.

API fields in the ClusterQueue spec determine preemption policies.

## Motivation

When ClusterQueues under their minimum quota lend resources, they should
be able to recover those resources quickly, so that they can admit Workloads
during sudden spikes. Similarly, a ClusterQueue should be able to
recover quota from low priority workloads that are currently running.

Currently, the only mechanism to recover those resources is to wait for
Workloads to finish, which can take an unbounded amount of time.

### Goals

- Preempt Workloads from ClusterQueues borrowing resources when other
  ClusterQueues in the cohort, under their minimum quota, need the resources.
- Preempt Workloads within a ClusterQueue when a high priority Workload doesn't
  fit in the available quota, independently of borrowed quota.
- Introduce API fields in ClusterQueue to control when preemption occurs.

### Non-Goals

- Graceful termination of Workloads is left to the workload pods to implement.
- Tracking usage by workloads that take an arbitrary time to be suspended; for
  example, the integration with Job uses the [suspend field](https://kubernetes.io/docs/concepts/workloads/controllers/job/#suspending-a-job).
  See [Workload preemption doesn't imply immediate Pod termination](#workload-preemption-doesnt-imply-immediate-pod-termination) to learn more.
- Partial workload preemption.
- Terminating workloads on preemption.
- Penalizing workloads with the same priority that have been running for a long
  time.

## Proposal

This enhancement proposes the introduction of a field in the ClusterQueue to
determine the preemption policy for two scenarios:
- Reclaiming quota: a pending workload fits in the quota that is currently
  borrowed by other ClusterQueues in the cohort.
- Pending high priority Workload: the ClusterQueue is out of quota, but there
  are low priority active Workloads.

The enhancement also includes an algorithm for selecting a set of Workloads to
be preempted from the ClusterQueue or the cohort (to reclaim borrowed quota).

### User Stories (Optional)

#### Story 1

As a cluster administrator, I want to control preemption of active Workloads
within the ClusterQueue and/or cohort to accommodate a pending Workload.

A possible configuration looks like the following:

```yaml
apiVersion: kueue.x-k8s.io/v1alpha2
kind: ClusterQueue
metadata:
  name: cluster-total
spec:
  preemption:
    withinCohort: ReclaimFromAny
    withinClusterQueue: LowerPriority
```

### Notes/Constraints/Caveats (Optional)

#### Why no control to opt-out a ClusterQueue from preemption

In a cohort, some ClusterQueues could have high priority Workloads running, so
it might be desirable not to disturb them.

However, this can be achieved in two ways:
- Configuring the ClusterQueue with high priority Workloads to never borrow
  (through `.quota.max`), while owning a big part or all of the quota for the
  cohort.
- Configuring other ClusterQueues to not preempt workloads in the cohort when
  reclaiming, or to only do so for incoming workloads that have higher priority
  than the running workloads. In other words, the control is on the
  ClusterQueue that is lending the resources, rather than the borrower.

### Reassigning flavors after preemption

When a Job is first admitted, Kueue's job controller modifies its pod template
to inject a node selector coming from the ResourceFlavor.

On preemption, the job controller resets the template back to the original
nodeSelector, stored in the Workload spec ([implementation](https://github.com/kubernetes-sigs/kueue/blob/f24c63accaad461dfe582b21819dbf3a5d75dd60/pkg/controller/workload/job/job_controller.go#L246-251)).

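A rough sketch of that reset, simplified from the linked implementation (the
`original` map stands for the node selector saved in the Workload's pod
template):

```golang
import (
  "maps"

  batchv1 "k8s.io/api/batch/v1"
)

// restoreNodeSelector undoes the flavor injection on preemption: the Job's
// pod template gets back the node selector it had before admission.
func restoreNodeSelector(job *batchv1.Job, original map[string]string) {
  job.Spec.Template.Spec.NodeSelector = maps.Clone(original)
}
```
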
### Risks and Mitigations

#### Workload preemption doesn't imply immediate Pod termination

When Kueue issues a Workload preemption, the workload API integration controller
is expected to start removing Pods.
In the case of Kubernetes' batch.v1/Job, the following steps happen:
1. Kueue's job controller sets the
   [`.spec.suspend`](https://kubernetes.io/docs/concepts/workloads/controllers/job/#suspending-a-job)
   field to true.
2. The Kubernetes job controller deletes the Job's Pods.
3. The kubelets send SIGTERM signals to the Pods' containers, which can
   implement graceful termination logic.

This implies the following:
- Pods of a workload could implement checkpointing as part of their graceful
  termination.
- The resources from these Pods are not immediately available, and releasing
  them could be arbitrarily delayed.
- While Pods are terminating, a ClusterQueue's quota could be oversubscribed.

The Kubernetes Job status includes the number of Pending/Running Pods that are
not terminating (that is, they don't have a `.metadata.deletionTimestamp`). We
could use this information and write the old admission spec into an annotation
to keep track of usage from non-terminating Pods, but this is left for future
work.

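A hypothetical sketch of that deferred tracking (names like `perPodRequest`
are illustrative; the value would come from the admission spec saved in an
annotation):

```golang
import (
  batchv1 "k8s.io/api/batch/v1"
  corev1 "k8s.io/api/core/v1"
  "k8s.io/apimachinery/pkg/api/resource"
)

// residualUsage estimates the quota still held by a preempted Job:
// .status.active counts Pending/Running Pods without a deletionTimestamp.
func residualUsage(job *batchv1.Job, perPodRequest corev1.ResourceList) corev1.ResourceList {
  active := int64(job.Status.Active)
  usage := corev1.ResourceList{}
  for name, qty := range perPodRequest {
    usage[name] = *resource.NewMilliQuantity(qty.MilliValue()*active, qty.Format)
  }
  return usage
}
```
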
### Increased admission latency

Calculating and executing preemption is expensive. Potentially, every
workload might benefit from preemption of running Workloads.

To mitigate this, we will keep track of the minimum priority among the running
Workloads in a ClusterQueue. If that minimum priority is higher than or equal
to the priority of the incoming Workload, we skip preemption for it altogether.

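A minimal sketch of this fast path, using a simplified stand-in for the
ClusterQueue snapshot:

```golang
// clusterQueueSnapshot is a simplified stand-in; minRunningPriority is
// assumed to be maintained incrementally as Workloads are admitted and
// finish.
type clusterQueueSnapshot struct {
  minRunningPriority int32
}

// canBenefitFromPreemption skips the preemption pass when no running
// Workload has strictly lower priority than the incoming one.
func canBenefitFromPreemption(incomingPriority int32, cq *clusterQueueSnapshot) bool {
  return incomingPriority > cq.minRunningPriority
}
```
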
The assumption is that workloads with low priority are more common than
workloads with higher priority, and that Workloads are sent to ClusterQueues
where most Workloads have the same priority.

Additionally, the preemption algorithm is mostly a linear pass over the running
workloads (plus sorting), so it doesn't add a significant complexity overhead
over building the scheduling snapshot every cycle.

The API updates from preemption will be executed in parallel.

## Design Details

The proposal consists of new API fields and a preemption algorithm.

### ClusterQueue API changes

The new API fields in ClusterQueue describe how to influence the selection
of Workloads to preempt.

```golang
type ClusterQueueSpec struct {
  ...
  // preemption describes policies to preempt Workloads from this ClusterQueue
  // or the ClusterQueue's cohort.
  //
  // Preemption can happen in two scenarios:
  //
  // - When a Workload fits within the min quota of the ClusterQueue, but the
  //   quota is currently borrowed by other ClusterQueues in the cohort.
  //   Preempting Workloads in other ClusterQueues allows this ClusterQueue to
  //   reclaim its min quota.
  // - When a Workload doesn't fit within the min quota of the ClusterQueue
  //   and there are active Workloads with lower priority.
  //
  // The preemption algorithm tries to find a minimal set of Workloads to
  // preempt to accommodate the pending Workload, preempting Workloads with
  // lower priority first.
  Preemption ClusterQueuePreemption
}

type PreemptionPolicy string

const (
  PreemptionPolicyNever                    PreemptionPolicy = "Never"
  PreemptionPolicyReclaimFromLowerPriority PreemptionPolicy = "ReclaimFromLowerPriority"
  PreemptionPolicyReclaimFromAny           PreemptionPolicy = "ReclaimFromAny"
  PreemptionPolicyLowerPriority            PreemptionPolicy = "LowerPriority"
)

type ClusterQueuePreemption struct {
  // withinCohort determines whether a pending Workload can preempt Workloads
  // from other ClusterQueues in the cohort that are using more than their min
  // quota.
  // Possible values are:
  // - `Never` (default): do not preempt workloads in the cohort.
  // - `ReclaimFromLowerPriority`: if the pending workload fits within the min
  //   quota of its ClusterQueue, only preempt workloads in the cohort that have
  //   lower priority than the pending Workload.
  // - `ReclaimFromAny`: if the pending workload fits within the min quota of
  //   its ClusterQueue, preempt any workload in the cohort.
  WithinCohort PreemptionPolicy

  // withinClusterQueue determines whether a pending workload that doesn't fit
  // within the min quota for its ClusterQueue can preempt active Workloads in
  // the ClusterQueue.
  // Possible values are:
  // - `Never` (default): do not preempt workloads in the ClusterQueue.
  // - `LowerPriority`: only preempt workloads in the ClusterQueue that have
  //   lower priority than the pending Workload.
  WithinClusterQueue PreemptionPolicy
}
```

### Changes in scheduling algorithm

The following changes in the scheduling algorithm are required to implement
preemption.

#### Detecting Workloads that might benefit from preemption

The first stage during scheduling is to assign flavors to each resource of
a workload.

The algorithm is as follows:

    For each resource (or set of resources with the same flavors), evaluate
    flavors in the order established in the ClusterQueue:

    0. Find a flavor that still has quota in the cohort (borrowing allowed),
       but doesn't surpass the max quota for the CQ. Keep track of whether
       borrowing was needed.
    1. [New step] If no flavor was found, find a flavor that is able to contain
       the request within the min quota of the ClusterQueue. This flavor
       assignment could be satisfied with preemption.

Some highlights:
- A Workload could get flavor assignments at different steps for different
  resources.
- Assignments that require preemption implicitly do not borrow quota.

A flavor assignment from step 1 means that we need to preempt, or wait for
other workloads in the cohort and/or ClusterQueue to finish, to accommodate
this workload.

[#312](https://github.com/kubernetes-sigs/kueue/issues/312) discusses different
strategies to select a flavor.

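A sketch of the two-step assignment for a single resource, using simplified
stand-ins for Kueue's internal quota bookkeeping:

```golang
type flavorQuota struct {
  name        string
  cqUsed      int64 // usage by this ClusterQueue
  cqMin       int64 // min quota of this ClusterQueue
  cqMax       int64 // max quota of this ClusterQueue
  cohortUsed  int64 // usage by the whole cohort
  cohortTotal int64 // sum of min quotas in the cohort
}

type assignmentMode int

const (
  modeNoFit   assignmentMode = iota
  modeFit                    // step 0: fits, possibly borrowing
  modePreempt                // step 1 (new): fits min quota via preemption
)

// assignFlavor evaluates flavors in the ClusterQueue's order.
func assignFlavor(request int64, flavors []flavorQuota) (string, assignmentMode) {
  // Step 0: a flavor with free quota in the cohort, within the CQ's max.
  // (The real algorithm also records whether f.cqUsed+request > f.cqMin,
  // i.e. whether borrowing was needed.)
  for _, f := range flavors {
    if f.cqUsed+request <= f.cqMax && f.cohortUsed+request <= f.cohortTotal {
      return f.name, modeFit
    }
  }
  // Step 1: a flavor whose min quota can contain the request; admitting here
  // requires preempting or waiting for other Workloads.
  for _, f := range flavors {
    if request <= f.cqMin {
      return f.name, modePreempt
    }
  }
  return "", modeNoFit
}
```
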
#### Sorting Workloads that are heads of ClusterQueues

Sorting uses the following criteria:

1. Flavor assignments that don't borrow first.
2. [New criterion] Highest priority first.
3. Older creation timestamp first.

Note that these criteria might put Workloads that require preemption ahead,
because preemption doesn't require borrowing more resources. This is desired,
because preemption to recover quota or to admit high priority Workloads takes
preference over borrowing.

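A sketch of the resulting comparator over queue heads (the `candidate` type is
an illustrative stand-in):

```golang
import "time"

type candidate struct {
  borrows  bool // whether the flavor assignment needs to borrow
  priority int32
  created  time.Time
}

// headLess orders the heads of the ClusterQueues for admission.
func headLess(a, b candidate) bool {
  if a.borrows != b.borrows {
    return !a.borrows // 1. assignments that don't borrow first
  }
  if a.priority != b.priority {
    return a.priority > b.priority // 2. higher priority first
  }
  return a.created.Before(b.created) // 3. older creation timestamp first
}
```
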
#### Admission

When iterating over workloads to be admitted, in the order given by the
previous section, we disallow borrowing in the cohort for the rest of the
current cycle after evaluating a Workload that doesn't require borrowing. This
is the same behavior that we have today, but note that this criterion now
includes Workloads that need preemption, because there is no preemption while
borrowing quota.

This guarantees that, in future cycles, we can admit Workloads that were not
heads of their ClusterQueues in this cycle, but could fit without borrowing in
the next cycle, before lending quota to other ClusterQueues.

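The rule can be sketched as follows (types are simplified stand-ins; `admit`
represents the rest of the admission logic):

```golang
type cohortState struct {
  borrowingDisallowed bool
}

type admissionEntry struct {
  borrows bool // whether the flavor assignment requires borrowing
  cohort  *cohortState
}

func admitHeads(sortedHeads []admissionEntry, admit func(admissionEntry)) {
  for _, e := range sortedHeads {
    if e.borrows && e.cohort.borrowingDisallowed {
      continue // retry in a future cycle
    }
    if !e.borrows {
      // Includes Workloads that need preemption, since preemption
      // never borrows quota.
      e.cohort.borrowingDisallowed = true
    }
    admit(e)
  }
}
```
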
In the past, we only disallowed borrowing in the cohort if we were able to
admit the Workload, because we only kept track of flavor assignments of type 0.
This caused ClusterQueues in the cohort to continue borrowing quota, even if
there were pending Workloads that would fit under the min quota of their
ClusterQueues.

It is actually possible to limit borrowing within the cohort only for the
flavors used by the evaluated Workloads, instead of restricting borrowing for
all the flavors in the cohort. But we will leave this as a possible future
optimization to improve throughput.

#### Preemption

For each Workload that got flavor assignments where preemption could help,
we run the following algorithm:

1. Check whether preemption is allowed and could help.

   We skip preemption if `.preemption.withinCohort=Never` and
   `.preemption.withinClusterQueue=Never`.

2. Obtain a list of candidate Workloads to be preempted.

   1. In the cohort, we only consider ClusterQueues that are currently
      borrowing quota. We restrict the list to Workloads with lower priority
      than the pending Workload if
      `.preemption.withinCohort=ReclaimFromLowerPriority`.
   2. In the ClusterQueue, we only select Workloads with lower priority than
      the pending Workload.

   To quickly list workloads with priority lower than the incoming workload,
   we can keep a priority queue with the priorities of active Workloads in the
   ClusterQueue.

   When going over these sets, we filter out the Workloads that are not using
   the flavors that were selected for the incoming Workload.

   If the list of candidates is empty, skip the rest of the preemption
   algorithm.

3. Sort the Workloads using the following criteria:
   1. Workloads from other ClusterQueues in the cohort first.
   2. Lower priority first.
   3. Shortest running time first.

4. Remove Workloads from the snapshot in the order of the list. Stop removing
   Workloads once the incoming Workload fits within the quota. Skip removing
   more Workloads from a ClusterQueue if its usage is already below its `min`
   quota for all the involved flavors.

   The set of removed Workloads is a sufficient, but possibly larger than
   necessary, set of Workloads to preempt.

5. In the reverse order of the Workloads that were removed, add Workloads back
   as long as the incoming Workload still fits. This gives us a minimal set
   of Workloads to preempt.

6. Preempt the Workloads by clearing `.spec.admission`.
   The Workloads will be requeued by the Workload event handler.

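Steps 4 and 5 can be sketched as follows, over a single-resource snapshot
(per-flavor filtering and the per-ClusterQueue `min` quota check are omitted
for brevity):

```golang
type victim struct {
  name  string
  usage int64
}

type snapshot struct {
  used, capacity, incoming int64
}

func (s *snapshot) fits() bool { return s.used+s.incoming <= s.capacity }

// minimalVictims removes candidates (already sorted by the criteria in
// step 3) until the incoming Workload fits, then re-adds them in reverse
// order while it still fits, yielding a minimal set to preempt.
func minimalVictims(candidates []victim, s *snapshot) []victim {
  var removed []victim
  for _, w := range candidates {
    if s.fits() {
      break
    }
    s.used -= w.usage
    removed = append(removed, w)
  }
  if !s.fits() {
    return nil // preemption cannot help
  }
  var victims []victim
  for i := len(removed) - 1; i >= 0; i-- {
    s.used += removed[i].usage
    if !s.fits() {
      s.used -= removed[i].usage
      victims = append(victims, removed[i])
    }
  }
  return victims
}
```
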
The incoming Workload is not admitted in this cycle. It is requeued and will be
admitted once the changes to the victim Workloads are observed and updated in
the cache.

### Test Plan

<!--
**Note:** *Not required until targeted at a release.*
The goal is to ensure that we don't accept enhancements with inadequate testing.

All code is expected to have adequate tests (eventually with coverage
expectations). Please adhere to the [Kubernetes testing guidelines][testing-guidelines]
when drafting this test plan.

[testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md
-->

[x] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

##### Prerequisite testing updates

- Need to improve coverage of `pkg/queue` up to at least 80%.

#### Unit Tests

<!--
In principle every added code should have complete unit test coverage, so providing
the exact set of tests will not bring additional value.
However, if complete unit test coverage is not possible, explain the reason of it
together with explanation why this is acceptable.
-->

<!--
Additionally, try to enumerate the core package you will be touching
to implement this enhancement and provide the current unit coverage for those
in the form of:
- <package>: <date> - <current test coverage>

This can inform certain test coverage improvements that we want to do before
extending the production code to implement this enhancement.
-->

- `apis/kueue/webhooks`: `2022-11-17` - `72%`
- `pkg/cache`: `2022-11-17` - `83%`
- `pkg/scheduler`: `2022-11-17` - `91%`
- `pkg/queue`: `2022-11-17` - `62%`

#### Integration tests

- No new Workloads in the cohort can borrow when there are running Workloads
  and pending Workloads in a ClusterQueue that fit within their min quota
  (StrictFIFO and BestEffortFIFO).
- Preemption within a ClusterQueue based on priority.
- Preemption within a cohort to reclaim min quota.

#### E2E tests

- Preemption within a ClusterQueue based on priority.

<!--
Describe what tests will be added to ensure proper quality of the enhancement.

After the implementation PR is merged, add the names of the tests here.
-->

### Graduation Criteria

<!--

Clearly define what it means for the feature to be implemented and
considered stable.

If the feature you are introducing has high complexity, consider adding graduation
milestones with these graduation criteria:
- [Maturity levels (`alpha`, `beta`, `stable`)][maturity-levels]
- [Feature gate][feature gate] lifecycle
- [Deprecation policy][deprecation-policy]

[feature gate]: https://git.k8s.io/community/contributors/devel/sig-architecture/feature-gates.md
[maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions
[deprecation-policy]: https://kubernetes.io/docs/reference/using-api/deprecation-policy/
-->

N/A

## Implementation History

<!--
Major milestones in the lifecycle of a KEP should be tracked in this section.
Major milestones might include:
- the `Summary` and `Motivation` sections being merged, signaling SIG acceptance
- the `Proposal` section being merged, signaling agreement on a proposed design
- the date implementation started
- the first Kubernetes release where an initial version of the KEP was available
- the version of Kubernetes where the KEP graduated to general availability
- when the KEP was retired or superseded
-->

1. 2022-09-19: First draft, included multiple knobs.
2. 2022-11-17: Complete proposal with minimal API.

## Drawbacks

<!--
Why should this KEP _not_ be implemented?
-->

Preemption is costly to calculate. However, it's a highly demanded feature.
The API allows preemption to be opt-in.

## Alternatives

The following APIs were initially proposed to enhance the control over
preemption, but they were left out of this KEP for lack of strong use cases.

We might add them back in the future, based on feedback.

### Allow high priority jobs to borrow quota while preempting

The proposed policies for preemption within the cohort require that the
Workload fits within the min quota of the ClusterQueue. In other words, we
don't try to borrow quota when preempting.

It might be desired for higher priority workloads to preempt lower priority
workloads that are borrowing resources, even if it makes the ClusterQueue
borrow resources. This could be added as `.preemption.withinCohort=LowerPriority`.

The implementation could be as follows:

For each ClusterQueue, we consider the usage as the maximum of the min quota
and the actual used quota. Then, we select flavors for the pending workload
based on this simulated usage and run the preemption algorithm.

**Reasons for discarding/deferring**

It's unclear whether this behavior is useful, and it adds complexity.

### Inform how costly it is to interrupt a Workload

A workload might have a known cost of interruption that varies over time.
For example: early in its execution, the Workload hasn't made much progress,
so it can be preempted. Later, the Workload is on the path to making
significant progress, so it's best not to disturb it. Lastly, the Workload is
expected to have made some checkpoints, so it's ok to disturb it.

This could be expressed with the following configuration:

```yaml
apiVersion: kueue.x-k8s.io/v1alpha2
kind: Workload
metadata:
  name: my-workload
spec:
  preemption:
    disruptionCostMilestones:
    - seconds: 60
      cost: 100
    - seconds: 600
      cost: 0
```

The cost is a linear interpolation of the configuration above. A graphical
representation of the cost looks like the following (not to scale):

```
cost

100      __
        /  \___
       /       \___
      /            \___
0    /                 \_
    0    60             600  time
```

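A sketch of the interpolation, assuming an implicit starting milestone of
cost 0 at time 0 and a flat cost after the last milestone (both read off the
chart above):

```golang
type milestone struct {
  seconds int64
  cost    int64
}

// disruptionCost linearly interpolates between consecutive milestones,
// which are assumed to be sorted by strictly increasing seconds.
func disruptionCost(milestones []milestone, elapsedSeconds int64) float64 {
  prev := milestone{seconds: 0, cost: 0}
  for _, m := range milestones {
    if elapsedSeconds <= m.seconds {
      frac := float64(elapsedSeconds-prev.seconds) / float64(m.seconds-prev.seconds)
      return float64(prev.cost) + frac*float64(m.cost-prev.cost)
    }
    prev = m
  }
  return float64(prev.cost)
}
```

With the milestones from the example, `disruptionCost` returns 50 at 30
seconds (ramping up) and 50 again at 330 seconds (ramping down).
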
As a cluster administrator, I can configure default `disruptionCostMilestones`
for certain workloads using webhooks, or set them for all Workloads in a
LocalQueue.

**Reasons for discarding/deferring**

- Users could be incentivized to increase their cost.
- Administrators might not be able to set a default that fits all users.
- The use case in [#83](https://github.com/kubernetes-sigs/kueue/issues/83#issuecomment-1224602577)
  is mostly covered by `ClusterQueue.spec.waitBeforePreemptionSeconds`.

A better approach would be for the workload to actively publish the cost of
interrupting it, but this is an ongoing discussion upstream
([kubernetes/kubernetes#107598](https://issues.k8s.io/107598)).

### Penalizing long running workloads

A variant of the concept of cost to interrupt a workload is a penalty for
Workloads that have been running for a long time, for example, by allowing them
to be preempted by pending Workloads of the same priority after some time.

One way this could be implemented is by introducing a concept of dynamic
priority: the priority of a Workload could increase while it stays pending for
a long time, or it could be reduced as the Workload keeps running.

**Reasons for discarding/deferring**

This can be implemented separately from the preemption APIs and algorithm, with
specialized APIs to control priority. So it can be left for a different KEP.

### Terminating Workloads on preemption

For some Workloads, it's not desired to restart them after preemption without
some manual intervention or verification (for example, interactive jobs).

This behavior could be configured like this:

```yaml
apiVersion: kueue.x-k8s.io/v1alpha2
kind: Workload
metadata:
  name: my-workload
spec:
  onPreemption: Terminate # OR Requeue (default)
```

**Reasons for discarding/deferring**

There is no clean mechanism to terminate a Job and all its running Pods.
There are two means to terminate all running Pods of a Job, but they have
some problems:

1. Delete the Job. The Pods will be deleted (gracefully) in cascade.

   This could mean loss of information for the end-user, unless they have a
   finalizer on the Job. In a sense, it violates
   [`ttlSecondsAfterFinished`](https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/).

2. Just suspend the Job.

   This option leaves a Job that is not finished, so
   [`ttlSecondsAfterFinished`](https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/)
   couldn't clean it up.

   Simply adding a `Failed` condition after suspending the Job could leave its
   Pods running indefinitely if the Kubernetes job controller doesn't have a
   chance to delete all the Pods based on the `.spec.suspend` field.

One possibility is to insert the `FailureTarget` condition in the Job status,
introduced by [KEP#3329](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3329-retriable-and-non-retriable-failures)
for a different purpose.

Perhaps we should have an explicit upstream API for this behavior; similar
work would be needed for workload CRDs.

### Extra knobs in ClusterQueue preemption policy

These extra knobs could enhance the control over preemption:

```golang
type ClusterQueuePreemption struct {
  // triggerAfterWorkloadWaitingSeconds is the time in seconds that Workloads
  // in this ClusterQueue will wait before triggering preemptions of active
  // workloads in this ClusterQueue or its cohort (when reclaiming quota).
  //
  // The time is measured from the first time the workload was attempted for
  // admission. This value is present as the `transitionTimestamp` of the
  // Admitted condition, with status=False.
  TriggerAfterWorkloadWaitingSeconds int64

  // workloadSorting determines how Workloads from the cohort that are
  // candidates for preemption are sorted.
  // Sorting happens at the time when a Workload in this ClusterQueue is
  // evaluated for admission. All the Workloads in the cohort are sorted based
  // on the criteria defined in the preempting ClusterQueue.
  // workloadSorting is a list of comparison criteria between two Workloads
  // that are evaluated in order.
  // Possible criteria are:
  // - ByLowestPriority: Prefer to preempt the Workload with lower priority.
  // - ByLowestRuntime: Prefer to preempt the Workload that started more
  //   recently.
  // - ByLongestRuntime: Prefer to preempt the Workload that started earlier.
  //
  // If empty, the behavior is equivalent to
  // [ByLowestPriority, ByLowestRuntime].
  WorkloadSorting []WorkloadSortingCriteria
}

type WorkloadSortingCriteria string

const (
  ComparisonByLowestPriority WorkloadSortingCriteria = "ByLowestPriority"
  ComparisonByLowestRuntime  WorkloadSortingCriteria = "ByLowestRuntime"
  ComparisonByLongestRuntime WorkloadSortingCriteria = "ByLongestRuntime"
)
```

The proposed field `ClusterQueue.spec.preemption.triggerAfterWorkloadWaitingSeconds`
can be interpreted in two ways:
1. **How long jobs are willing to wait**.
   This shouldn't be problematic. The field can be configured based purely on
   the importance of the Workloads served by the preempting ClusterQueue.
2. **The characteristics of the workloads in the cohort**; for example, how long
   they take to finish or how often they perform checkpointing, on average.
   This implies that all workloads in the cohort have similar characteristics
   and all the ClusterQueues in the cohort should have the same wait period.

This caveat should be part of the documentation as a best practice for how to
set up the field.

**Reasons for discarding/deferring**

The usefulness of the field `triggerAfterWorkloadWaitingSeconds` is somewhat
questionable when the ClusterQueue is saturated (all the workloads require
preemption). If the ClusterQueue is in `BestEffortFIFO` mode, it's possible
that all the elements will trigger preemption once the deadline for at least
one Workload has passed.

For simplicity of the API, we will start with implicit sorting rules.