sigs.k8s.io/kueue@v0.6.2/keps/1337-preempt-within-cohort-while-borrowing/README.md (about)

     1  # KEP-1337: Preempt within cohort while borrowing
     2  
     3  <!--
     4  This is the title of your KEP. Keep it short, simple, and descriptive. A good
     5  title can help communicate what the KEP is and should be considered as part of
     6  any review.
     7  -->
     8  
     9  <!--
    10  A table of contents is helpful for quickly jumping to sections of a KEP and for
    11  highlighting any additional information provided beyond the standard KEP
    12  template.
    13  
    14  Ensure the TOC is wrapped with
    15    <code>&lt;!-- toc --&rt;&lt;!-- /toc --&rt;</code>
    16  tags, and then generate with `hack/update-toc.sh`.
    17  -->
    18  
    19  <!-- toc -->
    20  - [Summary](#summary)
    21  - [Motivation](#motivation)
    22    - [Goals](#goals)
    23    - [Non-Goals](#non-goals)
    24  - [Proposal](#proposal)
    25    - [User Stories (Optional)](#user-stories-optional)
    26      - [Story 1](#story-1)
    27    - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
    28    - [Risks and Mitigations](#risks-and-mitigations)
    29      - [Flopping between preempting pair of workloads](#flopping-between-preempting-pair-of-workloads)
    30  - [Design Details](#design-details)
    31    - [API](#api)
    32    - [Implementation overview](#implementation-overview)
    33    - [Test Plan](#test-plan)
    34        - [Prerequisite testing updates](#prerequisite-testing-updates)
    35      - [Unit Tests](#unit-tests)
    36      - [Integration tests](#integration-tests)
    37    - [Graduation Criteria](#graduation-criteria)
    38      - [Beta](#beta)
    39      - [GA](#ga)
    40  - [Implementation History](#implementation-history)
    41  - [Drawbacks](#drawbacks)
    42  - [Alternatives](#alternatives)
    43    - [Reuse PreemptionPolicy enum](#reuse-preemptionpolicy-enum)
    44    - [Minimal API](#minimal-api)
    45    - [ReclaimWithinCohortConfig](#reclaimwithincohortconfig)
    46    - [Move ReclaimWithinCohort under ReclaimWithinCohortConfig](#move-reclaimwithincohort-under-reclaimwithincohortconfig)
    47  <!-- /toc -->
    48  
    49  ## Summary
    50  
    51  This KEP introduces ClusterQueue-level API to support workload preemption
    52  to reclaim quota within cohort for a workload that requires borrowing.
    53  
    54  <!--
    55  This section is incredibly important for producing high-quality, user-focused
    56  documentation such as release notes or a development roadmap. It should be
    57  possible to collect this information before implementation begins, in order to
    58  avoid requiring implementors to split their attention between writing release
    59  notes and implementing the feature itself. KEP editors and SIG Docs
    60  should help to ensure that the tone and content of the `Summary` section is
    61  useful for a wide audience.
    62  
    63  A good summary is probably at least a paragraph in length.
    64  
    65  Both in this section and below, follow the guidelines of the [documentation
    66  style guide]. In particular, wrap lines to a reasonable length, to make it
    67  easier for reviewers to cite specific portions, and to minimize diff churn on
    68  updates.
    69  
    70  [documentation style guide]: https://github.com/kubernetes/community/blob/master/contributors/guide/style-guide.md
    71  -->
    72  
    73  ## Motivation
    74  
    75  Currently, there is no support for preemption to reclaim quota within cohort
    76  for workloads which require borrowing. This is too constraining for some setups.
    77  
    78  <!--
    79  This section is for explicitly listing the motivation, goals, and non-goals of
    80  this KEP.  Describe why the change is important and the benefits to users. The
    81  motivation section can optionally provide links to [experience reports] to
    82  demonstrate the interest in a KEP within the wider Kubernetes community.
    83  
    84  [experience reports]: https://github.com/golang/go/wiki/ExperienceReports
    85  -->
    86  
    87  ### Goals
    88  
    89  - support preemption to reclaim quota within cohort for workloads that require borrowing.
    90  - support preemption within cluster queue for workloads that require borrowing.
    91  - modify the workload ordering by scheduler to take the need to preempt into account.
    92  
    93  <!--
    94  List the specific goals of the KEP. What is it trying to achieve? How will we
    95  know that this has succeeded?
    96  -->
    97  
    98  ### Non-Goals
    99  
   100  - allow preemption of workloads, from other ClusterQueues, which are not borrowing.
   101  
   102  <!--
   103  What is out of scope for this KEP? Listing non-goals helps to focus discussion
   104  and make progress.
   105  -->
   106  
   107  ## Proposal
   108  
   109  <!--
   110  This is where we get down to the specifics of what the proposal actually is.
   111  This should have enough detail that reviewers can understand exactly what
   112  you're proposing, but should not include things like API designs or
   113  implementation. What is the desired outcome and how do we measure success?.
   114  The "Design Details" section below is for the real
   115  nitty-gritty.
   116  -->
   117  
   118  Introduce a new ClusterLevel API to support preemption while borrowing. The
   119  API proposes additional configuration to the `.preemption.reclaimWithinCohort`
   120  option.
   121  
   122  ### User Stories (Optional)
   123  
   124  <!--
   125  Detail the things that people will be able to do if this KEP is implemented.
   126  Include as much detail as possible so that people can understand the "how" of
   127  the system. The goal here is to make this feel real for users without getting
   128  bogged down.
   129  -->
   130  
   131  #### Story 1
   132  
   133  As an administrator of Kueue in my organization I manage the following setup
   134  with two teams, `A` and `B`. There is a single cohort, and both teams have a
   135  pair of ClusterQueues in the cohort, called `Standard` and `Best-Effort`. There
   136  is also one ClusterQueue, called `Shared`, which is shared between the teams.
   137  
   138  All the `Standard` and `Best-Effort` ClusterQueues have `nominalQuota=0`, and
   139  `borrowingLimit>0`. The actual quota is exposed by the `Shared` ClusterQueue.
   140  The `Shared` ClusterQueue has `nominalQuota>0`, and `borrowingLimit=0`. This
   141  setup is depicted by the table:
   142  
   143  |                 | A Standard    | A Best-Effort | B Standard    | B Best-Effort | Shared        |
   144  |-----------------|---------------|---------------|---------------|---------------|---------------|
   145  |nominalQuota     | 0             | 0             | 0             | 0             | 100           |
   146  |borrowingLimit   | 100           | 100           | 100           | 100           | 0             |
   147  
   148  The teams send their high-priority (say `>200`) jobs to their `Standard`
   149  ClusterQueue, and the low-priority jobs (say `<100`) to their `Best-Effort`
   150  ClusterQueue. Jobs sent to the `Standard` ClusterQueues should not be preempted.
   151  Jobs in the `Best-Effort` ClusterQueues can be preempted with respect to
   152  priorities.
   153  
   154  Note that in this setup every workload is borrowing, so I need to make preemption
   155  while borrowing enabled. Additionally, I need a mechanism to say that workloads
   156  from the `Standard` ClusterQueues are not preempted. A mechanism to only preempt
   157  jobs with priorities below a specified threshold will suffice.
   158  
   159  ### Notes/Constraints/Caveats (Optional)
   160  
   161  <!--
   162  What are the caveats to the proposal?
   163  What are some important details that didn't come across above?
   164  Go in to as much detail as necessary here.
   165  This might be a good place to talk about core concepts and how they relate.
   166  -->
   167  
   168  ### Risks and Mitigations
   169  
   170  #### Flopping between preempting pair of workloads
   171  
   172  In a setup as described in (Story 1)[#story-1] consider there are two workloads,
   173  one in ClusterQueue A, and one in ClusterQueue B. There is not enough resources
   174  in the cohort to run both workloads at the same time.
   175  
   176  If the preemption is allowed for workloads of any priorities
   177  (as when `.preemption.reclaimWithinCohort` equals `Any`)
   178  then the following flopping happens: workload A preempts workload B, then
   179  preempted workload B returns to the queue and preempts workload A to make progress.
   180  This repeats, and none of the workloads can make any progress.
   181  
   182  In order to mitigate this risk we only enable the preemption to reclaim
   183  quota within cohort, for workloads requiring borrowing when such preempted
   184  workloads are of lower priority. This behavior is enabled by setting
   185  `.preemption.borrowWithinCohort.policy: LowerPriority`.
   186  
   187  <!--
   188  What are the risks of this proposal, and how do we mitigate? Think broadly.
   189  For example, consider both security and how this will impact the larger
   190  Kubernetes ecosystem.
   191  
   192  How will security be reviewed, and by whom?
   193  
   194  How will UX be reviewed, and by whom?
   195  
   196  Consider including folks who also work outside the SIG or subproject.
   197  -->
   198  
   199  ## Design Details
   200  
   201  ### API
   202  
   203  ```golang
   204  // ClusterQueuePreemption contains policies to preempt Workloads from this
   205  // ClusterQueue or the ClusterQueue's cohort.
   206  type ClusterQueuePreemption struct {
   207    ...
   208  	// borrowWithinCohort provides configuration to allow preemption within
   209  	// cohort while borrowing. This configuration can only be enabled when
   210  	// reclaimWithinCohort is not `Never`.
   211  	BorrowWithinCohort *BorrowWithinCohort `json:"borrowWithinCohort,omitempty"`
   212    ...
   213  }
   214  
   215  type BorrowWithinCohortPolicy string
   216  
   217  const (
   218  	BorrowWithinCohortPolicyNever         BorrowWithinCohortPolicy = "Never"
   219  	BorrowWithinCohortPolicyLowerPriority BorrowWithinCohortPolicy = "LowerPriority"
   220  )
   221  
   222  // BorrowWithinCohort contains configuration which allows to preempt workloads
   223  // within cohort while borrowing.
   224  type BorrowWithinCohort struct {
   225  	// policy determines the policy for preemption to reclaim quota within cohort while borrowing.
   226  	// Possible values are:
   227  	// - `Never` (default): do not allow for preemption, in other
   228  	//    ClusterQueues within the cohort, for a borrowing workload.
   229  	// - `LowerPriority`: allow preemption, in other ClusterQueues
   230  	//    within the cohort, for a borrowing workload, but only if
   231  	//    the preempted workloads are of lower priority.
   232  	//
   233  	// +kubebuilder:default=Never
   234  	// +kubebuilder:validation:Enum=Never;LowerPriority
   235  	Policy BorrowWithinCohortPolicy `json:"policy,omitempty"`
   236  
   237  	// maxPriorityThreshold allows to restrict the set of workloads which
   238  	// might be preempted by a borrowing workload, to only workloads with
   239  	// priority below or equal the specified level.
   240  	//
   241  	// +optional
   242  	MaxPriorityThreshold *int32 `json:"maxPriorityThreshold,omitempty"`
   243  }
   244  ```
   245  
   246  ### Implementation overview
   247  
   248  There are main modifications:
   249  - in the [`fitsResourceQuota`](https://github.com/kubernetes-sigs/kueue/blob/main/pkg/scheduler/flavorassigner/flavorassigner.go#L551)
   250    we return `FlavorAssignmentMode=Preempt` when the workload does not fit, but
   251    there is enough quota within the cohort
   252  - modify the [`GetTargets`](https://github.com/kubernetes-sigs/kueue/blob/53d4b657655aff5df3748e65b4226cf11e50eeba/pkg/scheduler/preemption/preemption.go#L77)
   253  and the [`minimalPreemptions`](https://github.com/kubernetes-sigs/kueue/blob/53d4b657655aff5df3748e65b4226cf11e50eeba/pkg/scheduler/preemption/preemption.go#L159) functions
   254  to only, if borrowing, include candidates which satisfy the new API's constrains.
   255  
   256  Additionally, we add a webhook validation in `clusterqueue_webhook` to
   257  verify that `.preemption.reclaimWithinCohort` is not `Never`, when
   258  `.preemption.borrowWithinCohort` is enabled.
   259  
   260  <!--
   261  This section should contain enough information that the specifics of your
   262  change are understandable. This may include API specs (though not always
   263  required) or even code snippets. If there's any ambiguity about HOW your
   264  proposal will be implemented, this is the place to discuss them.
   265  -->
   266  
   267  ### Test Plan
   268  
   269  <!--
   270  **Note:** *Not required until targeted at a release.*
   271  The goal is to ensure that we don't accept enhancements with inadequate testing.
   272  
   273  All code is expected to have adequate tests (eventually with coverage
   274  expectations). Please adhere to the [Kubernetes testing guidelines][testing-guidelines]
   275  when drafting this test plan.
   276  
   277  [testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md
   278  -->
   279  
   280  [x] I/we understand the owners of the involved components may require updates to
   281  existing tests to make this code solid enough prior to committing the changes necessary
   282  to implement this enhancement.
   283  
   284  ##### Prerequisite testing updates
   285  
   286  <!--
   287  Based on reviewers feedback describe what additional tests need to be added prior
   288  implementing this enhancement to ensure the enhancements have also solid foundations.
   289  -->
   290  
   291  #### Unit Tests
   292  
   293  <!--
   294  In principle every added code should have complete unit test coverage, so providing
   295  the exact set of tests will not bring additional value.
   296  However, if complete unit test coverage is not possible, explain the reason of it
   297  together with explanation why this is acceptable.
   298  -->
   299  
   300  <!--
   301  Additionally, try to enumerate the core package you will be touching
   302  to implement this enhancement and provide the current unit coverage for those
   303  in the form of:
   304  - <package>: <date> - <current test coverage>
   305  
   306  This can inform certain test coverage improvements that we want to do before
   307  extending the production code to implement this enhancement.
   308  -->
   309  
   310  - `pkg/scheduler/flavorassigner`: `7 Dec 2023` - `79.2%` (to indicate the workload which needs to borrow, that it can also preempt)
   311  - `pkg/scheduler/preemption`: `7 Dec 2023` - `94%` (to update the `GetTargets` function, identifying the set of workloads to preempt)
   312  
   313  #### Integration tests
   314  
   315  We are going to cover the following scenarios:
   316  - in the setup described in the [Story 1](#story-1), verify:
   317    - a workload which require borrowing can preempt a lower priority workload
   318    - a workload which require borrowing cannot preempt a higher priority workload
   319    - a workload which require borrowing cannot preempt a lower priority (but above the threshold) workload.
   320    - a workload which require borrowing cannot preempt a lower priority workload if the workload isn't borrowing.
   321  
   322  More scenarios can be considered based on the actual implementation.
   323  
   324  <!--
   325  Describe what tests will be added to ensure proper quality of the enhancement.
   326  
   327  After the implementation PR is merged, add the names of the tests here.
   328  -->
   329  
   330  ### Graduation Criteria
   331  
   332  <!--
   333  
   334  Clearly define what it means for the feature to be implemented and
   335  considered stable.
   336  
   337  If the feature you are introducing has high complexity, consider adding graduation
   338  milestones with these graduation criteria:
   339  - [Maturity levels (`alpha`, `beta`, `stable`)][maturity-levels]
   340  - [Feature gate][feature gate] lifecycle
   341  - [Deprecation policy][deprecation-policy]
   342  
   343  [feature gate]: https://git.k8s.io/community/contributors/devel/sig-architecture/feature-gates.md
   344  [maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions
   345  [deprecation-policy]: https://kubernetes.io/docs/reference/using-api/deprecation-policy/
   346  -->
   347  
   348  #### Beta
   349  
   350  - the feature is implemented and documented as opt-in
   351  
   352  #### GA
   353  
   354  - all reported bugs are fixed
   355  
   356  ## Implementation History
   357  
   358  <!--
   359  Major milestones in the lifecycle of a KEP should be tracked in this section.
   360  Major milestones might include:
   361  - the `Summary` and `Motivation` sections being merged, signaling SIG acceptance
   362  - the `Proposal` section being merged, signaling agreement on a proposed design
   363  - the date implementation started
   364  - the first Kubernetes release where an initial version of the KEP was available
   365  - the version of Kubernetes where the KEP graduated to general availability
   366  - when the KEP was retired or superseded
   367  -->
   368  
   369  ## Drawbacks
   370  
   371  Increased complexity of the preemption logic in Kueue, which is already complex.
   372  
   373  <!--
   374  Why should this KEP _not_ be implemented?
   375  -->
   376  
   377  ## Alternatives
   378  
   379  ### Reuse PreemptionPolicy enum
   380  
   381  ```golang
   382  type ClusterQueuePreemption struct {
   383    ...
   384  	ReclaimWithinCohortConfig *ReclaimWithinCohortConfig `json:"reclaimWithCohort,omitempty"`
   385    ...
   386  }
   387  
   388  type ReclaimWithinCohortConfig struct {
   389  	// +kubebuilder:default=Never
   390  	// +kubebuilder:validation:Enum=Never;LowerPriority
   391  	WhileBorrowingPolicy PreemptionPolicy `json:"whileBorrowingPolicy,omitempty"`
   392  
   393  	// +optional
   394  	PriorityThreshold *int32 `json:"priorityThreshold,omitempty"`
   395  }
   396  ```
   397  
   398  **Reasons for discarding/deferring**
   399  
   400  The existing type has some extra options, such as `Any` which would need to be
   401  rejected by validation, this is not ideal from user's perspective.
   402  
   403  ### Minimal API
   404  
   405  Since we only have one policy `LowerPriority`, the minimal API would be just
   406  to have a boolean toggle with the `Enable` field. For example:
   407  
   408  ```golang
   409  type ReclaimWithinCohortConfig struct {
   410  	// +kubebuilder:default=false
   411  	EnableWhileBorrowing bool `json:"enableWhileBorrowing,omitempty"`
   412  
   413  	// +optional
   414  	PriorityThreshold *int32 `json:"priorityThreshold,omitempty"`
   415  }
   416  ```
   417  
   418  Pros:
   419  - simpler, yet provides the same functionality
   420  
   421  **Reasons for discarding/deferring**
   422  
   423  The `EnableWhileBorrowing` field might become redundant if we introduce the `Policy` field in the future.
   424  
   425  There might be a need to introduce `Any` option as for other preemption modes in the future.
   426  Note that, the same risk as [Flopping between preempting pair of workloads](#flopping-between-preempting-pair-of-workloads)
   427  would also applies when `preemption.withinClusterQueue: Any`, yet we offer `Any`
   428  as an option.
   429  
   430  ### ReclaimWithinCohortConfig
   431  
   432  ```golang
   433  type ClusterQueuePreemption struct {
   434    ...
   435  	ReclaimWithinCohortConfig *ReclaimWithinCohortConfig `json:"reclaimWithCohort,omitempty"`
   436    ...
   437  }
   438  
   439  type ReclaimWithinCohortWhileBorrowingPolicy string
   440  
   441  const (
   442  	ReclaimWithinCohortWhileBorrowingPolicyNever         ReclaimWithinCohortWhileBorrowingPolicy = "Never"
   443  	ReclaimWithinCohortWhileBorrowingPolicyLowerPriority ReclaimWithinCohortWhileBorrowingPolicy = "LowerPriority"
   444  )
   445  
   446  type ReclaimWithinCohortConfig struct {
   447  	// +kubebuilder:default=Never
   448  	// +kubebuilder:validation:Enum=Never;LowerPriority
   449  	WhileBorrowingPolicy ReclaimWithinCohortWhileBorrowingPolicy `json:"whileBorrowingPolicy,omitempty"`
   450  
   451  	// +optional
   452  	PriorityThreshold *int32 `json:"priorityThreshold,omitempty"`
   453  }
   454  ```
   455  
   456  **Reasons for discarding/deferring**
   457  
   458  The field names for `ReclaimWithinCohortConfig` and `WhileBorrowingPolicy` are lengthy.
   459  
   460  ### Move ReclaimWithinCohort under ReclaimWithinCohortConfig
   461  
   462  We could consider moving the existing enum field `.preemption.reclaimWithinCohort`
   463  under the new configuration structure:
   464  
   465  ```golang
   466  type ReclaimWithinCohortConfig struct {
   467  	// +kubebuilder:default=Never
   468  	// +kubebuilder:validation:Enum=Never;LowerPriority;Any
   469  	Policy PreemptionPolicy `json:"policy,omitempty"`
   470  
   471  	// +kubebuilder:default=Never
   472  	// +kubebuilder:validation:Enum=Never;LowerPriority
   473  	WhileBorrowingPolicy PreemptionPolicy `json:"whileBorrowingPolicy,omitempty"`
   474  
   475  	// +optional
   476  	PriorityThreshold *int32 `json:"priorityThreshold,omitempty"`
   477  }
   478  ```
   479  
   480  This might help to make the API easier to understand by putting the all the
   481  configuration options for `reclaimWithinCohort` together.
   482  
   483  **Reasons for discarding/deferring**
   484  
   485  Kueue API is already Beta, and we would need a process to deprecate the existing
   486  API. It seems that the gain to effort ratio is not good.