sigs.k8s.io/kueue@v0.6.2/keps/582-preempt-based-on-flavor-order/README.md

# KEP-582: Preempt Based On Flavor Order

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories (Optional)](#user-stories-optional)
    - [Story 1](#story-1)
  - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Cluster Queue API](#cluster-queue-api)
  - [Behavior Changes](#behavior-changes)
  - [Implementation](#implementation)
  - [Test Plan](#test-plan)
      - [Prerequisite testing updates](#prerequisite-testing-updates)
    - [Unit Tests](#unit-tests)
    - [Integration tests](#integration-tests)
  - [Graduation Criteria](#graduation-criteria)
- [Implementation History](#implementation-history)
<!-- /toc -->

## Summary

This proposal introduces an opt-in mechanism that lets a workload borrow quota or preempt
other workloads in a flavor before Kueue tries the next flavor in the ClusterQueue.

## Motivation

The order of ResourceFlavors within a ClusterQueue represents the order of preference
for consumption. Jobs with higher priority sometimes prefer to consume resources
from the more preferred ResourceFlavors.

### Goals

- Provide a mechanism that lets high-priority jobs preempt low-priority jobs, or borrow quota, in a
  resource flavor before the scheduler considers the next resource flavor.

### Non-Goals

- Change how Kueue judges whether a podSet can get enough resources in a given resource flavor.
- Change the preemption and admission process.

## Proposal

### User Stories (Optional)

#### Story 1

As a Kueue administrator, I want to ensure that more important jobs run on more
stable resources. For example, my cluster has both standard and spot instances,
and I prefer that my high-priority jobs do not run on spot instances. If
high-priority jobs can preempt jobs on standard instances before trying spot instances,
that stability can be achieved.

This use case is supported by setting `.Spec.FlavorFungibility.WhenCanPreempt` to `Preempt` in the ClusterQueue's spec.

### Notes/Constraints/Caveats (Optional)

### Risks and Mitigations

## Design Details

### Cluster Queue API

We extend the ClusterQueue API with a new field, `flavorFungibility`, to opt in to and configure the new behavior.

For each resource in each podSet, Kueue traverses the resource groups and their resource flavors to find a flavor that is currently available. When a flavor has insufficient resources, Kueue chooses between preempting or borrowing in that flavor and moving on to the next flavor, based on the configured policy.

```
// FlavorFungibilityPolicy names the action to take when a workload cannot
// simply fit in the current flavor.
type FlavorFungibilityPolicy string

const (
	Borrow        FlavorFungibilityPolicy = "Borrow"
	Preempt       FlavorFungibilityPolicy = "Preempt"
	TryNextFlavor FlavorFungibilityPolicy = "TryNextFlavor"
)

type FlavorFungibility struct {
	// WhenCanBorrow applies when the workload fits in the current flavor
	// only by borrowing.
	// +kubebuilder:validation:Enum={Borrow,TryNextFlavor}
	WhenCanBorrow FlavorFungibilityPolicy `json:"whenCanBorrow"`
	// WhenCanPreempt applies when the workload fits in the current flavor
	// only by preempting.
	// +kubebuilder:validation:Enum={Preempt,TryNextFlavor}
	WhenCanPreempt FlavorFungibilityPolicy `json:"whenCanPreempt"`
}

// ClusterQueueSpec defines the desired state of ClusterQueue
type ClusterQueueSpec struct {
	// ...
	FlavorFungibility FlavorFungibility `json:"flavorFungibility"`
}
```

If `flavorFungibility` is nil in the configuration, we set `WhenCanBorrow` to `Borrow` and `WhenCanPreempt` to `TryNextFlavor`, which preserves the current behavior.

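The defaulting rule above can be sketched as follows. This is a minimal, self-contained illustration rather than Kueue's actual defaulting code: the helper name `withDefaults` is hypothetical, and filling in individual empty fields is an assumption (the text above only covers the nil case).

```go
package main

import "fmt"

type FlavorFungibilityPolicy string

const (
	Borrow        FlavorFungibilityPolicy = "Borrow"
	Preempt       FlavorFungibilityPolicy = "Preempt"
	TryNextFlavor FlavorFungibilityPolicy = "TryNextFlavor"
)

type FlavorFungibility struct {
	WhenCanBorrow  FlavorFungibilityPolicy
	WhenCanPreempt FlavorFungibilityPolicy
}

// withDefaults is a hypothetical helper. A nil input yields the values that
// keep the current behavior: borrow in the current flavor, but try the next
// flavor before preempting.
func withDefaults(ff *FlavorFungibility) FlavorFungibility {
	out := FlavorFungibility{WhenCanBorrow: Borrow, WhenCanPreempt: TryNextFlavor}
	if ff == nil {
		return out
	}
	// Assumption: an unset field falls back to its default individually.
	if ff.WhenCanBorrow != "" {
		out.WhenCanBorrow = ff.WhenCanBorrow
	}
	if ff.WhenCanPreempt != "" {
		out.WhenCanPreempt = ff.WhenCanPreempt
	}
	return out
}

func main() {
	fmt.Println(withDefaults(nil))                                         // {Borrow TryNextFlavor}
	fmt.Println(withDefaults(&FlavorFungibility{WhenCanPreempt: Preempt})) // {Borrow Preempt}
}
```

The second call corresponds to the administrator in Story 1, who only overrides `WhenCanPreempt`.
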
### Behavior Changes

We do not change how Kueue judges whether a podSet can get enough resources in a given resource flavor, and preemption and admission are not affected either. We only change the order in which the flavors are considered.

After we try to schedule a podSet in a resource flavor, we decide whether to move on to the next flavor based on `flavorFungibility`. If the assignment mode is `NoFit`, we always try the next flavor, until the last one. If the assignment mode is `Preempt`, we return the current assignment when `WhenCanPreempt` is `Preempt`; otherwise we try the next flavor. If the assignment mode is `Fit`, we try the next flavor only when the current flavor requires borrowing and `WhenCanBorrow` is `TryNextFlavor`.

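The decision rule in the previous paragraph can be written as a small standalone function. The sketch below is illustrative only: the types are local stand-ins, and the `Policy` prefix on the policy constants is an assumption made so that policies and assignment modes can live in one package.

```go
package main

import "fmt"

type FlavorAssignmentMode int

const (
	NoFit FlavorAssignmentMode = iota
	Preempt
	Fit
)

type FlavorFungibilityPolicy string

const (
	PolicyBorrow        FlavorFungibilityPolicy = "Borrow"
	PolicyPreempt       FlavorFungibilityPolicy = "Preempt"
	PolicyTryNextFlavor FlavorFungibilityPolicy = "TryNextFlavor"
)

type FlavorFungibility struct {
	WhenCanBorrow  FlavorFungibilityPolicy
	WhenCanPreempt FlavorFungibilityPolicy
}

// shouldTryNextFlavor reports whether the scheduler moves on to the next
// flavor after evaluating the current one.
func shouldTryNextFlavor(mode FlavorAssignmentMode, ff FlavorFungibility, needsBorrowing bool) bool {
	switch mode {
	case Preempt:
		// Stay in this flavor only if the policy says to preempt here.
		return ff.WhenCanPreempt != PolicyPreempt
	case Fit:
		// A fit without borrowing is always accepted; a fit that borrows is
		// accepted only if the policy says to borrow here.
		return needsBorrowing && ff.WhenCanBorrow != PolicyBorrow
	default: // NoFit
		return true
	}
}

func main() {
	defaults := FlavorFungibility{WhenCanBorrow: PolicyBorrow, WhenCanPreempt: PolicyTryNextFlavor}
	strict := FlavorFungibility{WhenCanBorrow: PolicyTryNextFlavor, WhenCanPreempt: PolicyPreempt}

	fmt.Println(shouldTryNextFlavor(Preempt, defaults, false)) // true
	fmt.Println(shouldTryNextFlavor(Preempt, strict, false))   // false
	fmt.Println(shouldTryNextFlavor(Fit, defaults, true))      // false
	fmt.Println(shouldTryNextFlavor(Fit, strict, true))        // true
	fmt.Println(shouldTryNextFlavor(Fit, strict, false))       // false
	fmt.Println(shouldTryNextFlavor(NoFit, strict, false))     // true
}
```
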
We store the scheduling context in the workload info so that a later attempt can resume from where the previous attempt stopped. This avoids repeatedly spending time on the same flavor when we tried to preempt in it and failed. The scheduling context contains `LastAssignedFlavorIdx`, the `ClusterQueueGeneration` of the CQ, and the `CohortGeneration`. Any change to the generations restarts scheduling from the first flavor.

`ClusterQueueGeneration` and `CohortGeneration` track changes in the resource consumption of the CQ and the cohort. Whenever the available resources of the CQ or the cohort increase, we increment the corresponding generation, so if the generation in the scheduling context is lower than the current one, we retry from the first flavor. Note that a decrease in available resources followed by an increase also increments the generation. We consider this acceptable because storing only the generation uses less memory than storing the usage state for each scheduling attempt.

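As a rough sketch of the resume logic described above, the function below picks the flavor index at which the next attempt starts. The type `schedulingState` and the function `nextFlavorIdx` are hypothetical names, the state is simplified to a single index instead of one index per podSet and resource, and wrapping back to the first flavor after the last one is an assumption.

```go
package main

import "fmt"

// schedulingState is a hypothetical, simplified form of the stored context.
type schedulingState struct {
	LastAssignedFlavorIdx  int
	ClusterQueueGeneration int64
	CohortGeneration       int64
}

// nextFlavorIdx returns the index of the flavor the next attempt starts from.
func nextFlavorIdx(last *schedulingState, cqGen, cohortGen int64, numFlavors int) int {
	// No stored context, or resources were freed since it was stored:
	// start over from the first (most preferred) flavor.
	if last == nil || last.ClusterQueueGeneration < cqGen || last.CohortGeneration < cohortGen {
		return 0
	}
	next := last.LastAssignedFlavorIdx + 1
	if next >= numFlavors {
		return 0 // assumption: wrap around once every flavor has been tried
	}
	return next
}

func main() {
	st := &schedulingState{LastAssignedFlavorIdx: 0, ClusterQueueGeneration: 1, CohortGeneration: 1}
	fmt.Println(nextFlavorIdx(nil, 1, 1, 2)) // 0: first attempt
	fmt.Println(nextFlavorIdx(st, 1, 1, 2))  // 1: resume after flavor 0
	fmt.Println(nextFlavorIdx(st, 2, 1, 2))  // 0: CQ generation increased
	fmt.Println(nextFlavorIdx(st, 1, 2, 2))  // 0: cohort generation increased
}
```
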
For example, suppose a ClusterQueue has 2 resource groups and a workload has 1 podSet, as follows:

```
...
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: "default-flavor1"
      resources:
      - name: "cpu"
        nominalQuota: 3
      - name: "memory"
        nominalQuota: 600Mi
    - name: "default-flavor2"
      resources:
      - name: "cpu"
        nominalQuota: 3
      - name: "memory"
        nominalQuota: 600Mi
  - coveredResources: ["gpu"]
    flavors:
    - name: "vendor1"
      resources:
      - name: "gpu"
        nominalQuota: 9
    - name: "vendor2"
      resources:
      - name: "gpu"
        nominalQuota: 9
---
...
  podSets:
  - count: 3
    spec:
      containers:
      - ...
        resources:
          requests:
            cpu: "1"
            memory: 200Mi
            gpu: 1
```

We first try `default-flavor1` for the cpu and memory resources. If the podSet doesn't fit in `default-flavor1`, we try to preempt in `default-flavor1`. If we cannot find enough preemption candidates in `default-flavor1`, the workload starts from `default-flavor2` in the next scheduling attempt.

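The two attempts in this example can be traced with a toy traversal. The sketch assumes `WhenCanPreempt` is `Preempt`, ignores borrowing, and uses hard-coded per-flavor outcomes; `flavorOutcome` and `assign` are hypothetical names, not part of Kueue.

```go
package main

import "fmt"

type mode string

const (
	noFit   mode = "NoFit"
	preempt mode = "Preempt"
	fit     mode = "Fit"
)

// flavorOutcome is a hypothetical stand-in for the result of checking one
// flavor's quota for the cpu and memory requests.
type flavorOutcome struct {
	name string
	mode mode
}

// assign walks the flavors from index start and stops at the first flavor
// that fits or that allows preemption. It returns that flavor's index, or -1
// if no remaining flavor is usable.
func assign(flavors []flavorOutcome, start int) int {
	for i := start; i < len(flavors); i++ {
		if flavors[i].mode == fit || flavors[i].mode == preempt {
			return i
		}
	}
	return -1
}

func main() {
	flavors := []flavorOutcome{
		{name: "default-flavor1", mode: preempt}, // quota is used, preemption looks possible
		{name: "default-flavor2", mode: fit},
	}

	// Attempt 1 starts from the first flavor and stops there to preempt.
	first := assign(flavors, 0)
	fmt.Println("attempt 1:", flavors[first].name, flavors[first].mode) // attempt 1: default-flavor1 Preempt

	// Suppose preemption found too few candidates; attempt 2 resumes after it.
	second := assign(flavors, first+1)
	fmt.Println("attempt 2:", flavors[second].name, flavors[second].mode) // attempt 2: default-flavor2 Fit
}
```
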
### Implementation

```
func assignFlavors(
	log logr.Logger,
	requests []workload.PodSetResources,
	podSets []kueue.PodSet,
	resourceFlavors map[kueue.ResourceFlavorReference]*kueue.ResourceFlavor,
	cq *cache.ClusterQueue,
	lastAssignment *workload.AssigmentClusterQueueState,
) Assignment {
	var assignment Assignment
	if lastAssignment != nil {
		// Resume from the scheduling context saved by a previous attempt.
		assignment = Assignment{
			TotalBorrow: make(cache.FlavorResourceQuantities),
			PodSets:     make([]PodSetAssignment, 0, len(requests)),
			LastState:   *lastAssignment,
			Usage:       make(cache.FlavorResourceQuantities),
		}
	} else {
		// First attempt: start a fresh context at the current generations.
		assignment = Assignment{
			TotalBorrow: make(cache.FlavorResourceQuantities),
			PodSets:     make([]PodSetAssignment, 0, len(requests)),
			LastState: workload.AssigmentClusterQueueState{
				LastAssignedFlavorIdx:  make([]map[corev1.ResourceName]int, 0, len(podSets)),
				CohortGeneration:       0,
				ClusterQueueGeneration: cq.Generation,
			},
			Usage: make(cache.FlavorResourceQuantities),
		}
		if cq.Cohort != nil {
			assignment.LastState.CohortGeneration = cq.Cohort.Generation
		}
	}
	// ...
}

// shouldTryNextFlavor reports whether to continue with the next flavor after
// evaluating the current one.
func shouldTryNextFlavor(representativeMode FlavorAssignmentMode, flavorFungibility v1beta1.FlavorFungibility, whetherNeedBorrowing bool) bool {
	policyPreempt := flavorFungibility.WhenCanPreempt
	policyBorrow := flavorFungibility.WhenCanBorrow
	// Preemption is possible here and the policy says to preempt in this flavor.
	if representativeMode == Preempt && policyPreempt == v1beta1.Preempt {
		return false
	}

	// The workload fits by borrowing and the policy says to borrow in this flavor.
	if representativeMode == Fit && whetherNeedBorrowing && policyBorrow == v1beta1.Borrow {
		return false
	}

	// The workload fits without borrowing.
	if representativeMode == Fit && !whetherNeedBorrowing {
		return false
	}

	return true
}
```

### Test Plan

[Y] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

##### Prerequisite testing updates

#### Unit Tests

- `pkg/cache`: `2023-8-22` - `82.9%`
- `pkg/scheduler`: `2023-8-22` - `80.7%`
- `pkg/webhook`: `2023-8-22` - `71.2%`
- `pkg/workload`: `2023-8-22` - `54.9%`

#### Integration tests

Scenarios in which `WhenCanBorrow` is set to `Borrow` and `WhenCanPreempt` is set to `TryNextFlavor` behave the same as the current implementation. The added integration tests therefore cover these scenarios:

- `WhenCanBorrow` is set to `TryNextFlavor`.
- `WhenCanPreempt` is set to `Preempt`.

### Graduation Criteria

## Implementation History
