# KEP-79: Hierarchical Cohorts

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories (Optional)](#user-stories-optional)
    - [Story 1](#story-1)
    - [Story 2](#story-2)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Test Plan](#test-plan)
      - [Prerequisite testing updates](#prerequisite-testing-updates)
    - [Unit Tests](#unit-tests)
  - [Graduation Criteria](#graduation-criteria)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
<!-- /toc -->

## Summary

Introduce a top-level Cohort object that allows setting up a multi-level quota
hierarchy with advanced borrowing and lending mechanisms.

## Motivation

The current 2-level hierarchy (ClusterQueues and Cohorts) is not expressive
enough to handle complex use cases of large organizations with tree-like
team and quota/budget structures.

### Goals

* Create a multi-level hierarchy for advanced quota management.
* Be compatible with the existing ClusterQueue API and mechanics.
* Allow setting constraints about borrowing and lending at all levels.
* Provide quota for groups of Queues.

### Non-Goals

* Change the existing API and mechanics in a backward-incompatible way.
* Introduce an alternative API to ClusterQueue.
* Introduce new ways of fair sharing, like ratio-based sharing (at least
  not in this KEP).
* Introduce additional preemption models (this will be in a separate KEP).

## Proposal

Introduce a new object called Cohort with a quota provisioning mechanism similar
to that of ClusterQueue. A Cohort may additionally specify its parent, another Cohort,
so that Cohorts together form a tree-like organization structure. ClusterQueues will still be able to specify
the Cohort they belong to. The Cohort referenced by a ClusterQueue doesn't require
an actual object to be present. If no such object is provided, it is understood
that the Cohort doesn't provide any quota, has no parent, doesn't belong to any bigger
structure, and has no non-default settings.

The differences between ClusterQueue and Cohort will be that:

* ClusterQueues are leaves in the organization tree, Cohorts are inner nodes.
* A Cohort doesn't accept any workloads.
* Nominal quota provided at the Cohort level is to be shared with the entire organization and doesn't
  have an owning ClusterQueue.
* Borrowing limit specified at the Cohort level means that the entire subtree cannot borrow more
  from the rest of the organization tree than the given value.
* Lending limit specified at the Cohort level means that the rest of the organization tree
  cannot borrow more from the subtree than the given value.

Preemptions and resource reclamation will happen across the whole Cohort structure,
in a similar fashion to how they are executed now.

### User Stories (Optional)

#### Story 1

I have two multi-team organizations in the company: one that does research and one that runs production
workloads. Both are given some quota that is further distributed among the subteams. I want to grant
the production workloads the ability to borrow research quota if needed, but not the other way round.

With this proposal, the research org's top Cohort will simply set borrowingLimit to 0. Alternatively, the production
org's top Cohort can set lendingLimit to 0. The borrowingLimits of the production org's ClusterQueues should
be generous enough to allow borrowing from the research org.

#### Story 2

I have a couple of organizations that have dedicated resources. The organizations should not borrow
from each other. However, I want to have an additional "special" queue, with low-priority jobs, that can
borrow unused capacity from any of the organizations.

With this proposal, the Cohorts for the organizations will set borrowingLimit to 0. The top-level Cohort will
contain all of these Cohorts, plus the "special" ClusterQueue, with borrowingLimit set to infinity.

### Risks and Mitigations

* Users may create a cycle in the Cohort hierarchy. In that case, Kueue will stop all new admissions within
the entire tree. The already admitted workloads will be allowed to continue. Appropriate
ClusterQueue/Cohort Status Conditions will be set and Events emitted.

* Scheduling and preemption may require more computation/resources.
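
The cycle risk above could be detected by walking parent links. The following is a minimal sketch, not Kueue's implementation: the `parentOf` map and the `hasCycle` function are hypothetical names standing in for the parent references between Cohorts.

```go
package main

import "fmt"

// parentOf maps a Cohort name to the name of its parent Cohort.
// A Cohort without an entry is treated as a root. This map is a
// hypothetical stand-in for the parent reference of each Cohort.
type parentOf map[string]string

// hasCycle reports whether following parent links from start ever
// revisits a Cohort, which would mean the hierarchy is not a tree.
func hasCycle(parents parentOf, start string) bool {
	seen := map[string]bool{}
	for cur := start; ; {
		if seen[cur] {
			return true
		}
		seen[cur] = true
		next, ok := parents[cur]
		if !ok {
			return false // reached a root without revisiting anything
		}
		cur = next
	}
}

func main() {
	parents := parentOf{"team-a": "org", "org": "company", "company": "team-a"}
	fmt.Println(hasCycle(parents, "team-a")) // true: company points back to team-a

	delete(parents, "company") // company becomes a root again
	fmt.Println(hasCycle(parents, "team-a")) // false
}
```

A check like this only needs the names of the Cohorts on the path to the root, so it also works when a referenced parent has no actual object.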

## Design Details

The Cohort API will initially start only with the basic functionality. Additional policies
regarding sharing Cohort resources can be added later.

```go
type Cohort struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    Spec   CohortSpec   `json:"spec,omitempty"`
    Status CohortStatus `json:"status,omitempty"`
}

type CohortSpec struct {
    // Cohort parent name. The parent Cohort object doesn't have to exist.
    // In such case, it is assumed that parent simply doesn't have any
    // quota and limits and doesn't have any other custom settings.
    Parent *string `json:"parent,omitempty"`

    // resourceGroups describes groups of resources that the Cohort can
    // share with ClusterQueues within the same group of Cohorts/ClusterQueues.
    // Each resource group defines the list of resources and a list of flavors
    // that provide quotas for these resources.
    // Each resource and each flavor can only form part of one resource group.
    // resourceGroups can be up to 16.
    //
    // BorrowingLimit specifies how much ClusterQueues under this Cohort can borrow
    // from ClusterQueues/Cohorts that are NOT under this Cohort. For Cohorts without
    // a parent (top of the hierarchy) the BorrowingLimit has to be 0.
    //
    // LendingLimit specifies how much ClusterQueues that are NOT under this Cohort
    // can borrow from the ClusterQueues/Cohorts that are under this Cohort.
    // If any of the Limits is not specified it means that there is no limit
    // and ClusterQueues can borrow/lend as much as they want/have.
    //
    // +listType=atomic
    // +kubebuilder:validation:MaxItems=16
    ResourceGroups []ResourceGroup `json:"resourceGroups,omitempty"`
}

const (
    // Condition indicating that a Cohort is correctly configured (for example, there is no cycle).
    CohortActive = "CohortActive"
)

// Status of the Cohort. May be empty if Cohort support is not enabled in alpha.
// Status and stats may not cover the entire subtree, as the number of needed updates
// per workload admission may be too high.
type CohortStatus struct {
    // conditions hold the latest available observations of the Cohort's
    // current state.
    // +optional
    // +listType=map
    // +listMapKey=type
    // +patchStrategy=merge
    // +patchMergeKey=type
    Conditions []metav1.Condition `json:"conditions,omitempty" patchStrategy:"merge" patchMergeKey:"type"`

    // Additional stats may be added in the future, like the number
    // of admitted workloads, their usage etc, based on the user feedback.
}
```

Currently, with the 2-level hierarchy, Kueue makes sure that the following
balances are kept for each Cohort and ClusterQueue:

* ClusterQueues don't use more resources than they have and could possibly borrow.
* Within a Cohort, the total amount of requested capacity doesn't exceed the total quota
  from all ClusterQueues, constrained by LendingLimit.

Admission of a new workload can happen if both balances are kept after
adding the new workload. Kueue doesn't track who is borrowing from or lending to whom. It
is enough that the balances are kept: with good balances, there exists a borrower-lender
mapping that fulfills all needs.

With Hierarchical Cohorts, Kueue will check whether the correct balances are kept
across the whole Cohort subtree. To make this precise,
let's define a function `T(x,r)` that takes either a ClusterQueue x
or a Cohort x and a resource r (from a specific resource flavor).

`T(x,r)` returns the amount of resource r that is available at the level of x from ClusterQueues
and Cohorts that are either x or children of x (possibly indirect). In other words, how much of resource r can come from
the subtree. The value may be negative, which means that the subtree is borrowing from outside of the subtree (the rest
of the hierarchy).

`T(x,r)` can be relatively easily calculated while traversing the Cohort tree.

* `T(x,r)` when x is a ClusterQueue:
$$T(x,r) = quota(x,r) - usage(x,r)$$

* `T(x,r)` when x is a Cohort:
$$T(x,r) = quota(x,r) + \sum_{c \in children(x)} \min(lendingLimit(c,r), T(c,r))$$

With a correct admission process, `T(x,r) >= -borrowingLimit(x,r)` holds for any x and r.
Otherwise there would be too big a debt at level x: some subtree would be requesting more than allowed.
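
As a worked illustration of the formulas and the inequality, consider a hypothetical hierarchy (all names and numbers are assumptions, not taken from the proposal): a root Cohort R with no quota of its own, a ClusterQueue A with quota 10, usage 14 and borrowingLimit 5, and a ClusterQueue B with quota 10, usage 3 and lendingLimit 4. A has no lendingLimit, so the `min` leaves its negative balance unchanged.

$$
\begin{aligned}
T(A,r) &= 10 - 14 = -4 \ge -borrowingLimit(A,r) = -5 \\
T(B,r) &= 10 - 3 = 7 \\
T(R,r) &= 0 + \min(\infty, -4) + \min(4, 7) = 0 \ge -borrowingLimit(R,r) = 0
\end{aligned}
$$

Under these assumed numbers every level satisfies the inequality. If A's usage were 15 instead, `T(A,r) = -5` would still be within A's own borrowingLimit, but `T(R,r) = -5 + 4 = -1 < 0`, so the debt at the root would be too big: B may lend only 4 of its 7 unused units.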

Slightly less obvious, but also true, is the converse: **if the debt is not too big at any level, then the admission is correct**.

Negative `T(x,r)` represents the total amount of resources that a subtree is borrowing. Positive `T(x,r)` represents the total
amount of resources that a subtree can deliver (with respect to the `lendingLimit`). At the very top of the hierarchy, `T(x,r) >= 0`
(`borrowingLimit` is 0 there since there is no one to borrow from). `T(x,r) >= 0` can also occur within the hierarchy.

`T(x,r) >= 0` means that at the level of x, the negative balance of all subtrees can be evened out by
other subtrees that have some extra capacity, with respect to their lendingLimit. Extra capacity can be
"passed" to the subtrees that need it. After this passing, the previously negative subtree becomes non-negative, and
we can re-apply the logic there. All negative sub-subtrees can be balanced out by positive subtrees and the capacity
that is coming from "above". And so on, down to the individual ClusterQueues.

So a new workload can be admitted to a ClusterQueue if and only if, after admission, `T(x,r) >= -borrowingLimit(x,r)`
stays true at all elements of the hierarchy.
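
The admission condition above can be sketched as a recursive traversal. This is a simplified model under stated assumptions, not Kueue's scheduler code: it tracks a single resource of a single flavor, uses `float64` amounts, and the `node` type, `unlimited`, `balance` and `canAdmit` are hypothetical names. An unspecified limit is modeled as positive infinity.

```go
package main

import (
	"fmt"
	"math"
)

// unlimited models a borrowingLimit or lendingLimit that is not specified.
var unlimited = math.Inf(1)

// node is a hypothetical, simplified model of a ClusterQueue (leaf) or a
// Cohort (inner node) for one resource r of one flavor.
type node struct {
	name           string
	quota          float64 // nominal quota(x, r)
	usage          float64 // usage(x, r); stays 0 for Cohorts
	borrowingLimit float64
	lendingLimit   float64
	children       []*node
}

// balance returns T(x, r) and whether T(y, r) >= -borrowingLimit(y, r)
// holds for every y in the subtree rooted at x.
func balance(x *node) (t float64, ok bool) {
	t, ok = x.quota-x.usage, true
	for _, c := range x.children {
		ct, cok := balance(c)
		t += math.Min(c.lendingLimit, ct)
		ok = ok && cok
	}
	return t, ok && t >= -x.borrowingLimit
}

// canAdmit reports whether cq could use `request` more of the resource
// while the inequality still holds everywhere in the tree under root.
func canAdmit(root, cq *node, request float64) bool {
	old := cq.usage
	cq.usage += request
	_, ok := balance(root)
	cq.usage = old // this is only a check, so restore the usage
	return ok
}

func main() {
	// Story 1: production may borrow from research, but not vice versa.
	researchCQ := &node{name: "research-cq", quota: 10, borrowingLimit: unlimited, lendingLimit: unlimited}
	prodCQ := &node{name: "prod-cq", quota: 10, usage: 2, borrowingLimit: unlimited, lendingLimit: unlimited}
	research := &node{name: "research", borrowingLimit: 0, lendingLimit: unlimited, children: []*node{researchCQ}}
	prod := &node{name: "prod", borrowingLimit: unlimited, lendingLimit: unlimited, children: []*node{prodCQ}}
	root := &node{name: "company", borrowingLimit: 0, lendingLimit: unlimited, children: []*node{research, prod}}

	fmt.Println(canAdmit(root, prodCQ, 15))     // true: 7 units come from research
	fmt.Println(canAdmit(root, researchCQ, 12)) // false: research has borrowingLimit 0
}
```

The sketch recomputes the whole tree on every check for clarity. Since only the path from the ClusterQueue to the root changes on admission, a real implementation could presumably limit the recomputation to that path.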

### Test Plan

[x] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

##### Prerequisite testing updates

#### Unit Tests

As the hierarchical cohorts reside entirely inside Kueue, most of the
tests will be done as unit and integration tests, checking things like:

* Existing functionality at 2-levels.
* Long-distance borrowing on multi-level hierarchy.
* Lending/borrowing limits placed on many levels.
* Preemptions across hierarchy.

### Graduation Criteria

This is a core API element and will graduate together with the other core APIs.

## Implementation History

* 2023.12.29 - KEP - API and semantics.

## Drawbacks

It makes scheduling even more complex and computationally heavy. With complex limits
and quotas it may be hard for users to keep them under control.

## Alternatives

* https://github.com/kubernetes-sigs/kueue/pull/1093 - Hierarchical ClusterQueues.