# KEP-79: Hierarchical Cohorts

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories (Optional)](#user-stories-optional)
    - [Story 1](#story-1)
    - [Story 2](#story-2)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Test Plan](#test-plan)
      - [Prerequisite testing updates](#prerequisite-testing-updates)
    - [Unit Tests](#unit-tests)
  - [Graduation Criteria](#graduation-criteria)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
<!-- /toc -->

## Summary

Introduce a top-level Cohort object that allows setting up a multi-level quota
hierarchy with advanced borrowing and lending mechanisms.

## Motivation

The current 2-level hierarchy (ClusterQueues and Cohorts) is not expressive
enough to handle complex use cases of large organizations with tree-like
team and quota/budget structures.

### Goals

* Create a multi-level hierarchy for advanced quota management.
* Be compatible with the existing ClusterQueue API and mechanics.
* Allow setting constraints about borrowing and lending at all levels.
* Provide quota for groups of Queues.

### Non-Goals

* Change the existing API and mechanics in a backward-incompatible way.
* Introduce an alternative API to ClusterQueue.
* Introduce new ways of fair sharing, like ratio-based sharing (at least
  not in this KEP).
* Introduce additional preemption models (this will be in a separate KEP).

## Proposal

Introduce a new object called Cohort with a quota provisioning mechanism similar
to that of ClusterQueue. A Cohort may additionally specify its parent, another Cohort,
so that Cohorts together form a tree-like organization structure.
ClusterQueues will still be able to specify
the Cohort they belong to. The Cohort named by a ClusterQueue doesn't require
an actual object to be present. If no such object is provided, it is understood
that the Cohort doesn't provide any quota, has no parent, doesn't belong to any bigger
structure, and has no non-default settings.

The differences between ClusterQueue and Cohort will be that:

* ClusterQueues are leaves in the organization tree, Cohorts are inner nodes.
* A Cohort doesn't accept any workloads.
* Nominal quota provided at the Cohort level is to be shared with the entire organization and doesn't
  have an owning ClusterQueue.
* A borrowing limit specified at the Cohort level means that the entire subtree cannot borrow more
  from the rest of the organization tree than the given value.
* A lending limit specified at the Cohort level means that the rest of the organization tree
  cannot borrow more from the subtree than the given value.

Preemptions and resource reclamation will happen across the whole Cohort structure,
in a similar fashion to how they are executed now.

### User Stories (Optional)

#### Story 1

I have two multi-team organizations in the company. One that does research and one that runs production
workloads. Both are given some quota that is further distributed among the subteams. I want to grant
the production workloads the ability to borrow research quota if needed, but not the other way round.

With this proposal, the research org's top Cohort will simply set borrowingLimit to 0. Alternatively, the production
org's top Cohort can set lendingLimit to 0. The borrowingLimits inside the production org's ClusterQueues should
be generous enough to allow borrowing from the research org.

#### Story 2

I have a couple of organizations that have dedicated resources.
The organizations should not borrow
from each other; however, I want to have an additional "special" queue, with low-priority jobs, that can
borrow unused capacity from any of the organizations.

With this proposal, the Cohorts for the organizations will set borrowingLimit to 0. The top-level Cohort will
contain all of these Cohorts, plus the "special" ClusterQueue, whose borrowingLimit is set to infinity.

### Risks and Mitigations

* Users may create a cycle in the Cohort hierarchy. In that case Kueue will stop all new admissions within
  the entire tree. The already admitted workloads will be allowed to continue. Appropriate
  ClusterQueue/Cohort Status Conditions will be set and Events emitted.

* Scheduling and preemption may require more computation/resources.

## Design Details

The Cohort API will initially start only with the basic functionality. Additional policies
regarding sharing Cohort resources can be added later.

```go

type Cohort struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   CohortSpec   `json:"spec,omitempty"`
	Status CohortStatus `json:"status,omitempty"`
}

type CohortSpec struct {
	// Cohort parent name. The parent Cohort object doesn't have to exist.
	// In that case, it is assumed that the parent simply doesn't have any
	// quota and limits and doesn't have any other custom settings.
	Parent *string `json:"parent,omitempty"`

	// resourceGroups describes groups of resources that the Cohort can
	// share with ClusterQueues within the same group of Cohorts/ClusterQueues.
	// Each resource group defines the list of resources and a list of flavors
	// that provide quotas for these resources.
	// Each resource and each flavor can only form part of one resource group.
	// resourceGroups can be up to 16.
	//
	// BorrowingLimit specifies how much ClusterQueues under this Cohort can borrow
	// from ClusterQueues/Cohorts that are NOT under this Cohort. For Cohorts without
	// a parent (top of the hierarchy) the BorrowingLimit has to be 0.
	//
	// LendingLimit specifies how much ClusterQueues that are NOT under this Cohort
	// can borrow from the ClusterQueues/Cohorts that are under this Cohort.
	// If any of the limits is not specified, it means that there is no limit
	// and ClusterQueues can borrow/lend as much as they want/have.
	//
	// +listType=atomic
	// +kubebuilder:validation:MaxItems=16
	ResourceGroups []ResourceGroup `json:"resourceGroups,omitempty"`
}

const (
	// Condition indicating that a Cohort is correctly configured (for example, there is no cycle).
	CohortActive = "CohortActive"
)

// Status of the Cohort. May be empty if Cohort support is not enabled in alpha.
// Status and stats may not cover the entire subtree, as the number of needed updates
// per workload admission may be too high.
type CohortStatus struct {
	// conditions hold the latest available observations of the Cohort's
	// current state.
	// +optional
	// +listType=map
	// +listMapKey=type
	// +patchStrategy=merge
	// +patchMergeKey=type
	Conditions []metav1.Condition `json:"conditions,omitempty" patchStrategy:"merge" patchMergeKey:"type"`

	// Additional stats may be added in the future, like the number
	// of admitted workloads, their usage etc., based on user feedback.
}
```

Currently, with the 2-level hierarchy, Kueue makes sure that the following
balances are kept for each Cohort and ClusterQueue:

* ClusterQueues don't use more resources than they have and could possibly borrow.
* Within a Cohort, the total amount of requested capacity doesn't exceed the total quota
  from all ClusterQueues, constrained by LendingLimit.

Admission of a new workload can happen if both balances are kept after
adding the new workload. Kueue doesn't track who is borrowing from or lending to whom. It
is enough that the balances are kept: with good balances, there exists a borrower-lender
mapping that fulfills all needs.

With Hierarchical Cohorts, Kueue will check whether the correct balances are kept
across the whole Cohort subtree. To make this precise,
let's define a function `T(x,r)` that takes either a ClusterQueue x
or a Cohort x and a resource r (from a specific resource flavor).

`T(x,r)` returns the amount of resource r that is available at the level of x from ClusterQueues
and Cohorts that are either x or children of x (possibly indirect). In other words, how much of resource r can come from
the subtree. The value may be negative, which means that the subtree is borrowing from outside of the subtree (the rest
of the hierarchy).

`T(x,r)` can be calculated relatively easily while traversing the Cohort tree.

* `T(x,r)` when x is a ClusterQueue:
$$T(x,r) = quota(x,r) - usage(x,r)$$

* `T(x,r)` when x is a Cohort:
$$T(x,r) = quota(x,r) + \sum_{c \in children(x)} \min(lendingLimit(c,r), T(c,r))$$

Obviously, with a correct admission process, for any x and r, `T(x,r) >= -borrowingLimit(x,r)`.
Otherwise there would be too much debt at level x: some subtree is requesting more than allowed.

Slightly less obvious, but also true: **If there is no excessive debt at any level, then the admission is correct**.

A negative `T(x,r)` represents the total amount of resources that a subtree is borrowing. A positive `T(x,r)` represents the total
amount of resources that a subtree can deliver (subject to its `lendingLimit`). At the very top of the hierarchy `T(x,r) >= 0`
(`borrowingLimit` is 0 there, since there is no one to borrow from). `T(x,r) >= 0` can also occur within the hierarchy.

`T(x,r) >= 0` means that at the level of x, the negative balance of all subtrees can be evened out by
other subtrees that have some extra capacity, with respect to their lendingLimit. Extra capacity can be
"passed" to the subtrees that need it. After this passing, the previously negative subtree is no longer negative, and
we can re-apply the logic there. All negative sub-subtrees can be balanced out by positive subtrees and the capacity
coming from "above". And so on, until individual ClusterQueues are reached.

So a new workload can be admitted to a ClusterQueue if and only if, after admission, `T(x,r) >= -borrowingLimit(x,r)`
stays true at all elements of the hierarchy.

### Test Plan

[x] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

##### Prerequisite testing updates

#### Unit Tests

As the hierarchical cohorts reside entirely inside Kueue, most of the
tests will be done as unit and integration tests, checking things like:

* Existing functionality at 2 levels.
* Long-distance borrowing in a multi-level hierarchy.
* Lending/borrowing limits placed on many levels.
* Preemptions across the hierarchy.

### Graduation Criteria

This is a core API element and will graduate together with the other core APIs.


## Implementation History

* 2023.12.29 - KEP - API and semantics.

## Drawbacks

It makes scheduling even more complex and computation-heavy. With complex limits
and quotas it may be hard for users to keep them under control.

## Alternatives

* https://github.com/kubernetes-sigs/kueue/pull/1093 - Hierarchical ClusterQueues.