sigs.k8s.io/kueue@v0.6.2/keps/79-hierarchical-cohorts/README.md

# KEP-79: Hierarchical Cohorts

- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories (Optional)](#user-stories-optional)
    - [Story 1](#story-1)
    - [Story 2](#story-2)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Test Plan](#test-plan)
    - [Prerequisite testing updates](#prerequisite-testing-updates)
    - [Unit Tests](#unit-tests)
  - [Graduation Criteria](#graduation-criteria)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)

## Summary

Introduce a top-level Cohort object that allows setting up a multi-level quota
hierarchy with advanced borrowing and lending mechanisms.

## Motivation

The current 2-level hierarchy (ClusterQueues and Cohorts) is not expressive
enough to handle the complex use cases of large organizations with tree-like
team and quota/budget structures.

### Goals

* Create a multi-level hierarchy for advanced quota management.
* Be compatible with the existing ClusterQueue API and mechanics.
* Allow setting borrowing and lending constraints at all levels.
* Provide quota for groups of Queues.

### Non-Goals

* Change the existing API and mechanics in a backward-incompatible way.
* Introduce an alternative API to ClusterQueue.
* Introduce new ways of fair sharing, like ratio-based sharing (at least
  not in this KEP).
* Introduce additional preemption models (this will be in a separate KEP).

## Proposal

Introduce a new object called Cohort with a quota provisioning mechanism similar
to that of ClusterQueue. A Cohort may additionally specify its parent, another Cohort,
so that Cohorts together form a tree-like organization structure. ClusterQueues will still be able to specify
the Cohort they belong to. The Cohort named by a ClusterQueue doesn't require
an actual object to be present. If no such object is provided, it is understood
that the Cohort doesn't provide any quota, has no parent, doesn't belong to any bigger
structure, and has no non-default settings.

The differences between ClusterQueue and Cohort will be that:

* ClusterQueues are leaves in the organization tree, Cohorts are inner nodes.
* A Cohort doesn't accept any workloads.
* Nominal quota provided at the Cohort level is shared with the entire organization and doesn't
  have an owning ClusterQueue.
* A borrowing limit specified at the Cohort level means that the entire subtree cannot borrow more
  from the rest of the organization tree than the given value.
* A lending limit specified at the Cohort level means that the rest of the organization tree
  cannot borrow more from the subtree than the given value.

Preemptions and resource reclamation will happen across the whole Cohort structure,
in a similar fashion to how they are executed now.

### User Stories (Optional)

#### Story 1

I have two multi-team organizations in the company: one that does research and one that runs production
workloads. Both are given some quota that is further distributed among the subteams. I want to grant
the production workloads the ability to borrow research quota if needed, but not the other way round.

With this proposal, the research org's top Cohort will simply set borrowingLimit to 0. Alternatively, the production
org's top Cohort can set lendingLimit to 0. The BorrowingLimits inside the production org's ClusterQueues should
be generous enough to allow borrowing from the research org.

#### Story 2

I have a couple of organizations that have dedicated resources. The organizations should not borrow
from each other; however, I want to have an additional "special" queue, with low-priority jobs, that can
borrow unused capacity from any of the organizations.

With this proposal, the Cohorts for the organizations will set borrowingLimit to 0. The top-level Cohort will
contain all of these Cohorts, plus the "special" ClusterQueue, with borrowingLimit set to infinity.

### Risks and Mitigations

* Users may create a cycle in the Cohort hierarchy. In that case Kueue will stop all new admissions within
  the entire tree. The already admitted workloads will be allowed to continue. Appropriate
  ClusterQueue/Cohort Status Conditions will be set and Events emitted.

* Scheduling and preemption may require more computation/resources.
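
The cycle risk above can be detected by walking parent links. The following is only a minimal sketch under stated assumptions, not Kueue's implementation: the `hasCycle` helper and the `parents` map (Cohort name to parent name) are hypothetical names introduced for illustration.

```go
package main

import "fmt"

// hasCycle reports whether following parent links from start ever revisits
// a Cohort. parents maps a Cohort name to its parent name; a Cohort without
// a parent (or without an object at all) is simply absent from the map.
// Both the function and the map layout are assumptions made for this sketch.
func hasCycle(parents map[string]string, start string) bool {
	seen := map[string]bool{}
	cur := start
	for {
		if seen[cur] {
			return true // we came back to a Cohort already on this path
		}
		seen[cur] = true
		next, ok := parents[cur]
		if !ok {
			return false // reached the top of the hierarchy
		}
		cur = next
	}
}

func main() {
	parents := map[string]string{
		"team-a": "org",
		"org":    "company",
		"x":      "y",
		"y":      "z",
		"z":      "x",
	}
	fmt.Println(hasCycle(parents, "team-a")) // false
	fmt.Println(hasCycle(parents, "x"))      // true
}
```

A real controller would additionally have to mark every ClusterQueue and Cohort in the affected tree, as described in the first bullet; the sketch covers only the detection step.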

## Design Details

The Cohort API will initially start with only the basic functionality. Additional policies
regarding sharing Cohort resources can be added later.

```go
type Cohort struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   CohortSpec   `json:"spec,omitempty"`
	Status CohortStatus `json:"status,omitempty"`
}

type CohortSpec struct {
	// Cohort parent name. The parent Cohort object doesn't have to exist.
	// In such a case, it is assumed that the parent simply doesn't have any
	// quota or limits and doesn't have any other custom settings.
	Parent *string `json:"parent,omitempty"`

	// resourceGroups describes groups of resources that the Cohort can
	// share with ClusterQueues within the same group of Cohorts/ClusterQueues.
	// Each resource group defines the list of resources and a list of flavors
	// that provide quotas for these resources.
	// Each resource and each flavor can only form part of one resource group.
	// resourceGroups can be up to 16.
	//
	// BorrowingLimit specifies how much ClusterQueues under this Cohort can borrow
	// from ClusterQueues/Cohorts that are NOT under this Cohort. For Cohorts without
	// a parent (top of the hierarchy) the BorrowingLimit has to be 0.
	//
	// LendingLimit specifies how much ClusterQueues that are NOT under this Cohort
	// can borrow from the ClusterQueues/Cohorts that are under this Cohort.
	// If any of the Limits is not specified it means that there is no limit
	// and ClusterQueues can borrow/lend as much as they want/have.
	//
	// +listType=atomic
	// +kubebuilder:validation:MaxItems=16
	ResourceGroups []ResourceGroup `json:"resourceGroups,omitempty"`
}

const (
	// Condition indicating that a Cohort is correctly configured (for example, there is no cycle).
	CohortActive = "CohortActive"
)

// Status of the Cohort. May be empty if Cohort support is not enabled in alpha.
// Status and stats may not cover the entire subtree, as the number of needed updates
// per workload admission may be too high.
type CohortStatus struct {
	// conditions hold the latest available observations of the Conditions
	// current state.
	// +optional
	// +listType=map
	// +listMapKey=type
	// +patchStrategy=merge
	// +patchMergeKey=type
	Conditions []metav1.Condition `json:"conditions,omitempty" patchStrategy:"merge" patchMergeKey:"type"`

	// Additional stats may be added in the future, like the number
	// of admitted workloads, their usage etc, based on the user feedback.
}
```

Currently, with the 2-level hierarchy, for each Cohort and ClusterQueue Kueue
makes sure that the following balances are kept:

* ClusterQueues don't use more resources than they have and could possibly borrow.
* Within a Cohort, the total amount of requested capacity doesn't exceed the total quota
  from all ClusterQueues, constrained by LendingLimit.

Admission of a new workload can happen if both balances are kept after
adding the new workload. Kueue doesn't track who is borrowing from or lending to whom. It
is enough that the balances are kept: with good balances, there exists a borrower-lender
mapping that fulfills all needs.

With Hierarchical Cohorts, Kueue will check whether the correct balances are kept
across the whole Cohort subtree. To make this precise,
let's define a function `T(x,r)` that takes either a ClusterQueue x
or a Cohort x and a resource r (from a specific resource flavor).

`T(x,r)` returns the amount of resource r that is available at the level of x from ClusterQueues
and Cohorts that are either x or children of x (direct or indirect). In other words, how much of resource r can come from
the subtree. The value may be negative, which means that the subtree is borrowing from outside the subtree (the rest
of the hierarchy).

`T(x,r)` can be calculated relatively easily while traversing the Cohort tree.

* `T(x,r)` when x is a ClusterQueue:
  $$T(x,r) = quota(x,r) - usage(x,r)$$

* `T(x,r)` when x is a Cohort:
  $$T(x,r) = quota(x,r) + \sum_{c \in children(x)} \min(lendingLimit(c,r), T(c,r))$$
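
As a worked illustration of the two formulas, consider a small hypothetical tree for a single resource r (all names and numbers are made up for this example). A root Cohort with no quota of its own has two children: Cohort A and ClusterQueue q3 (quota 4, usage 5). Cohort A has quota 2 and contains ClusterQueues q1 (quota 10, usage 4, lendingLimit 3) and q2 (quota 5, usage 8). Every limit not mentioned is assumed to be unset, that is, unlimited.

$$
\begin{aligned}
T(q_1,r) &= 10 - 4 = 6 \\
T(q_2,r) &= 5 - 8 = -3 \\
T(q_3,r) &= 4 - 5 = -1 \\
T(A,r) &= 2 + \min(3, 6) + \min(\infty, -3) = 2 + 3 - 3 = 2 \\
T(root,r) &= 0 + \min(\infty, 2) + \min(\infty, -1) = 1
\end{aligned}
$$

In this example q1 could spare 6 units, but its lendingLimit caps its contribution at 3; q2 and q3 are borrowing 3 and 1 units respectively, and the balance at the root stays non-negative.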

Obviously, with a correct admission process, `T(x,r) >= -borrowingLimit(x,r)` holds for any x and r.
Otherwise the debt at level x would be too big: some subtree would be requesting more than allowed.

Slightly less obvious, but also true, is the converse: **if the debt is not too big at any level, then the admission is correct**.

A negative `T(x,r)` represents the total amount of resources that a subtree is borrowing. A positive `T(x,r)` represents the total
amount of resources that a subtree can deliver (subject to the `lendingLimit`). At the very top of the hierarchy, `T(x,r) >= 0`
(`borrowingLimit` is 0 there, since there is no one to borrow from). `T(x,r) >= 0` can also occur within the hierarchy.

`T(x,r) >= 0` means that, at the level of x, the negative balances of some subtrees can be evened out by
other subtrees that have some extra capacity, subject to their lendingLimit. Extra capacity can be
"passed" to the subtrees that need it. After this passing, the previously negative subtree is no longer negative, and
we can re-apply the logic there: all negative sub-subtrees can be balanced out by positive sub-subtrees and the capacity
coming from "above". And so on, down to the individual ClusterQueues.

So a new workload can be admitted to a ClusterQueue if and only if, after admission, `T(x,r) >= -borrowingLimit(x,r)`
remains true at every element of the hierarchy.
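
The admission rule can be sketched in Go as a single tree traversal. This is an illustrative sketch only, not Kueue's scheduler code: the `node` type, the `unlimited` constant, and the `balance` and `canAdmit` helpers are hypothetical, and the tree models a single resource in a single flavor. The scenario in `main` mirrors Story 1, with the research Cohort's borrowingLimit set to 0.

```go
package main

import (
	"fmt"
	"math"
)

// unlimited stands in for an unset borrowingLimit or lendingLimit (assumption).
const unlimited = math.MaxInt64

// node is a hypothetical view of a ClusterQueue (no children, may have usage)
// or a Cohort (has children, usage stays 0) for one resource r.
type node struct {
	name           string
	quota          int64
	usage          int64
	borrowingLimit int64
	lendingLimit   int64
	children       []*node
}

func minInt64(a, b int64) int64 {
	if a < b {
		return a
	}
	return b
}

// balance returns T(x,r) and whether T >= -borrowingLimit holds at x and at
// every node below it.
func balance(x *node) (int64, bool) {
	t, ok := x.quota-x.usage, true
	for _, c := range x.children {
		ct, cok := balance(c)
		t += minInt64(c.lendingLimit, ct)
		ok = ok && cok
	}
	return t, ok && t >= -x.borrowingLimit
}

// canAdmit tentatively adds request to cq's usage, checks the whole tree
// starting at root, and then rolls the usage back.
func canAdmit(root, cq *node, request int64) bool {
	cq.usage += request
	_, ok := balance(root)
	cq.usage -= request
	return ok
}

func main() {
	researchCQ := &node{name: "research-cq", quota: 10, borrowingLimit: unlimited, lendingLimit: unlimited}
	prodCQ := &node{name: "prod-cq", quota: 10, borrowingLimit: unlimited, lendingLimit: unlimited}
	research := &node{name: "research", borrowingLimit: 0, lendingLimit: unlimited, children: []*node{researchCQ}}
	production := &node{name: "production", borrowingLimit: unlimited, lendingLimit: unlimited, children: []*node{prodCQ}}
	root := &node{name: "root", borrowingLimit: 0, lendingLimit: unlimited, children: []*node{research, production}}

	fmt.Println(canAdmit(root, prodCQ, 14))     // true: production borrows 4 from research
	fmt.Println(canAdmit(root, researchCQ, 14)) // false: research may not borrow
	fmt.Println(canAdmit(root, prodCQ, 21))     // false: exceeds the total quota of 20
}
```

The sketch recomputes the whole tree on every check for clarity; an implementation concerned with the computation cost noted under Risks would more likely update only the path from the ClusterQueue to the root.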

### Test Plan

[x] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

##### Prerequisite testing updates

#### Unit Tests

As the hierarchical cohorts reside entirely inside Kueue, most of the
tests will be done as unit and integration tests, checking things like:

* Existing functionality at 2 levels.
* Long-distance borrowing in a multi-level hierarchy.
* Lending/borrowing limits placed on many levels.
* Preemptions across the hierarchy.

### Graduation Criteria

This is a core API element and will graduate together with the other core APIs.

## Implementation History

* 2023.12.29 - KEP - API and semantics.

## Drawbacks

It makes scheduling even more complex and computation-heavy. With complex limits
and quotas, it may be hard for users to keep them under control.

## Alternatives

* https://github.com/kubernetes-sigs/kueue/pull/1093 - Hierarchical ClusterQueues.