sigs.k8s.io/kueue@v0.6.2/keps/1337-preempt-within-cohort-while-borrowing/README.md (about) 1 # KEP-1337: Preempt within cohort while borrowing 2 3 <!-- 4 This is the title of your KEP. Keep it short, simple, and descriptive. A good 5 title can help communicate what the KEP is and should be considered as part of 6 any review. 7 --> 8 9 <!-- 10 A table of contents is helpful for quickly jumping to sections of a KEP and for 11 highlighting any additional information provided beyond the standard KEP 12 template. 13 14 Ensure the TOC is wrapped with 15 <code><!-- toc --&rt;<!-- /toc --&rt;</code> 16 tags, and then generate with `hack/update-toc.sh`. 17 --> 18 19 <!-- toc --> 20 - [Summary](#summary) 21 - [Motivation](#motivation) 22 - [Goals](#goals) 23 - [Non-Goals](#non-goals) 24 - [Proposal](#proposal) 25 - [User Stories (Optional)](#user-stories-optional) 26 - [Story 1](#story-1) 27 - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) 28 - [Risks and Mitigations](#risks-and-mitigations) 29 - [Flopping between preempting pair of workloads](#flopping-between-preempting-pair-of-workloads) 30 - [Design Details](#design-details) 31 - [API](#api) 32 - [Implementation overview](#implementation-overview) 33 - [Test Plan](#test-plan) 34 - [Prerequisite testing updates](#prerequisite-testing-updates) 35 - [Unit Tests](#unit-tests) 36 - [Integration tests](#integration-tests) 37 - [Graduation Criteria](#graduation-criteria) 38 - [Beta](#beta) 39 - [GA](#ga) 40 - [Implementation History](#implementation-history) 41 - [Drawbacks](#drawbacks) 42 - [Alternatives](#alternatives) 43 - [Reuse PreemptionPolicy enum](#reuse-preemptionpolicy-enum) 44 - [Minimal API](#minimal-api) 45 - [ReclaimWithinCohortConfig](#reclaimwithincohortconfig) 46 - [Move ReclaimWithinCohort under ReclaimWithinCohortConfig](#move-reclaimwithincohort-under-reclaimwithincohortconfig) 47 <!-- /toc --> 48 49 ## Summary 50 51 This KEP introduces ClusterQueue-level API to support workload preemption 52 to reclaim quota within cohort for a workload that requires borrowing. 53 54 <!-- 55 This section is incredibly important for producing high-quality, user-focused 56 documentation such as release notes or a development roadmap. It should be 57 possible to collect this information before implementation begins, in order to 58 avoid requiring implementors to split their attention between writing release 59 notes and implementing the feature itself. KEP editors and SIG Docs 60 should help to ensure that the tone and content of the `Summary` section is 61 useful for a wide audience. 62 63 A good summary is probably at least a paragraph in length. 64 65 Both in this section and below, follow the guidelines of the [documentation 66 style guide]. In particular, wrap lines to a reasonable length, to make it 67 easier for reviewers to cite specific portions, and to minimize diff churn on 68 updates. 69 70 [documentation style guide]: https://github.com/kubernetes/community/blob/master/contributors/guide/style-guide.md 71 --> 72 73 ## Motivation 74 75 Currently, there is no support for preemption to reclaim quota within cohort 76 for workloads which require borrowing. This is too constraining for some setups. 77 78 <!-- 79 This section is for explicitly listing the motivation, goals, and non-goals of 80 this KEP. Describe why the change is important and the benefits to users. The 81 motivation section can optionally provide links to [experience reports] to 82 demonstrate the interest in a KEP within the wider Kubernetes community. 83 84 [experience reports]: https://github.com/golang/go/wiki/ExperienceReports 85 --> 86 87 ### Goals 88 89 - support preemption to reclaim quota within cohort for workloads that require borrowing. 90 - support preemption within cluster queue for workloads that require borrowing. 91 - modify the workload ordering by scheduler to take the need to preempt into account. 92 93 <!-- 94 List the specific goals of the KEP. What is it trying to achieve? How will we 95 know that this has succeeded? 96 --> 97 98 ### Non-Goals 99 100 - allow preemption of workloads, from other ClusterQueues, which are not borrowing. 101 102 <!-- 103 What is out of scope for this KEP? Listing non-goals helps to focus discussion 104 and make progress. 105 --> 106 107 ## Proposal 108 109 <!-- 110 This is where we get down to the specifics of what the proposal actually is. 111 This should have enough detail that reviewers can understand exactly what 112 you're proposing, but should not include things like API designs or 113 implementation. What is the desired outcome and how do we measure success?. 114 The "Design Details" section below is for the real 115 nitty-gritty. 116 --> 117 118 Introduce a new ClusterLevel API to support preemption while borrowing. The 119 API proposes additional configuration to the `.preemption.reclaimWithinCohort` 120 option. 121 122 ### User Stories (Optional) 123 124 <!-- 125 Detail the things that people will be able to do if this KEP is implemented. 126 Include as much detail as possible so that people can understand the "how" of 127 the system. The goal here is to make this feel real for users without getting 128 bogged down. 129 --> 130 131 #### Story 1 132 133 As an administrator of Kueue in my organization I manage the following setup 134 with two teams, `A` and `B`. There is a single cohort, and both teams have a 135 pair of ClusterQueues in the cohort, called `Standard` and `Best-Effort`. There 136 is also one ClusterQueue, called `Shared`, which is shared between the teams. 137 138 All the `Standard` and `Best-Effort` ClusterQueues have `nominalQuota=0`, and 139 `borrowingLimit>0`. The actual quota is exposed by the `Shared` ClusterQueue. 140 The `Shared` ClusterQueue has `nominalQuota>0`, and `borrowingLimit=0`. This 141 setup is depicted by the table: 142 143 | | A Standard | A Best-Effort | B Standard | B Best-Effort | Shared | 144 |-----------------|---------------|---------------|---------------|---------------|---------------| 145 |nominalQuota | 0 | 0 | 0 | 0 | 100 | 146 |borrowingLimit | 100 | 100 | 100 | 100 | 0 | 147 148 The teams send their high-priority (say `>200`) jobs to their `Standard` 149 ClusterQueue, and the low-priority jobs (say `<100`) to their `Best-Effort` 150 ClusterQueue. Jobs sent to the `Standard` ClusterQueues should not be preempted. 151 Jobs in the `Best-Effort` ClusterQueues can be preempted with respect to 152 priorities. 153 154 Note that in this setup every workload is borrowing, so I need to make preemption 155 while borrowing enabled. Additionally, I need a mechanism to say that workloads 156 from the `Standard` ClusterQueues are not preempted. A mechanism to only preempt 157 jobs with priorities below a specified threshold will suffice. 158 159 ### Notes/Constraints/Caveats (Optional) 160 161 <!-- 162 What are the caveats to the proposal? 163 What are some important details that didn't come across above? 164 Go in to as much detail as necessary here. 165 This might be a good place to talk about core concepts and how they relate. 166 --> 167 168 ### Risks and Mitigations 169 170 #### Flopping between preempting pair of workloads 171 172 In a setup as described in (Story 1)[#story-1] consider there are two workloads, 173 one in ClusterQueue A, and one in ClusterQueue B. There is not enough resources 174 in the cohort to run both workloads at the same time. 175 176 If the preemption is allowed for workloads of any priorities 177 (as when `.preemption.reclaimWithinCohort` equals `Any`) 178 then the following flopping happens: workload A preempts workload B, then 179 preempted workload B returns to the queue and preempts workload A to make progress. 180 This repeats, and none of the workloads can make any progress. 181 182 In order to mitigate this risk we only enable the preemption to reclaim 183 quota within cohort, for workloads requiring borrowing when such preempted 184 workloads are of lower priority. This behavior is enabled by setting 185 `.preemption.borrowWithinCohort.policy: LowerPriority`. 186 187 <!-- 188 What are the risks of this proposal, and how do we mitigate? Think broadly. 189 For example, consider both security and how this will impact the larger 190 Kubernetes ecosystem. 191 192 How will security be reviewed, and by whom? 193 194 How will UX be reviewed, and by whom? 195 196 Consider including folks who also work outside the SIG or subproject. 197 --> 198 199 ## Design Details 200 201 ### API 202 203 ```golang 204 // ClusterQueuePreemption contains policies to preempt Workloads from this 205 // ClusterQueue or the ClusterQueue's cohort. 206 type ClusterQueuePreemption struct { 207 ... 208 // borrowWithinCohort provides configuration to allow preemption within 209 // cohort while borrowing. This configuration can only be enabled when 210 // reclaimWithinCohort is not `Never`. 211 BorrowWithinCohort *BorrowWithinCohort `json:"borrowWithinCohort,omitempty"` 212 ... 213 } 214 215 type BorrowWithinCohortPolicy string 216 217 const ( 218 BorrowWithinCohortPolicyNever BorrowWithinCohortPolicy = "Never" 219 BorrowWithinCohortPolicyLowerPriority BorrowWithinCohortPolicy = "LowerPriority" 220 ) 221 222 // BorrowWithinCohort contains configuration which allows to preempt workloads 223 // within cohort while borrowing. 224 type BorrowWithinCohort struct { 225 // policy determines the policy for preemption to reclaim quota within cohort while borrowing. 226 // Possible values are: 227 // - `Never` (default): do not allow for preemption, in other 228 // ClusterQueues within the cohort, for a borrowing workload. 229 // - `LowerPriority`: allow preemption, in other ClusterQueues 230 // within the cohort, for a borrowing workload, but only if 231 // the preempted workloads are of lower priority. 232 // 233 // +kubebuilder:default=Never 234 // +kubebuilder:validation:Enum=Never;LowerPriority 235 Policy BorrowWithinCohortPolicy `json:"policy,omitempty"` 236 237 // maxPriorityThreshold allows to restrict the set of workloads which 238 // might be preempted by a borrowing workload, to only workloads with 239 // priority below or equal the specified level. 240 // 241 // +optional 242 MaxPriorityThreshold *int32 `json:"maxPriorityThreshold,omitempty"` 243 } 244 ``` 245 246 ### Implementation overview 247 248 There are main modifications: 249 - in the [`fitsResourceQuota`](https://github.com/kubernetes-sigs/kueue/blob/main/pkg/scheduler/flavorassigner/flavorassigner.go#L551) 250 we return `FlavorAssignmentMode=Preempt` when the workload does not fit, but 251 there is enough quota within the cohort 252 - modify the [`GetTargets`](https://github.com/kubernetes-sigs/kueue/blob/53d4b657655aff5df3748e65b4226cf11e50eeba/pkg/scheduler/preemption/preemption.go#L77) 253 and the [`minimalPreemptions`](https://github.com/kubernetes-sigs/kueue/blob/53d4b657655aff5df3748e65b4226cf11e50eeba/pkg/scheduler/preemption/preemption.go#L159) functions 254 to only, if borrowing, include candidates which satisfy the new API's constrains. 255 256 Additionally, we add a webhook validation in `clusterqueue_webhook` to 257 verify that `.preemption.reclaimWithinCohort` is not `Never`, when 258 `.preemption.borrowWithinCohort` is enabled. 259 260 <!-- 261 This section should contain enough information that the specifics of your 262 change are understandable. This may include API specs (though not always 263 required) or even code snippets. If there's any ambiguity about HOW your 264 proposal will be implemented, this is the place to discuss them. 265 --> 266 267 ### Test Plan 268 269 <!-- 270 **Note:** *Not required until targeted at a release.* 271 The goal is to ensure that we don't accept enhancements with inadequate testing. 272 273 All code is expected to have adequate tests (eventually with coverage 274 expectations). Please adhere to the [Kubernetes testing guidelines][testing-guidelines] 275 when drafting this test plan. 276 277 [testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md 278 --> 279 280 [x] I/we understand the owners of the involved components may require updates to 281 existing tests to make this code solid enough prior to committing the changes necessary 282 to implement this enhancement. 283 284 ##### Prerequisite testing updates 285 286 <!-- 287 Based on reviewers feedback describe what additional tests need to be added prior 288 implementing this enhancement to ensure the enhancements have also solid foundations. 289 --> 290 291 #### Unit Tests 292 293 <!-- 294 In principle every added code should have complete unit test coverage, so providing 295 the exact set of tests will not bring additional value. 296 However, if complete unit test coverage is not possible, explain the reason of it 297 together with explanation why this is acceptable. 298 --> 299 300 <!-- 301 Additionally, try to enumerate the core package you will be touching 302 to implement this enhancement and provide the current unit coverage for those 303 in the form of: 304 - <package>: <date> - <current test coverage> 305 306 This can inform certain test coverage improvements that we want to do before 307 extending the production code to implement this enhancement. 308 --> 309 310 - `pkg/scheduler/flavorassigner`: `7 Dec 2023` - `79.2%` (to indicate the workload which needs to borrow, that it can also preempt) 311 - `pkg/scheduler/preemption`: `7 Dec 2023` - `94%` (to update the `GetTargets` function, identifying the set of workloads to preempt) 312 313 #### Integration tests 314 315 We are going to cover the following scenarios: 316 - in the setup described in the [Story 1](#story-1), verify: 317 - a workload which require borrowing can preempt a lower priority workload 318 - a workload which require borrowing cannot preempt a higher priority workload 319 - a workload which require borrowing cannot preempt a lower priority (but above the threshold) workload. 320 - a workload which require borrowing cannot preempt a lower priority workload if the workload isn't borrowing. 321 322 More scenarios can be considered based on the actual implementation. 323 324 <!-- 325 Describe what tests will be added to ensure proper quality of the enhancement. 326 327 After the implementation PR is merged, add the names of the tests here. 328 --> 329 330 ### Graduation Criteria 331 332 <!-- 333 334 Clearly define what it means for the feature to be implemented and 335 considered stable. 336 337 If the feature you are introducing has high complexity, consider adding graduation 338 milestones with these graduation criteria: 339 - [Maturity levels (`alpha`, `beta`, `stable`)][maturity-levels] 340 - [Feature gate][feature gate] lifecycle 341 - [Deprecation policy][deprecation-policy] 342 343 [feature gate]: https://git.k8s.io/community/contributors/devel/sig-architecture/feature-gates.md 344 [maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions 345 [deprecation-policy]: https://kubernetes.io/docs/reference/using-api/deprecation-policy/ 346 --> 347 348 #### Beta 349 350 - the feature is implemented and documented as opt-in 351 352 #### GA 353 354 - all reported bugs are fixed 355 356 ## Implementation History 357 358 <!-- 359 Major milestones in the lifecycle of a KEP should be tracked in this section. 360 Major milestones might include: 361 - the `Summary` and `Motivation` sections being merged, signaling SIG acceptance 362 - the `Proposal` section being merged, signaling agreement on a proposed design 363 - the date implementation started 364 - the first Kubernetes release where an initial version of the KEP was available 365 - the version of Kubernetes where the KEP graduated to general availability 366 - when the KEP was retired or superseded 367 --> 368 369 ## Drawbacks 370 371 Increased complexity of the preemption logic in Kueue, which is already complex. 372 373 <!-- 374 Why should this KEP _not_ be implemented? 375 --> 376 377 ## Alternatives 378 379 ### Reuse PreemptionPolicy enum 380 381 ```golang 382 type ClusterQueuePreemption struct { 383 ... 384 ReclaimWithinCohortConfig *ReclaimWithinCohortConfig `json:"reclaimWithCohort,omitempty"` 385 ... 386 } 387 388 type ReclaimWithinCohortConfig struct { 389 // +kubebuilder:default=Never 390 // +kubebuilder:validation:Enum=Never;LowerPriority 391 WhileBorrowingPolicy PreemptionPolicy `json:"whileBorrowingPolicy,omitempty"` 392 393 // +optional 394 PriorityThreshold *int32 `json:"priorityThreshold,omitempty"` 395 } 396 ``` 397 398 **Reasons for discarding/deferring** 399 400 The existing type has some extra options, such as `Any` which would need to be 401 rejected by validation, this is not ideal from user's perspective. 402 403 ### Minimal API 404 405 Since we only have one policy `LowerPriority`, the minimal API would be just 406 to have a boolean toggle with the `Enable` field. For example: 407 408 ```golang 409 type ReclaimWithinCohortConfig struct { 410 // +kubebuilder:default=false 411 EnableWhileBorrowing bool `json:"enableWhileBorrowing,omitempty"` 412 413 // +optional 414 PriorityThreshold *int32 `json:"priorityThreshold,omitempty"` 415 } 416 ``` 417 418 Pros: 419 - simpler, yet provides the same functionality 420 421 **Reasons for discarding/deferring** 422 423 The `EnableWhileBorrowing` field might become redundant if we introduce the `Policy` field in the future. 424 425 There might be a need to introduce `Any` option as for other preemption modes in the future. 426 Note that, the same risk as [Flopping between preempting pair of workloads](#flopping-between-preempting-pair-of-workloads) 427 would also applies when `preemption.withinClusterQueue: Any`, yet we offer `Any` 428 as an option. 429 430 ### ReclaimWithinCohortConfig 431 432 ```golang 433 type ClusterQueuePreemption struct { 434 ... 435 ReclaimWithinCohortConfig *ReclaimWithinCohortConfig `json:"reclaimWithCohort,omitempty"` 436 ... 437 } 438 439 type ReclaimWithinCohortWhileBorrowingPolicy string 440 441 const ( 442 ReclaimWithinCohortWhileBorrowingPolicyNever ReclaimWithinCohortWhileBorrowingPolicy = "Never" 443 ReclaimWithinCohortWhileBorrowingPolicyLowerPriority ReclaimWithinCohortWhileBorrowingPolicy = "LowerPriority" 444 ) 445 446 type ReclaimWithinCohortConfig struct { 447 // +kubebuilder:default=Never 448 // +kubebuilder:validation:Enum=Never;LowerPriority 449 WhileBorrowingPolicy ReclaimWithinCohortWhileBorrowingPolicy `json:"whileBorrowingPolicy,omitempty"` 450 451 // +optional 452 PriorityThreshold *int32 `json:"priorityThreshold,omitempty"` 453 } 454 ``` 455 456 **Reasons for discarding/deferring** 457 458 The field names for `ReclaimWithinCohortConfig` and `WhileBorrowingPolicy` are lengthy. 459 460 ### Move ReclaimWithinCohort under ReclaimWithinCohortConfig 461 462 We could consider moving the existing enum field `.preemption.reclaimWithinCohort` 463 under the new configuration structure: 464 465 ```golang 466 type ReclaimWithinCohortConfig struct { 467 // +kubebuilder:default=Never 468 // +kubebuilder:validation:Enum=Never;LowerPriority;Any 469 Policy PreemptionPolicy `json:"policy,omitempty"` 470 471 // +kubebuilder:default=Never 472 // +kubebuilder:validation:Enum=Never;LowerPriority 473 WhileBorrowingPolicy PreemptionPolicy `json:"whileBorrowingPolicy,omitempty"` 474 475 // +optional 476 PriorityThreshold *int32 `json:"priorityThreshold,omitempty"` 477 } 478 ``` 479 480 This might help to make the API easier to understand by putting the all the 481 configuration options for `reclaimWithinCohort` together. 482 483 **Reasons for discarding/deferring** 484 485 Kueue API is already Beta, and we would need a process to deprecate the existing 486 API. It seems that the gain to effort ratio is not good.