# KEP-83: Workload preemption

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories (Optional)](#user-stories-optional)
    - [Story 1](#story-1)
  - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
    - [Why no control to opt-out a ClusterQueue from preemption](#why-no-control-to-opt-out-a-clusterqueue-from-preemption)
    - [Reassigning flavors after preemption](#reassigning-flavors-after-preemption)
  - [Risks and Mitigations](#risks-and-mitigations)
    - [Workload preemption doesn't imply immediate Pod termination](#workload-preemption-doesnt-imply-immediate-pod-termination)
    - [Increased admission latency](#increased-admission-latency)
- [Design Details](#design-details)
  - [ClusterQueue API changes](#clusterqueue-api-changes)
  - [Changes in scheduling algorithm](#changes-in-scheduling-algorithm)
    - [Detecting Workloads that might benefit from preemption](#detecting-workloads-that-might-benefit-from-preemption)
    - [Sorting Workloads that are heads of ClusterQueues](#sorting-workloads-that-are-heads-of-clusterqueues)
    - [Admission](#admission)
    - [Preemption](#preemption)
  - [Test Plan](#test-plan)
    - [Prerequisite testing updates](#prerequisite-testing-updates)
    - [Unit Tests](#unit-tests)
    - [Integration tests](#integration-tests)
    - [E2E tests](#e2e-tests)
  - [Graduation Criteria](#graduation-criteria)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
  - [Allow high priority jobs to borrow quota while preempting](#allow-high-priority-jobs-to-borrow-quota-while-preempting)
  - [Inform how costly it is to interrupt a Workload](#inform-how-costly-it-is-to-interrupt-a-workload)
  - [Penalizing long running workloads](#penalizing-long-running-workloads)
  - [Terminating Workloads on preemption](#terminating-workloads-on-preemption)
  - [Extra knobs in ClusterQueue preemption policy](#extra-knobs-in-clusterqueue-preemption-policy)
<!-- /toc -->

## Summary

This enhancement introduces workload preemption, a mechanism to suspend
workloads when:
- ClusterQueues under their minimum quota need the resources that are currently
  borrowed by other ClusterQueues in the cohort. Alternatively, we say that the
  ClusterQueue needs to _reclaim_ its quota.
- Within a ClusterQueue, there are running Workloads with lower priority than
  a pending Workload.

API fields in the ClusterQueue spec determine preemption policies.

## Motivation

When ClusterQueues under their minimum quota lend resources, they should
be able to recover those resources quickly, so that they can admit Workloads
during sudden spikes in demand. Similarly, a ClusterQueue should be able to
recover quota from low priority workloads that are currently running.

Currently, the only mechanism to recover those resources is to wait for
Workloads to finish, which can take an unbounded amount of time.

### Goals

- Preempt Workloads from ClusterQueues borrowing resources when other
  ClusterQueues in the cohort, under their minimum quota, need the resources.
- Preempt Workloads within a ClusterQueue when a high priority Workload doesn't
  fit in the available quota, independently of borrowed quota.
- Introduce API fields in ClusterQueue to control when preemption occurs.
### Non-Goals

- Graceful termination of Workloads is left to the workload pods to implement.
- Tracking usage by workloads that take arbitrary time to be suspended (for
  example, the integration with Job uses the [suspend field](https://kubernetes.io/docs/concepts/workloads/controllers/job/#suspending-a-job)).
  See [Workload preemption doesn't imply immediate Pod termination](#workload-preemption-doesnt-imply-immediate-pod-termination) to learn more.
- Partial workload preemption is not supported.
- Terminating workloads on preemption.
- Penalizing workloads with the same priority that have been running for a long
  time.

## Proposal

This enhancement proposes the introduction of a field in the ClusterQueue to
determine the preemption policy for two scenarios:
- Reclaiming quota: a pending workload fits in the quota that is currently
  borrowed by other ClusterQueues in the cohort.
- Pending high priority Workload: the ClusterQueue is out of quota, but there
  are low priority active Workloads.

The enhancement also includes an algorithm for selecting a set of Workloads to
be preempted from the ClusterQueue or the cohort (to reclaim borrowed quota).

### User Stories (Optional)

#### Story 1

As a cluster administrator, I want to control preemption of active Workloads
within the ClusterQueue and/or cohort to accommodate a pending workload.

A possible configuration looks like the following:

```yaml
apiVersion: kueue.x-k8s.io/v1alpha2
kind: ClusterQueue
metadata:
  name: cluster-total
spec:
  preemption:
    withinCohort: ReclaimFromAny
    withinClusterQueue: LowerPriority
```

### Notes/Constraints/Caveats (Optional)

#### Why no control to opt-out a ClusterQueue from preemption

In a cohort, some ClusterQueues could have high priority Workloads running, so
it might be desirable not to disturb them.

However, this can be achieved in two ways:
- Configuring the ClusterQueue with high priority Workloads to never borrow
  (through `.quota.max`), while owning a big part or all of the quota for the
  cohort.
- Configuring the other ClusterQueues to not preempt workloads in the cohort
  when reclaiming, or to only do so for incoming workloads that have higher
  priority than the running workloads. In other words, the control is on the
  ClusterQueue that is lending the resources, rather than the borrower.

#### Reassigning flavors after preemption

When a Job is first admitted, kueue's job controller modifies its pod template
to inject a node selector coming from the ResourceFlavor.

On preemption, the job controller resets the template back to the original
nodeSelector, stored in the Workload spec ([implementation](https://github.com/kubernetes-sigs/kueue/blob/f24c63accaad461dfe582b21819dbf3a5d75dd60/pkg/controller/workload/job/job_controller.go#L246-251)).

### Risks and Mitigations

#### Workload preemption doesn't imply immediate Pod termination

When Kueue issues a Workload preemption, the workload API integration controller
is expected to start removing Pods.
In the case of Kubernetes' batch.v1/Job, the following steps happen:
1. Kueue's job controller sets the
   [`.spec.suspend`](https://kubernetes.io/docs/concepts/workloads/controllers/job/#suspending-a-job)
   field to true.
2. The Kubernetes job controller deletes the Job's Pods.
3. The kubelets send SIGTERM signals to the Pod's containers, which can
   implement graceful termination logic.
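For reference, a minimal sketch of step 1, assuming a controller-runtime
client; the `suspendJob` helper is hypothetical, but `.spec.suspend` is the
real batch/v1 Job field:

```golang
package preemption

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	"k8s.io/utils/pointer"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// suspendJob is a hypothetical helper that triggers step 1: once
// .spec.suspend is true, the Kubernetes job controller deletes the Job's
// Pods (steps 2 and 3), which may still terminate gracefully.
func suspendJob(ctx context.Context, c client.Client, job *batchv1.Job) error {
	if job.Spec.Suspend != nil && *job.Spec.Suspend {
		return nil // nothing to do, the Job is already suspended
	}
	job.Spec.Suspend = pointer.Bool(true)
	return c.Update(ctx, job)
}
```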
This implies the following:
- Pods of a workload could implement checkpointing as part of their graceful
  termination.
- The resources from these Pods are not immediately available and they could
  be arbitrarily delayed.
- While Pods are terminating, a ClusterQueue's quota could be oversubscribed.

The Kubernetes Job status includes the number of Pending/Running Pods that are
not terminating (that don't have a `.metadata.deletionTimestamp`). We could use
this information and write the old admission spec into an annotation to keep
track of usage from non-terminating Pods. But this will be left for future work.

#### Increased admission latency

Calculating and executing preemption is expensive. Potentially, every
workload might benefit from preemption of running Workloads.

To mitigate this, we will keep track of the minimum priority among the running
Workloads in a ClusterQueue. If this minimum priority is higher than or equal
to the priority of the incoming Workload, we will skip preemption for it
altogether.

The assumption is that workloads with low priority are more common than
workloads with higher priority and that Workloads are sent to ClusterQueues
where most Workloads have the same priority.

Additionally, the preemption algorithm is mostly a linear pass over the running
workloads (plus sorting), so it doesn't add a significant complexity overhead
over building the scheduling snapshot every cycle.

The API updates from preemption will be executed in parallel.

## Design Details

The proposal consists of new API fields and a preemption algorithm.

### ClusterQueue API changes

The new API fields in ClusterQueue describe how to influence the selection
of Workloads to preempt.

```golang
type ClusterQueueSpec struct {
	...
	// preemption describes policies to preempt Workloads from this ClusterQueue
	// or the ClusterQueue's cohort.
	//
	// Preemption can happen in two scenarios:
	//
	// - When a Workload fits within the min quota of the ClusterQueue, but the
	//   quota is currently borrowed by other ClusterQueues in the cohort.
	//   Preempting Workloads in other ClusterQueues allows this ClusterQueue to
	//   reclaim its min quota.
	// - When a Workload doesn't fit within the min quota of the ClusterQueue
	//   and there are active Workloads with lower priority.
	//
	// The preemption algorithm tries to find a minimal set of Workloads to
	// preempt to accommodate the pending Workload, preempting Workloads with
	// lower priority first.
	Preemption ClusterQueuePreemption
}

type PreemptionPolicy string

const (
	PreemptionPolicyNever                    PreemptionPolicy = "Never"
	PreemptionPolicyReclaimFromLowerPriority PreemptionPolicy = "ReclaimFromLowerPriority"
	PreemptionPolicyReclaimFromAny           PreemptionPolicy = "ReclaimFromAny"
	PreemptionPolicyLowerPriority            PreemptionPolicy = "LowerPriority"
)

type ClusterQueuePreemption struct {
	// withinCohort determines whether a pending Workload can preempt Workloads
	// from other ClusterQueues in the cohort that are using more than their min
	// quota.
	// Possible values are:
	// - `Never` (default): do not preempt workloads in the cohort.
	// - `ReclaimFromLowerPriority`: if the pending workload fits within the min
	//   quota of its ClusterQueue, only preempt workloads in the cohort that have
	//   lower priority than the pending Workload.
	// - `ReclaimFromAny`: if the pending workload fits within the min quota of
	//   its ClusterQueue, preempt any workload in the cohort.
	WithinCohort PreemptionPolicy

	// withinClusterQueue determines whether a pending workload that doesn't fit
	// within the min quota of its ClusterQueue can preempt active Workloads in
	// the ClusterQueue.
	// Possible values are:
	// - `Never` (default): do not preempt workloads in the ClusterQueue.
	// - `LowerPriority`: only preempt workloads in the ClusterQueue that have
	//   lower priority than the pending Workload.
	WithinClusterQueue PreemptionPolicy
}
```
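As an illustration, continuing the sketch style of the block above (so not a
standalone program), a ClusterQueue that reclaims its min quota only from
lower-priority Workloads in the cohort and preempts by priority within the
queue would be configured as:

```golang
// Illustrative only: uses the types defined in this KEP.
spec := ClusterQueueSpec{
	Preemption: ClusterQueuePreemption{
		// Reclaim min quota from the cohort, but only by preempting
		// Workloads with lower priority than the pending one.
		WithinCohort: PreemptionPolicyReclaimFromLowerPriority,
		// Within the queue, preempt only lower-priority Workloads.
		WithinClusterQueue: PreemptionPolicyLowerPriority,
	},
}
```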
### Changes in scheduling algorithm

The following changes in the scheduling algorithm are required to implement
preemption.

#### Detecting Workloads that might benefit from preemption

The first stage during scheduling is to assign flavors to each resource of
a workload.

The algorithm is as follows:

For each resource (or set of resources with the same flavors), evaluate
flavors in the order established in the ClusterQueue:

0. Find a flavor that still has quota in the cohort (borrowing allowed),
   but doesn't surpass the max quota for the ClusterQueue. Keep track of
   whether borrowing was needed.
1. [New step] If no flavor was found, find a flavor that is able to contain
   the request within the min quota of the ClusterQueue. This flavor
   assignment could be satisfied with preemption.

Some highlights:
- A Workload could get flavor assignments at different steps for different
  resources.
- Assignments that require preemption implicitly do not borrow quota.

A flavor assignment from step 1 means that we need to preempt or wait for other
workloads in the cohort and/or ClusterQueue to finish to accommodate this
workload.

[#312](https://github.com/kubernetes-sigs/kueue/issues/312) discusses different
strategies to select a flavor.

#### Sorting Workloads that are heads of ClusterQueues

Sorting uses the following criteria:

1. Flavors that don't borrow first.
2. [New criterion] Highest priority first.
3. Older creation timestamp first.

Note that these criteria might put Workloads that require preemption ahead,
because preemption doesn't require borrowing more resources. This is desired,
because preemption to recover quota or to admit high priority Workloads takes
preference over borrowing. A sketch of this ordering follows.
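A minimal sketch of the ordering as a Go comparison function; the `candidate`
type and its fields are illustrative, not Kueue's actual scheduler internals:

```golang
package scheduling

import "time"

// candidate is an illustrative stand-in for a ClusterQueue head and its
// flavor assignment.
type candidate struct {
	borrows  bool      // the flavor assignment requires borrowing
	priority int32     // the Workload's priority
	created  time.Time // the Workload's creation timestamp
}

// less reports whether a should be considered for admission before b.
func less(a, b candidate) bool {
	if a.borrows != b.borrows {
		return !a.borrows // 1. flavors that don't borrow first
	}
	if a.priority != b.priority {
		return a.priority > b.priority // 2. highest priority first
	}
	return a.created.Before(b.created) // 3. older creation timestamp first
}
```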
#### Admission

When iterating over workloads to be admitted, in the order given by the
previous section, we disallow borrowing in the cohort for the rest of the
current cycle after evaluating a Workload that doesn't require borrowing. This
is the same behavior that we have today, but note that this criterion now
includes Workloads that need preemption, because preemption never borrows
quota.

This guarantees that, in future cycles, we can admit Workloads that were not
heads of their ClusterQueues in this cycle, but could fit without borrowing in
the next cycle, before lending quota to other ClusterQueues.

In the past, we only disallowed borrowing in the cohort if we were able to
admit the Workload, because we only kept track of flavor assignments of type 0.
This caused ClusterQueues in the cohort to continue borrowing quota, even if
there were pending Workloads that would fit under the min quota of their
ClusterQueues.

It is actually possible to limit borrowing within the cohort only for the
flavors used by the evaluated Workloads, instead of restricting borrowing for
all the flavors in the cohort. But we will leave this as a possible future
optimization to improve throughput.

#### Preemption

For each Workload that got flavor assignments where preemption could help,
we run the following algorithm:

1. Check whether preemption is allowed and could help.

   We skip preemption if `.preemption.withinCohort=Never` and
   `.preemption.withinClusterQueue=Never`.

2. Obtain a list of candidate Workloads to be preempted.

   1. In the cohort, we only consider ClusterQueues that are currently
      borrowing quota. We restrict the list to Workloads with lower priority
      than the pending Workload if
      `.preemption.withinCohort=ReclaimFromLowerPriority`.
   2. In the ClusterQueue, we only select Workloads with lower priority than
      the pending Workload.

      To quickly list workloads with priority lower than the incoming
      workload, we can keep a priority queue with the priorities of active
      Workloads in the ClusterQueue.

   When going over these sets, we filter out the Workloads that are not using
   the flavors that were selected for the incoming Workload.

   If the list of candidates is empty, skip the rest of the preemption
   algorithm.

3. Sort the Workloads using the following criteria:
   1. Workloads from other ClusterQueues in the cohort first.
   2. Lower priority first.
   3. Shortest running time first.

4. Remove Workloads from the snapshot in the order of the list. Stop removing
   Workloads if the incoming Workload fits within the quota. Skip removing more
   Workloads from a ClusterQueue if its usage is already below its `min` quota
   for all the involved flavors.

   The set of removed Workloads is a maximal set of Workloads that need to be
   preempted.

5. In the reverse order of the Workloads that were removed, add Workloads back
   as long as the incoming Workload still fits. This gives us a minimal set
   of Workloads to preempt.

6. Preempt the Workloads by clearing `.spec.admission`.
   The Workload will be requeued by the Workload event handler.

The incoming Workload is not admitted in this cycle. It is requeued and will
be admitted once the changes to the victim Workloads are observed and updated
in the cache. Steps 4 and 5 are sketched in code below.
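A condensed sketch of steps 4 and 5, assuming a snapshot type with `Fits`,
`Remove`, and `Add` operations; all names here are illustrative rather than
Kueue's actual internals:

```golang
// minimalPreemptionSet takes the candidates already sorted per step 3 and
// returns a minimal set of victims, or nil if preemption cannot help.
func minimalPreemptionSet(snap *Snapshot, incoming Workload, candidates []Workload) []Workload {
	// Step 4: remove candidates until the incoming Workload fits. For
	// brevity, this sketch omits the per-ClusterQueue check that stops
	// removals once usage is below min quota for the involved flavors.
	var removed []Workload
	for _, w := range candidates {
		if snap.Fits(incoming) {
			break
		}
		snap.Remove(w)
		removed = append(removed, w)
	}
	if !snap.Fits(incoming) {
		// Preempting every candidate still would not make room; restore.
		for _, w := range removed {
			snap.Add(w)
		}
		return nil
	}
	// Step 5: in reverse removal order, add Workloads back while the
	// incoming Workload still fits; the ones that cannot be added back
	// form a minimal victim set.
	var victims []Workload
	for i := len(removed) - 1; i >= 0; i-- {
		snap.Add(removed[i])
		if !snap.Fits(incoming) {
			snap.Remove(removed[i])
			victims = append(victims, removed[i])
		}
	}
	return victims
}
```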
### Test Plan

<!--
**Note:** *Not required until targeted at a release.*
The goal is to ensure that we don't accept enhancements with inadequate testing.

All code is expected to have adequate tests (eventually with coverage
expectations). Please adhere to the [Kubernetes testing guidelines][testing-guidelines]
when drafting this test plan.

[testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md
-->

[x] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

#### Prerequisite testing updates

- Need to improve coverage of `pkg/queue` up to at least 80%.

#### Unit Tests

<!--
In principle every added code should have complete unit test coverage, so providing
the exact set of tests will not bring additional value.
However, if complete unit test coverage is not possible, explain the reason of it
together with explanation why this is acceptable.
-->

<!--
Additionally, try to enumerate the core package you will be touching
to implement this enhancement and provide the current unit coverage for those
in the form of:
- <package>: <date> - <current test coverage>

This can inform certain test coverage improvements that we want to do before
extending the production code to implement this enhancement.
-->

- `apis/kueue/webhooks`: `2022-11-17` - `72%`
- `pkg/cache`: `2022-11-17` - `83%`
- `pkg/scheduler`: `2022-11-17` - `91%`
- `pkg/queue`: `2022-11-17` - `62%`

#### Integration tests

- No new workloads in the cohort can borrow when pending workloads in a
  ClusterQueue fit within their min quota, but there are running workloads
  (StrictFIFO and BestEffortFIFO).
- Preemption within a ClusterQueue based on priority.
- Preemption within a cohort to reclaim min quota.

#### E2E tests

- Preemption within a ClusterQueue based on priority.

<!--
Describe what tests will be added to ensure proper quality of the enhancement.

After the implementation PR is merged, add the names of the tests here.
-->

### Graduation Criteria

<!--

Clearly define what it means for the feature to be implemented and
considered stable.

If the feature you are introducing has high complexity, consider adding graduation
milestones with these graduation criteria:
- [Maturity levels (`alpha`, `beta`, `stable`)][maturity-levels]
- [Feature gate][feature gate] lifecycle
- [Deprecation policy][deprecation-policy]

[feature gate]: https://git.k8s.io/community/contributors/devel/sig-architecture/feature-gates.md
[maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions
[deprecation-policy]: https://kubernetes.io/docs/reference/using-api/deprecation-policy/
-->

N/A

## Implementation History

<!--
Major milestones in the lifecycle of a KEP should be tracked in this section.
Major milestones might include:
- the `Summary` and `Motivation` sections being merged, signaling SIG acceptance
- the `Proposal` section being merged, signaling agreement on a proposed design
- the date implementation started
- the first Kubernetes release where an initial version of the KEP was available
- the version of Kubernetes where the KEP graduated to general availability
- when the KEP was retired or superseded
-->

1. 2022-09-19: First draft, included multiple knobs.
2. 2022-11-17: Complete proposal with minimal API.

## Drawbacks

<!--
Why should this KEP _not_ be implemented?
-->

Preemption is costly to calculate. However, it's a highly demanded feature.
The API allows preemption to be opt-in.

## Alternatives

The following APIs were initially proposed to enhance the control over
preemption, but they were left out of this KEP for lack of strong use cases.
We might add them back in the future, based on feedback.

### Allow high priority jobs to borrow quota while preempting

The proposed policies for preemption within the cohort require that the
Workload fits within the min quota of the ClusterQueue. In other words, we
don't try to borrow quota while preempting.

It might be desirable for higher priority workloads to preempt lower priority
workloads that are borrowing resources, even if that makes the ClusterQueue
borrow resources. This could be added as `.preemption.withinCohort=LowerPriority`.

The implementation could be like the following:

For each ClusterQueue, we consider the usage as the maximum of the min quota
and the actual used quota. Then, we select flavors for the pending workload
based on this simulated usage and run the preemption algorithm.

**Reasons for discarding/deferring**

It's unclear whether this behavior is useful, and it adds complexity.

### Inform how costly it is to interrupt a Workload

A workload might have a known cost of interruption that varies over time.
For example:
early in its execution, the Workload hasn't made much progress, so it can be
preempted. Later, the Workload is in the middle of making significant progress,
so it's best not to disturb it. Lastly, the Workload is expected to have made
some checkpoints, so it's ok to disturb it.

This could be expressed with the following configuration:

```yaml
apiVersion: kueue.x-k8s.io/v1alpha2
kind: Workload
metadata:
  name: my-workload
spec:
  preemption:
    disruptionCostMilestones:
    - seconds: 60
      cost: 100
    - seconds: 600
      cost: 0
```

The cost is a linear interpolation of the configuration above. A graphical
representation of the cost looks like the following (not to scale):

```
cost

100      __
        /  \___
       /       \___
      /            \___
  0  /                 \_
     0  60            600   time
```
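A sketch of that interpolation over the hypothetical
`disruptionCostMilestones` entries, assuming milestones are sorted by strictly
increasing `seconds` and that the cost ramps up from 0 at time 0, as in the
diagram:

```golang
package preemption

// Milestone mirrors one entry of the hypothetical disruptionCostMilestones.
type Milestone struct {
	Seconds int64
	Cost    float64
}

// disruptionCost returns the linearly interpolated cost of interrupting a
// Workload that has been running for the given number of seconds. Past the
// last milestone, the cost stays flat.
func disruptionCost(milestones []Milestone, seconds int64) float64 {
	prev := Milestone{Seconds: 0, Cost: 0}
	for _, m := range milestones {
		if seconds <= m.Seconds {
			frac := float64(seconds-prev.Seconds) / float64(m.Seconds-prev.Seconds)
			return prev.Cost + frac*(m.Cost-prev.Cost)
		}
		prev = m
	}
	return prev.Cost
}
```

With the example configuration above, this returns 100 at 60 seconds, 50 at
330 seconds, and 0 from 600 seconds onwards.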
As a cluster administrator, I can configure default `disruptionCostMilestones`
for certain workloads using webhooks, or by setting them for all Workloads in a
LocalQueue.

**Reasons for discarding/deferring**

- Users could be incentivized to increase their cost.
- Administrators might not be able to set a default that fits all users.
- The use case in [#83](https://github.com/kubernetes-sigs/kueue/issues/83#issuecomment-1224602577)
  is mostly covered by `ClusterQueue.spec.waitBeforePreemptionSeconds`.

A better approach would be for the workload to actively publish the cost of
interrupting it, but this is an ongoing discussion upstream: https://issues.k8s.io/107598

### Penalizing long running workloads

A variant of the concept of cost to interrupt a workload is a penalty for
Workloads that have been running for a long time. For example, by allowing them
to be preempted by pending Workloads of the same priority after some time.

One way this could be implemented is by introducing a concept of dynamic
priority: the priority of a Workload could increase while it stays pending for
a long time, or it could be reduced as the Workload keeps running.

**Reasons for discarding/deferring**

This can be implemented separately from the preemption APIs and algorithm, with
specialized APIs to control priority. So it can be left for a different KEP.

### Terminating Workloads on preemption

For some Workloads, it's not desirable to restart them after preemption without
some manual intervention or verification (for example, interactive jobs).

This behavior could be configured like this:

```yaml
apiVersion: kueue.x-k8s.io/v1alpha2
kind: Workload
metadata:
  name: my-workload
spec:
  onPreemption: Terminate # OR Requeue (default)
```

**Reasons for discarding/deferring**

There is no clean mechanism to terminate a Job and all its running Pods.
There are two means to terminate all running Pods of a Job, but they have
some problems:

1. Delete the Job. The Pods will be deleted (gracefully) in cascade.

   This could mean loss of information for the end-user, unless they have a
   finalizer on the Job. In a sense, it violates
   [`ttlSecondsAfterFinished`](https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/).

2. Just suspend the Job.

   This option leaves a Job that never finishes, so `ttlSecondsAfterFinished`
   couldn't clean it up.

   Simply adding a `Failed` condition after suspending the Job could leave its
   Pods running indefinitely if the Kubernetes job controller doesn't have a
   chance to delete all the Pods based on the `.spec.suspend` field.

One possibility is to insert the `FailureTarget` condition in the Job status,
introduced by [KEP#3329](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3329-retriable-and-non-retriable-failures)
for a different purpose.

Perhaps we should have an explicit API for this behavior, but it needs to be
done upstream. Similar work would need to be done for workload CRDs.

### Extra knobs in ClusterQueue preemption policy

These extra knobs could enhance the control over preemption:

```golang
type ClusterQueuePreemption struct {
	// triggerAfterWorkloadWaitingSeconds is the time in seconds that Workloads
	// in this ClusterQueue will wait before triggering preemptions of active
	// workloads in this ClusterQueue or its cohort (when reclaiming quota).
	//
	// The time is measured from the first time the workload was attempted for
	// admission. This value is present as the `lastTransitionTime` of the
	// Admitted condition, with status=False.
	TriggerAfterWorkloadWaitingSeconds int64

	// workloadSorting determines how Workloads from the cohort that are
	// candidates for preemption are sorted.
	// Sorting happens at the time when a Workload in this ClusterQueue is
	// evaluated for admission. All the Workloads in the cohort are sorted based
	// on the criteria defined in the preempting ClusterQueue.
	// workloadSorting is a list of comparison criteria between two Workloads
	// that are evaluated in order.
	// Possible criteria are:
	// - ByLowestPriority: Prefer to preempt the Workload with lower priority.
	// - ByLowestRuntime: Prefer to preempt the Workload that started more
	//   recently.
	// - ByLongestRuntime: Prefer to preempt the Workload that started earlier.
	//
	// If empty, the behavior is equivalent to
	// [ByLowestPriority, ByLowestRuntime].
	WorkloadSorting []WorkloadComparison
}

type WorkloadComparison string

const (
	ComparisonByLowestPriority WorkloadComparison = "ByLowestPriority"
	ComparisonByLowestRuntime  WorkloadComparison = "ByLowestRuntime"
	ComparisonByLongestRuntime WorkloadComparison = "ByLongestRuntime"
)
```
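A sketch of how such an ordered list of criteria could be evaluated: each
criterion either decides the order or defers to the next one. The `Workload`
stand-in and its fields are illustrative:

```golang
package preemption

import "time"

// Workload here is an illustrative stand-in with just the fields needed
// for sorting candidates.
type Workload struct {
	Priority  int32
	StartTime time.Time
}

// preemptFirst reports whether a should be preempted before b, applying the
// configured criteria in order.
func preemptFirst(sorting []WorkloadComparison, a, b Workload) bool {
	for _, c := range sorting {
		switch c {
		case ComparisonByLowestPriority:
			if a.Priority != b.Priority {
				return a.Priority < b.Priority // lower priority preempted first
			}
		case ComparisonByLowestRuntime:
			if !a.StartTime.Equal(b.StartTime) {
				return a.StartTime.After(b.StartTime) // started more recently
			}
		case ComparisonByLongestRuntime:
			if !a.StartTime.Equal(b.StartTime) {
				return a.StartTime.Before(b.StartTime) // started earlier
			}
		}
	}
	return false // tie under all criteria
}
```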
The proposed field `ClusterQueue.spec.preemption.triggerAfterWorkloadWaitingSeconds`
can be interpreted in two ways:
1. **How long jobs are willing to wait.**
   This shouldn't be problematic. The field can be configured based purely on
   the importance of the Workloads served by the preempting ClusterQueue.
2. **The characteristics of the workloads in the cohort**; for example, how
   long they take to finish or how often they perform checkpointing, on
   average. This implies that all workloads in the cohort have similar
   characteristics and that all the ClusterQueues in the cohort should have
   the same wait period.

This caveat should be part of the documentation as a best practice for how to
set up the field.

**Reasons for discarding/deferring**

The usefulness of the field `triggerAfterWorkloadWaitingSeconds` is somewhat
questionable when the ClusterQueue is saturated (all the workloads require
preemption). If the ClusterQueue is in `BestEffortFIFO` mode, it's possible
that all the Workloads will trigger preemption once the deadline for at least
one Workload is reached.

For simplicity of the API, we will start with implicit sorting rules.