# KEP-83: Workload preemption

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories (Optional)](#user-stories-optional)
    - [Story 1](#story-1)
  - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
    - [Why no control to opt-out a ClusterQueue from preemption](#why-no-control-to-opt-out-a-clusterqueue-from-preemption)
  - [Reassigning flavors after preemption](#reassigning-flavors-after-preemption)
  - [Risks and Mitigations](#risks-and-mitigations)
    - [Workload preemption doesn't imply immediate Pod termination](#workload-preemption-doesnt-imply-immediate-pod-termination)
  - [Increased admission latency](#increased-admission-latency)
- [Design Details](#design-details)
  - [ClusterQueue API changes](#clusterqueue-api-changes)
  - [Changes in scheduling algorithm](#changes-in-scheduling-algorithm)
    - [Detecting Workloads that might benefit from preemption](#detecting-workloads-that-might-benefit-from-preemption)
    - [Sorting Workloads that are heads of ClusterQueues](#sorting-workloads-that-are-heads-of-clusterqueues)
    - [Admission](#admission)
    - [Preemption](#preemption)
  - [Test Plan](#test-plan)
      - [Prerequisite testing updates](#prerequisite-testing-updates)
    - [Unit Tests](#unit-tests)
    - [Integration tests](#integration-tests)
    - [E2E tests](#e2e-tests)
  - [Graduation Criteria](#graduation-criteria)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
  - [Allow high priority jobs to borrow quota while preempting](#allow-high-priority-jobs-to-borrow-quota-while-preempting)
  - [Inform how costly it is to interrupt a Workload](#inform-how-costly-it-is-to-interrupt-a-workload)
  - [Penalizing long running workloads](#penalizing-long-running-workloads)
  - [Terminating Workloads on preemption](#terminating-workloads-on-preemption)
  - [Extra knobs in ClusterQueue preemption policy](#extra-knobs-in-clusterqueue-preemption-policy)
<!-- /toc -->

## Summary

This enhancement introduces workload preemption, a mechanism to suspend
workloads when:
- ClusterQueues under their minimum quota need the resources that are currently
  borrowed by other ClusterQueues in the cohort. Alternatively, we say that the
  ClusterQueue needs to _reclaim_ its quota.
- Within a ClusterQueue, there are running Workloads with lower priority than
  a pending Workload.

API fields in the ClusterQueue spec determine preemption policies.

## Motivation

When ClusterQueues under their minimum quota lend resources, they should
be able to recover those resources quickly, so that they can admit Workloads
during sudden spikes. Similarly, a ClusterQueue should be able to
recover quota from low priority workloads that are currently running.

Currently, the only mechanism to recover those resources is to wait for
Workloads to finish, which can take an unbounded amount of time.

### Goals

- Preempt Workloads from ClusterQueues borrowing resources when other
  ClusterQueues in the cohort, under their minimum quota, need the resources.
- Preempt Workloads within a ClusterQueue when a high priority Workload doesn't
  fit in the available quota, independently of borrowed quota.
- Introduce API fields in ClusterQueue to control when preemption occurs.

### Non-Goals

- Graceful termination of Workloads is left to the workload pods to implement.
- Tracking usage by workloads that take an arbitrary time to be suspended; for
  example, the integration with Job uses the [suspend field](https://kubernetes.io/docs/concepts/workloads/controllers/job/#suspending-a-job).
  See [Workload preemption doesn't imply immediate Pod termination](#workload-preemption-doesnt-imply-immediate-pod-termination) to learn more.
- Partial workload preemption.
- Terminating workloads on preemption.
- Penalizing workloads with the same priority that have been running for a long
  time.

## Proposal

This enhancement proposes the introduction of a field in the ClusterQueue to
determine the preemption policy for two scenarios:
- Reclaiming quota: a pending workload fits in the quota that is currently
  borrowed by other ClusterQueues in the cohort.
- Pending high priority Workload: the ClusterQueue is out of quota, but there
  are low priority active Workloads.

The enhancement also includes an algorithm for selecting a set of Workloads to
be preempted from the ClusterQueue or the cohort (to reclaim borrowed quota).

### User Stories (Optional)

#### Story 1

As a cluster administrator, I want to control preemption of active Workloads
within the ClusterQueue and/or cohort to accommodate a pending Workload.

A possible configuration looks like the following:

```yaml
apiVersion: kueue.x-k8s.io/v1alpha2
kind: ClusterQueue
metadata:
  name: cluster-total
spec:
  preemption:
    withinCohort: ReclaimFromAny
    withinClusterQueue: LowerPriority
```

### Notes/Constraints/Caveats (Optional)

#### Why no control to opt-out a ClusterQueue from preemption

In a cohort, some ClusterQueues could have high priority Workloads running, so
it might be desirable not to disturb them.

However, this can be achieved in two ways:
- Configuring the ClusterQueue with high priority Workloads to never borrow
  (through `.quota.max`), while owning a big part or all of the quota for the
  cohort.
- Configuring other ClusterQueues to not preempt workloads in the cohort when
  reclaiming, or to only do so for incoming workloads that have higher priority
  than the running workloads. In other words, the control is on the
  ClusterQueue that is lending the resources, rather than the borrower.

### Reassigning flavors after preemption

When a Job is first admitted, Kueue's job controller modifies its pod template
to inject a node selector coming from the ResourceFlavor.

On preemption, the job controller resets the template back to the original
nodeSelector, stored in the Workload spec ([implementation](https://github.com/kubernetes-sigs/kueue/blob/f24c63accaad461dfe582b21819dbf3a5d75dd60/pkg/controller/workload/job/job_controller.go#L246-251)).

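A rough sketch of that reset, simplified from the linked implementation (the
`original` map stands for the node selector saved in the Workload's pod
template):

```golang
import (
  "maps"

  batchv1 "k8s.io/api/batch/v1"
)

// restoreNodeSelector undoes the flavor injection on preemption: the Job's
// pod template gets back the node selector it had before admission.
func restoreNodeSelector(job *batchv1.Job, original map[string]string) {
  job.Spec.Template.Spec.NodeSelector = maps.Clone(original)
}
```
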
### Risks and Mitigations

#### Workload preemption doesn't imply immediate Pod termination

When Kueue issues a Workload preemption, the workload API integration controller
is expected to start removing Pods.
In the case of Kubernetes' batch.v1/Job, the following steps happen:
1. Kueue's job controller sets the
   [`.spec.suspend`](https://kubernetes.io/docs/concepts/workloads/controllers/job/#suspending-a-job)
   field to true.
2. The Kubernetes job controller deletes the Job's Pods.
3. The kubelets send SIGTERM signals to the Pods' containers, which can
   implement graceful termination logic.

This implies the following:
- Pods of a workload could implement checkpointing as part of their graceful
  termination.
- The resources from these Pods are not immediately available, and releasing
  them could be arbitrarily delayed.
- While Pods are terminating, a ClusterQueue's quota could be oversubscribed.

The Kubernetes Job status includes the number of Pending/Running Pods that are
not terminating (that is, they don't have a `.metadata.deletionTimestamp`). We
could use this information and write the old admission spec into an annotation
to keep track of usage from non-terminating Pods, but this is left for future
work.

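A hypothetical sketch of that deferred tracking (names like `perPodRequest`
are illustrative; the value would come from the admission spec saved in an
annotation):

```golang
import (
  batchv1 "k8s.io/api/batch/v1"
  corev1 "k8s.io/api/core/v1"
  "k8s.io/apimachinery/pkg/api/resource"
)

// residualUsage estimates the quota still held by a preempted Job:
// .status.active counts Pending/Running Pods without a deletionTimestamp.
func residualUsage(job *batchv1.Job, perPodRequest corev1.ResourceList) corev1.ResourceList {
  active := int64(job.Status.Active)
  usage := corev1.ResourceList{}
  for name, qty := range perPodRequest {
    usage[name] = *resource.NewMilliQuantity(qty.MilliValue()*active, qty.Format)
  }
  return usage
}
```
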
### Increased admission latency

Calculating and executing preemption is expensive. Potentially, every
workload might benefit from preemption of running Workloads.

To mitigate this, we will keep track of the minimum priority among the running
Workloads in a ClusterQueue. If that minimum priority is higher than or equal
to the priority of the incoming Workload, we skip preemption for it altogether.

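A minimal sketch of this fast path, using a simplified stand-in for the
ClusterQueue snapshot:

```golang
// clusterQueueSnapshot is a simplified stand-in; minRunningPriority is
// assumed to be maintained incrementally as Workloads are admitted and
// finish.
type clusterQueueSnapshot struct {
  minRunningPriority int32
}

// canBenefitFromPreemption skips the preemption pass when no running
// Workload has strictly lower priority than the incoming one.
func canBenefitFromPreemption(incomingPriority int32, cq *clusterQueueSnapshot) bool {
  return incomingPriority > cq.minRunningPriority
}
```
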
The assumption is that workloads with low priority are more common than
workloads with higher priority, and that Workloads are sent to ClusterQueues
where most Workloads have the same priority.

Additionally, the preemption algorithm is mostly a linear pass over the running
workloads (plus sorting), so it doesn't add a significant complexity overhead
over building the scheduling snapshot every cycle.

The API updates from preemption will be executed in parallel.

## Design Details

The proposal consists of new API fields and a preemption algorithm.

### ClusterQueue API changes

The new API fields in ClusterQueue describe how to influence the selection
of Workloads to preempt.

```golang
type ClusterQueueSpec struct {
  ...
  // preemption describes policies to preempt Workloads from this ClusterQueue
  // or the ClusterQueue's cohort.
  //
  // Preemption can happen in two scenarios:
  //
  // - When a Workload fits within the min quota of the ClusterQueue, but the
  //   quota is currently borrowed by other ClusterQueues in the cohort.
  //   Preempting Workloads in other ClusterQueues allows this ClusterQueue to
  //   reclaim its min quota.
  // - When a Workload doesn't fit within the min quota of the ClusterQueue
  //   and there are active Workloads with lower priority.
  //
  // The preemption algorithm tries to find a minimal set of Workloads to
  // preempt to accommodate the pending Workload, preempting Workloads with
  // lower priority first.
  Preemption ClusterQueuePreemption
}

type PreemptionPolicy string

const (
  PreemptionPolicyNever                    PreemptionPolicy = "Never"
  PreemptionPolicyReclaimFromLowerPriority PreemptionPolicy = "ReclaimFromLowerPriority"
  PreemptionPolicyReclaimFromAny           PreemptionPolicy = "ReclaimFromAny"
  PreemptionPolicyLowerPriority            PreemptionPolicy = "LowerPriority"
)

type ClusterQueuePreemption struct {
  // withinCohort determines whether a pending Workload can preempt Workloads
  // from other ClusterQueues in the cohort that are using more than their min
  // quota.
  // Possible values are:
  // - `Never` (default): do not preempt workloads in the cohort.
  // - `ReclaimFromLowerPriority`: if the pending workload fits within the min
  //   quota of its ClusterQueue, only preempt workloads in the cohort that have
  //   lower priority than the pending Workload.
  // - `ReclaimFromAny`: if the pending workload fits within the min quota of
  //   its ClusterQueue, preempt any workload in the cohort.
  WithinCohort PreemptionPolicy

  // withinClusterQueue determines whether a pending workload that doesn't fit
  // within the min quota for its ClusterQueue can preempt active Workloads in
  // the ClusterQueue.
  // Possible values are:
  // - `Never` (default): do not preempt workloads in the ClusterQueue.
  // - `LowerPriority`: only preempt workloads in the ClusterQueue that have
  //   lower priority than the pending Workload.
  WithinClusterQueue PreemptionPolicy
}
```

### Changes in scheduling algorithm

The following changes in the scheduling algorithm are required to implement
preemption.

#### Detecting Workloads that might benefit from preemption

The first stage during scheduling is to assign flavors to each resource of
a workload.

The algorithm is as follows:

    For each resource (or set of resources with the same flavors), evaluate
    flavors in the order established in the ClusterQueue:

    0. Find a flavor that still has quota in the cohort (borrowing allowed),
       but doesn't surpass the max quota for the CQ. Keep track of whether
       borrowing was needed.
    1. [New step] If no flavor was found, find a flavor that is able to contain
       the request within the min quota of the ClusterQueue. This flavor
       assignment could be satisfied with preemption.

Some highlights:
- A Workload could get flavor assignments at different steps for different
  resources.
- Assignments that require preemption implicitly do not borrow quota.

A flavor assignment from step 1 means that we need to preempt, or wait for
other workloads in the cohort and/or ClusterQueue to finish, to accommodate
this workload.

[#312](https://github.com/kubernetes-sigs/kueue/issues/312) discusses different
strategies to select a flavor.

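A sketch of the two-step assignment for a single resource, using simplified
stand-ins for Kueue's internal quota bookkeeping:

```golang
type flavorQuota struct {
  name        string
  cqUsed      int64 // usage by this ClusterQueue
  cqMin       int64 // min quota of this ClusterQueue
  cqMax       int64 // max quota of this ClusterQueue
  cohortUsed  int64 // usage by the whole cohort
  cohortTotal int64 // sum of min quotas in the cohort
}

type assignmentMode int

const (
  modeNoFit   assignmentMode = iota
  modeFit                    // step 0: fits, possibly borrowing
  modePreempt                // step 1 (new): fits min quota via preemption
)

// assignFlavor evaluates flavors in the ClusterQueue's order.
func assignFlavor(request int64, flavors []flavorQuota) (string, assignmentMode) {
  // Step 0: a flavor with free quota in the cohort, within the CQ's max.
  // (The real algorithm also records whether f.cqUsed+request > f.cqMin,
  // i.e. whether borrowing was needed.)
  for _, f := range flavors {
    if f.cqUsed+request <= f.cqMax && f.cohortUsed+request <= f.cohortTotal {
      return f.name, modeFit
    }
  }
  // Step 1: a flavor whose min quota can contain the request; admitting here
  // requires preempting or waiting for other Workloads.
  for _, f := range flavors {
    if request <= f.cqMin {
      return f.name, modePreempt
    }
  }
  return "", modeNoFit
}
```
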
#### Sorting Workloads that are heads of ClusterQueues

Sorting uses the following criteria:

1. Flavor assignments that don't borrow first.
2. [New criterion] Highest priority first.
3. Older creation timestamp first.

Note that these criteria might put Workloads that require preemption ahead,
because preemption doesn't require borrowing more resources. This is desired,
because preemption to recover quota or to admit high priority Workloads takes
preference over borrowing.

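A sketch of the resulting comparator over queue heads (the `candidate` type is
an illustrative stand-in):

```golang
import "time"

type candidate struct {
  borrows  bool // whether the flavor assignment needs to borrow
  priority int32
  created  time.Time
}

// headLess orders the heads of the ClusterQueues for admission.
func headLess(a, b candidate) bool {
  if a.borrows != b.borrows {
    return !a.borrows // 1. assignments that don't borrow first
  }
  if a.priority != b.priority {
    return a.priority > b.priority // 2. higher priority first
  }
  return a.created.Before(b.created) // 3. older creation timestamp first
}
```
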
#### Admission

When iterating over workloads to be admitted, in the order given by the
previous section, we disallow borrowing in the cohort for the rest of the
current cycle after evaluating a Workload that doesn't require borrowing. This
is the same behavior that we have today, but note that this criterion now
includes Workloads that need preemption, because there is no preemption while
borrowing quota.

This guarantees that, in future cycles, we can admit Workloads that were not
heads of their ClusterQueues in this cycle, but could fit without borrowing in
the next cycle, before lending quota to other ClusterQueues.

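The rule can be sketched as follows (types are simplified stand-ins; `admit`
represents the rest of the admission logic):

```golang
type cohortState struct {
  borrowingDisallowed bool
}

type admissionEntry struct {
  borrows bool // whether the flavor assignment requires borrowing
  cohort  *cohortState
}

func admitHeads(sortedHeads []admissionEntry, admit func(admissionEntry)) {
  for _, e := range sortedHeads {
    if e.borrows && e.cohort.borrowingDisallowed {
      continue // retry in a future cycle
    }
    if !e.borrows {
      // Includes Workloads that need preemption, since preemption
      // never borrows quota.
      e.cohort.borrowingDisallowed = true
    }
    admit(e)
  }
}
```
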
In the past, we only disallowed borrowing in the cohort if we were able to
admit the Workload, because we only kept track of flavor assignments of type 0.
This caused ClusterQueues in the cohort to continue borrowing quota, even if
there were pending Workloads that would fit under the min quota of their
ClusterQueues.

It is actually possible to limit borrowing within the cohort only for the
flavors used by the evaluated Workloads, instead of restricting borrowing for
all the flavors in the cohort. But we will leave this as a possible future
optimization to improve throughput.

#### Preemption

For each Workload that got flavor assignments where preemption could help,
we run the following algorithm:

1. Check whether preemption is allowed and could help.

   We skip preemption if `.preemption.withinCohort=Never` and
   `.preemption.withinClusterQueue=Never`.

2. Obtain a list of candidate Workloads to be preempted.

   1. In the cohort, we only consider ClusterQueues that are currently
      borrowing quota. We restrict the list to Workloads with lower priority
      than the pending Workload if
      `.preemption.withinCohort=ReclaimFromLowerPriority`.
   2. In the ClusterQueue, we only select Workloads with lower priority than
      the pending Workload.

   To quickly list workloads with priority lower than the incoming workload,
   we can keep a priority queue with the priorities of active Workloads in the
   ClusterQueue.

   When going over these sets, we filter out the Workloads that are not using
   the flavors that were selected for the incoming Workload.

   If the list of candidates is empty, skip the rest of the preemption
   algorithm.

3. Sort the Workloads using the following criteria:
   1. Workloads from other ClusterQueues in the cohort first.
   2. Lower priority first.
   3. Shortest running time first.

4. Remove Workloads from the snapshot in the order of the list. Stop removing
   Workloads once the incoming Workload fits within the quota. Skip removing
   more Workloads from a ClusterQueue if its usage is already below its `min`
   quota for all the involved flavors.

   The set of removed Workloads is a sufficient, but possibly larger than
   necessary, set of Workloads to preempt.

5. In the reverse order of the Workloads that were removed, add Workloads back
   as long as the incoming Workload still fits. This gives us a minimal set
   of Workloads to preempt.

6. Preempt the Workloads by clearing `.spec.admission`.
   The Workloads will be requeued by the Workload event handler.

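Steps 4 and 5 can be sketched as follows, over a single-resource snapshot
(per-flavor filtering and the per-ClusterQueue `min` quota check are omitted
for brevity):

```golang
type victim struct {
  name  string
  usage int64
}

type snapshot struct {
  used, capacity, incoming int64
}

func (s *snapshot) fits() bool { return s.used+s.incoming <= s.capacity }

// minimalVictims removes candidates (already sorted by the criteria in
// step 3) until the incoming Workload fits, then re-adds them in reverse
// order while it still fits, yielding a minimal set to preempt.
func minimalVictims(candidates []victim, s *snapshot) []victim {
  var removed []victim
  for _, w := range candidates {
    if s.fits() {
      break
    }
    s.used -= w.usage
    removed = append(removed, w)
  }
  if !s.fits() {
    return nil // preemption cannot help
  }
  var victims []victim
  for i := len(removed) - 1; i >= 0; i-- {
    s.used += removed[i].usage
    if !s.fits() {
      s.used -= removed[i].usage
      victims = append(victims, removed[i])
    }
  }
  return victims
}
```
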
The incoming Workload is not admitted in this cycle. It is requeued and will be
admitted once the changes to the victim Workloads are observed and updated in
the cache.

### Test Plan

<!--
**Note:** *Not required until targeted at a release.*
The goal is to ensure that we don't accept enhancements with inadequate testing.

All code is expected to have adequate tests (eventually with coverage
expectations). Please adhere to the [Kubernetes testing guidelines][testing-guidelines]
when drafting this test plan.

[testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md
-->

[x] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

##### Prerequisite testing updates

- Need to improve coverage of `pkg/queue` up to at least 80%.

#### Unit Tests

<!--
In principle every added code should have complete unit test coverage, so providing
the exact set of tests will not bring additional value.
However, if complete unit test coverage is not possible, explain the reason of it
together with explanation why this is acceptable.
-->

<!--
Additionally, try to enumerate the core package you will be touching
to implement this enhancement and provide the current unit coverage for those
in the form of:
- <package>: <date> - <current test coverage>

This can inform certain test coverage improvements that we want to do before
extending the production code to implement this enhancement.
-->

- `apis/kueue/webhooks`: `2022-11-17` - `72%`
- `pkg/cache`: `2022-11-17` - `83%`
- `pkg/scheduler`: `2022-11-17` - `91%`
- `pkg/queue`: `2022-11-17` - `62%`

#### Integration tests

- No new Workloads in the cohort can borrow when there are running Workloads
  and pending Workloads in a ClusterQueue that fit within their min quota
  (StrictFIFO and BestEffortFIFO).
- Preemption within a ClusterQueue based on priority.
- Preemption within a cohort to reclaim min quota.

#### E2E tests

- Preemption within a ClusterQueue based on priority.

<!--
Describe what tests will be added to ensure proper quality of the enhancement.

After the implementation PR is merged, add the names of the tests here.
-->

### Graduation Criteria

<!--

Clearly define what it means for the feature to be implemented and
considered stable.

If the feature you are introducing has high complexity, consider adding graduation
milestones with these graduation criteria:
- [Maturity levels (`alpha`, `beta`, `stable`)][maturity-levels]
- [Feature gate][feature gate] lifecycle
- [Deprecation policy][deprecation-policy]

[feature gate]: https://git.k8s.io/community/contributors/devel/sig-architecture/feature-gates.md
[maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions
[deprecation-policy]: https://kubernetes.io/docs/reference/using-api/deprecation-policy/
-->

N/A

## Implementation History

<!--
Major milestones in the lifecycle of a KEP should be tracked in this section.
Major milestones might include:
- the `Summary` and `Motivation` sections being merged, signaling SIG acceptance
- the `Proposal` section being merged, signaling agreement on a proposed design
- the date implementation started
- the first Kubernetes release where an initial version of the KEP was available
- the version of Kubernetes where the KEP graduated to general availability
- when the KEP was retired or superseded
-->

1. 2022-09-19: First draft, included multiple knobs.
2. 2022-11-17: Complete proposal with minimal API.

## Drawbacks

<!--
Why should this KEP _not_ be implemented?
-->

Preemption is costly to calculate. However, it's a highly demanded feature.
The API allows preemption to be opt-in.

## Alternatives

The following APIs were initially proposed to enhance the control over
preemption, but they were left out of this KEP for lack of strong use cases.

We might add them back in the future, based on feedback.

### Allow high priority jobs to borrow quota while preempting

The proposed policies for preemption within the cohort require that the
Workload fits within the min quota of the ClusterQueue. In other words, we
don't try to borrow quota when preempting.

It might be desired for higher priority workloads to preempt lower priority
workloads that are borrowing resources, even if it makes the ClusterQueue
borrow resources. This could be added as `.preemption.withinCohort=LowerPriority`.

The implementation could be as follows:

For each ClusterQueue, we consider the usage as the maximum of the min quota
and the actual used quota. Then, we select flavors for the pending workload
based on this simulated usage and run the preemption algorithm.

**Reasons for discarding/deferring**

It's unclear whether this behavior is useful, and it adds complexity.

### Inform how costly it is to interrupt a Workload

A workload might have a known cost of interruption that varies over time.
For example: early in its execution, the Workload hasn't made much progress,
so it can be preempted. Later, the Workload is on the path to making
significant progress, so it's best not to disturb it. Lastly, the Workload is
expected to have made some checkpoints, so it's ok to disturb it.

This could be expressed with the following configuration:

```yaml
apiVersion: kueue.x-k8s.io/v1alpha2
kind: Workload
metadata:
  name: my-workload
spec:
  preemption:
    disruptionCostMilestones:
    - seconds: 60
      cost: 100
    - seconds: 600
      cost: 0
```

The cost is a linear interpolation of the configuration above. A graphical
representation of the cost looks like the following (not to scale):

```
cost

100      __
        /  \___
       /       \___
      /            \___
0    /                 \_
    0    60             600  time
```

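A sketch of the interpolation, assuming an implicit starting milestone of
cost 0 at time 0 and a flat cost after the last milestone (both read off the
chart above):

```golang
type milestone struct {
  seconds int64
  cost    int64
}

// disruptionCost linearly interpolates between consecutive milestones,
// which are assumed to be sorted by strictly increasing seconds.
func disruptionCost(milestones []milestone, elapsedSeconds int64) float64 {
  prev := milestone{seconds: 0, cost: 0}
  for _, m := range milestones {
    if elapsedSeconds <= m.seconds {
      frac := float64(elapsedSeconds-prev.seconds) / float64(m.seconds-prev.seconds)
      return float64(prev.cost) + frac*float64(m.cost-prev.cost)
    }
    prev = m
  }
  return float64(prev.cost)
}
```

With the milestones from the example, `disruptionCost` returns 50 at 30
seconds (ramping up) and 50 again at 330 seconds (ramping down).
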
As a cluster administrator, I can configure default `disruptionCostMilestones`
for certain workloads using webhooks, or set them for all Workloads in a
LocalQueue.

**Reasons for discarding/deferring**

- Users could be incentivized to increase their cost.
- Administrators might not be able to set a default that fits all users.
- The use case in [#83](https://github.com/kubernetes-sigs/kueue/issues/83#issuecomment-1224602577)
  is mostly covered by `ClusterQueue.spec.waitBeforePreemptionSeconds`.

A better approach would be for the workload to actively publish the cost of
interrupting it, but this is an ongoing discussion upstream
([kubernetes/kubernetes#107598](https://issues.k8s.io/107598)).

### Penalizing long running workloads

A variant of the concept of cost to interrupt a workload is a penalty for
Workloads that have been running for a long time, for example, by allowing them
to be preempted by pending Workloads of the same priority after some time.

One way this could be implemented is by introducing a concept of dynamic
priority: the priority of a Workload could increase while it stays pending for
a long time, or it could be reduced as the Workload keeps running.

**Reasons for discarding/deferring**

This can be implemented separately from the preemption APIs and algorithm, with
specialized APIs to control priority. So it can be left for a different KEP.

### Terminating Workloads on preemption

For some Workloads, it's not desired to restart them after preemption without
some manual intervention or verification (for example, interactive jobs).

This behavior could be configured like this:

```yaml
apiVersion: kueue.x-k8s.io/v1alpha2
kind: Workload
metadata:
  name: my-workload
spec:
  onPreemption: Terminate # OR Requeue (default)
```

**Reasons for discarding/deferring**

There is no clean mechanism to terminate a Job and all its running Pods.
There are two means to terminate all running Pods of a Job, but they have
some problems:

1. Delete the Job. The Pods will be deleted (gracefully) in cascade.

   This could mean loss of information for the end-user, unless they have a
   finalizer on the Job. In a sense, it violates
   [`ttlSecondsAfterFinished`](https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/).

2. Just suspend the Job.

   This option leaves a Job that is not finished, so
   [`ttlSecondsAfterFinished`](https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/)
   couldn't clean it up.

   Simply adding a `Failed` condition after suspending the Job could leave its
   Pods running indefinitely if the Kubernetes job controller doesn't have a
   chance to delete all the Pods based on the `.spec.suspend` field.

One possibility is to insert the `FailureTarget` condition in the Job status,
introduced by [KEP#3329](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3329-retriable-and-non-retriable-failures)
for a different purpose.

Perhaps we should have an explicit upstream API for this behavior; similar
work would be needed for workload CRDs.

### Extra knobs in ClusterQueue preemption policy

These extra knobs could enhance the control over preemption:

```golang
type ClusterQueuePreemption struct {
  // triggerAfterWorkloadWaitingSeconds is the time in seconds that Workloads
  // in this ClusterQueue will wait before triggering preemptions of active
  // workloads in this ClusterQueue or its cohort (when reclaiming quota).
  //
  // The time is measured from the first time the workload was attempted for
  // admission. This value is present as the `transitionTimestamp` of the
  // Admitted condition, with status=False.
  TriggerAfterWorkloadWaitingSeconds int64

  // workloadSorting determines how Workloads from the cohort that are
  // candidates for preemption are sorted.
  // Sorting happens at the time when a Workload in this ClusterQueue is
  // evaluated for admission. All the Workloads in the cohort are sorted based
  // on the criteria defined in the preempting ClusterQueue.
  // workloadSorting is a list of comparison criteria between two Workloads
  // that are evaluated in order.
  // Possible criteria are:
  // - ByLowestPriority: Prefer to preempt the Workload with lower priority.
  // - ByLowestRuntime: Prefer to preempt the Workload that started more
  //   recently.
  // - ByLongestRuntime: Prefer to preempt the Workload that started earlier.
  //
  // If empty, the behavior is equivalent to
  // [ByLowestPriority, ByLowestRuntime].
  WorkloadSorting []WorkloadSortingCriteria
}

type WorkloadSortingCriteria string

const (
  ComparisonByLowestPriority WorkloadSortingCriteria = "ByLowestPriority"
  ComparisonByLowestRuntime  WorkloadSortingCriteria = "ByLowestRuntime"
  ComparisonByLongestRuntime WorkloadSortingCriteria = "ByLongestRuntime"
)
```

The proposed field `ClusterQueue.spec.preemption.triggerAfterWorkloadWaitingSeconds`
can be interpreted in two ways:
1. **How long jobs are willing to wait**.
   This shouldn't be problematic. The field can be configured based purely on
   the importance of the Workloads served by the preempting ClusterQueue.
2. **The characteristics of the workloads in the cohort**; for example, how long
   they take to finish or how often they perform checkpointing, on average.
   This implies that all workloads in the cohort have similar characteristics
   and all the ClusterQueues in the cohort should have the same wait period.

This caveat should be part of the documentation as a best practice for how to
set up the field.

**Reasons for discarding/deferring**

The usefulness of the field `triggerAfterWorkloadWaitingSeconds` is somewhat
questionable when the ClusterQueue is saturated (all the workloads require
preemption). If the ClusterQueue is in `BestEffortFIFO` mode, it's possible
that all the elements will trigger preemption once the deadline for at least
one Workload has passed.

For simplicity of the API, we will start with implicit sorting rules.