sigs.k8s.io/kueue@v0.6.2/keps/349-all-or-nothing/README.md

# KEP-349: All-or-nothing semantics for job resource assignment

- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories (Optional)](#user-stories-optional)
    - [Story 1](#story-1)
  - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Kueue Configuration API](#kueue-configuration-api)
  - [PodsReady workload condition](#podsready-workload-condition)
  - [Waiting for PodsReady condition](#waiting-for-podsready-condition)
  - [Timeout on reaching the PodsReady condition](#timeout-on-reaching-the-podsready-condition)
  - [Test Plan](#test-plan)
    - [Prerequisite testing updates](#prerequisite-testing-updates)
    - [Unit Tests](#unit-tests)
    - [Integration tests](#integration-tests)
  - [Graduation Criteria](#graduation-criteria)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
  - [Delay job start instead of workload admission](#delay-job-start-instead-of-workload-admission)
  - [Pod Resource Reservation](#pod-resource-reservation)
  - [More granular configuration to enable the mechanism](#more-granular-configuration-to-enable-the-mechanism)

## Summary

This proposal introduces an opt-in mechanism to ensure that a job gets its
physical resources assigned once it is unsuspended by Kueue.

## Motivation

Some jobs need all pods to be running at the same time to make progress, for
example, when they require pod-to-pod communication. In that case, a pair of
large jobs may deadlock if resource provisioning fails to match the configured
cluster quota, with each job holding part of the available capacity while
waiting for the rest. The same pair of jobs could run to completion if their
pods were scheduled sequentially.

### Goals

- a mechanism to ensure that a job gets assigned physical resources when
  unsuspended by Kueue
- a timeout on a Job getting its physical resources assigned, measured from
  the moment it is unsuspended by Kueue

### Non-Goals

- guarantee that two jobs would not schedule pods concurrently. Example
  scenarios in which two jobs may still concurrently schedule their pods:
  - when succeeded pods are replaced with new ones because the job's parallelism is less than its completions;
  - when a failed pod gets replaced

## Proposal

We introduce a mechanism to ensure jobs get their physical resources
assigned by avoiding concurrent scheduling of their pods. More precisely, we
block admission of new workloads until the first batch of pods for the
unsuspended job is scheduled. This behavior is opt-in and is enabled at the
level of the Kueue configuration.

### User Stories (Optional)

#### Story 1

As a Kueue administrator I want to ensure that two or more Jobs, which require
all pods to be running at the same time, would not deadlock when scheduling
their pods. This could happen when node provisioning fails to match the
configured cluster queue quota and the Jobs don't specify priorities
(or specify the same priority).

My use case can be supported by enabling `waitForPodsReady` in the Kueue
configuration.

### Notes/Constraints/Caveats (Optional)

### Risks and Mitigations

If a workload fails to schedule its pods it could block admission of other
workloads indefinitely.

To mitigate this issue, we introduce a timeout for a workload to reach the
`PodsReady` condition, measured from its job start (see:
[Timeout on reaching the PodsReady condition](#timeout-on-reaching-the-podsready-condition)).

## Design Details

### Kueue Configuration API

We extend the global Kueue Configuration API with a new field,
`waitForPodsReady`, to opt in to and configure the new behavior.

```golang
// Configuration is the Schema for the kueueconfigurations API
type Configuration struct {
	// ...

	// WaitForPodsReady is configuration for waitForPodsReady
	WaitForPodsReady *WaitForPodsReady `json:"waitForPodsReady,omitempty"`
}

type WaitForPodsReady struct {
	// Enable when true, indicates that each admitted workload
	// blocks admission of other workloads in the cluster, until it is in the
	// `PodsReady` condition. If false, all workloads start as soon as they are
	// admitted and do not block admission of other workloads. The PodsReady
	// condition is only added if this setting is enabled. If unspecified,
	// it defaults to false.
	Enable *bool `json:"enable,omitempty"`

	// timeoutSeconds defines the optional time, in seconds and relative to
	// job.status.startTime, that an admitted workload may take to reach
	// the PodsReady condition.
	// After exceeding the timeout the corresponding job gets suspended again
	// and moved to the ClusterQueue's inadmissibleWorkloads list. The timeout is
	// enforced only if waitForPodsReady.enable=true. If unspecified, it defaults to 5min.
	// +optional
	TimeoutSeconds *int64 `json:"timeoutSeconds,omitempty"`
}
```
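
The defaulting rules stated in the field comments above could be sketched as follows. This is only an illustration under stated assumptions, not Kueue's actual defaulting code: the `setDefaults` helper and the `defaultTimeoutSeconds` constant are hypothetical names, and the struct is a trimmed copy of the type above without JSON tags.

```go
package main

// WaitForPodsReady is a trimmed copy of the configuration type above.
type WaitForPodsReady struct {
	Enable         *bool
	TimeoutSeconds *int64
}

// defaultTimeoutSeconds is a hypothetical constant holding the documented
// default of 5 minutes.
const defaultTimeoutSeconds int64 = 5 * 60

// setDefaults is a hypothetical helper that fills unspecified fields with
// the documented defaults: enable=false and timeoutSeconds=300.
// Fields that are already set are left untouched.
func setDefaults(w *WaitForPodsReady) {
	if w.Enable == nil {
		enable := false
		w.Enable = &enable
	}
	if w.TimeoutSeconds == nil {
		timeout := defaultTimeoutSeconds
		w.TimeoutSeconds = &timeout
	}
}
```

Keeping the fields as pointers lets such a helper distinguish "unspecified" from an explicit `false` or an explicit timeout value.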

### PodsReady workload condition

We introduce a new workload condition, called `PodsReady`, to indicate
if the workload's startup requirements are satisfied. More precisely, we add
the condition when `job.status.ready + job.status.succeeded` is greater than or
equal to `job.spec.parallelism`.
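
The check can be sketched as a small predicate. This is a simplified illustration rather than the controller's code: `jobCounts` and `podsReady` are hypothetical names, and plain integers stand in for the Job fields (in the real Job API some of these fields are optional pointers).

```go
package main

// jobCounts is a hypothetical, trimmed stand-in for the Job fields that the
// PodsReady condition depends on.
type jobCounts struct {
	Parallelism int32 // job.spec.parallelism
	Ready       int32 // job.status.ready
	Succeeded   int32 // job.status.succeeded
}

// podsReady reports whether the PodsReady condition should be added:
// ready + succeeded >= parallelism. Failed pods are not counted.
func podsReady(j jobCounts) bool {
	return j.Ready+j.Succeeded >= j.Parallelism
}
```

For example, a job with parallelism 3 that has 2 ready pods and 1 succeeded pod satisfies the condition, while the same job with 2 ready pods and no succeeded pods does not.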

Note that we don't take failed pods into account when verifying if the
`PodsReady` condition should be added. However, a buggy admitted workload is
eventually eliminated, as the corresponding job fails once it exceeds the
`.spec.backoffLimit` limit.

The `PodsReady` condition is added to the workload by Kueue's Job
Controller in reaction to a status update of the corresponding Job. Note that
verifying if the condition should be added does not require an extra API call, as
Kueue's Job Controller already fetches the latest Job object at the
beginning of the `Reconcile` function.

This condition is added only when `waitForPodsReady` is enabled in the
Kueue configuration.
243 ### Waiting for PodsReady condition
244
245 When the mechanism is enabled, for each admitted workload Kueue's scheduler
246 blocks admission of queued workloads until the workload has the `PodsReady`
247 condition. Kueue's scheduler verifies the workload state by a lookup to the
248 cache of admitted workloads.
249
250 Note that, because the mechanism is enabled for all workloads, when a workload
251 gets admitted, all other admitted workloads are already in the `PodsReady`
252 condition, so the corresponding job is unsuspended without further waiting.
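
The admission gate described above can be sketched as a scan over the admitted workloads. This is a minimal sketch, assuming a simplified cache entry: `admittedWorkload` and `canAdmitNext` are hypothetical names and do not reflect the actual structure of Kueue's cache or scheduler.

```go
package main

// admittedWorkload is a hypothetical, simplified view of an entry in the
// scheduler's cache of admitted workloads.
type admittedWorkload struct {
	Name      string
	PodsReady bool // true once the workload has the PodsReady condition
}

// canAdmitNext reports whether the scheduler may admit another workload.
// When waitForPodsReady is disabled, admission is never blocked. Otherwise,
// every admitted workload must already have the PodsReady condition.
func canAdmitNext(waitForPodsReady bool, admitted []admittedWorkload) bool {
	if !waitForPodsReady {
		return true
	}
	for _, w := range admitted {
		if !w.PodsReady {
			return false
		}
	}
	return true
}
```

With an empty set of admitted workloads the function returns true, so the first workload is admitted without waiting.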

### Timeout on reaching the PodsReady condition

We introduce a timeout, defined in the `waitForPodsReady.timeoutSeconds` field,
on reaching the `PodsReady` condition since the job is unsuspended (the time of
unsuspending a job is marked by the Job's `job.status.startTime` field). When
the timeout is exceeded, Kueue's Job Controller suspends the Job corresponding
to the workload and puts the workload into the ClusterQueue's
`inadmissibleWorkloads` list. The timeout is enforced only when
`waitForPodsReady` is enabled.
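
The timeout check itself reduces to comparing the current time with the start time plus the configured timeout. The following is a sketch under stated assumptions: `jobState` and `podsReadyTimedOut` are hypothetical names, and suspending the Job and requeueing the workload are left out.

```go
package main

import "time"

// jobState is a hypothetical, trimmed view of the fields used by the check.
type jobState struct {
	StartTime time.Time // job.status.startTime, set when the job is unsuspended
	PodsReady bool      // whether the workload has the PodsReady condition
}

// podsReadyTimedOut reports whether the job failed to reach the PodsReady
// condition within timeoutSeconds of its start time, as observed at now.
func podsReadyTimedOut(j jobState, now time.Time, timeoutSeconds int64) bool {
	if j.PodsReady {
		return false
	}
	deadline := j.StartTime.Add(time.Duration(timeoutSeconds) * time.Second)
	return now.After(deadline)
}
```

Passing `now` explicitly, instead of calling `time.Now()` inside the function, keeps the check deterministic and easy to test.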

### Test Plan

[x] I understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

##### Prerequisite testing updates

We consider the unit test coverage of `pkg/scheduler` and `pkg/cache` to
be sufficient as a prerequisite for development.

There is no unit test coverage for the `pkg/controller/workload/job` package,
but it is thoroughly tested at the integration level. Some unit tests on the
path of creating a workload based on a job, which will be modified in this work,
might be added depending on the reviewers' feedback.

#### Unit Tests

- `pkg/scheduler`: `25 Nov 2022` - `91.0%`
- `pkg/cache`: `25 Nov 2022` - `83.1%`
- `pkg/controller/workload/job`: `25 Nov 2022` - `0%`

#### Integration tests

The following scenarios will be covered with integration tests when `waitForPodsReady` is enabled:
- no workloads are admitted if there is already an admitted workload which is not in the `PodsReady` condition
- a workload gets admitted if all other admitted workloads are in the `PodsReady` condition
- a workload which exceeds the `waitForPodsReady.timeoutSeconds` timeout is suspended and put into the `inadmissibleWorkloads` list

### Graduation Criteria

N/A

## Implementation History

## Drawbacks

Delaying workload admission until all pods are scheduled may decrease
throughput significantly, especially if there is enough resource capacity
that could otherwise be used to start multiple jobs at the same time.

## Alternatives

#### Delay job start instead of workload admission

In this proposal, when a workload is nominated, its admission is blocked
(rejected) until all the already admitted workloads are in the `PodsReady`
condition. Instead, we could admit the workload, but delay its job start until
the condition is satisfied.

**Reasons for discarding/deferring**

It would leak the implementation details of Kueue scheduling to the Kueue job
controller.

#### Pod Resource Reservation

Pod Resource Reservation (https://docs.google.com/document/d/1sbFUA_9qWtorJkcukNULr12FKX6lMvISiINxAURHNFo/edit#)
is another mechanism, currently under discussion, that could ensure all pods get
the resources assigned.

**Reasons for discarding/deferring**

The mechanism is in an early design phase and requires changes to core Kubernetes,
meaning that it would take at least 8 months to become available by default in
Kubernetes (two release cycles, for Alpha and Beta versions). While this might be
a viable long-term solution, we aim for a solution which can be adopted by users
much earlier. Additionally, in this work we aim to introduce APIs which will be
easy to adapt in the future to use a different underlying mechanism.

#### More granular configuration to enable the mechanism

Allowing users to opt in to this feature at more granular levels of the
Kueue API (Job level, LocalQueue, ClusterQueue, ResourceFlavor) would increase
admission throughput.

One considered option is to enable the feature per Job with a Job annotation.
However, it would increase the surface of the Job API.

Another possibility is to use LocalQueue for defaulting of the opt-in setting
for workloads submitted to the local queue. As with the Job level, the
mechanism may not be necessary when the Job is admitted to a resource
flavor which does not require node provisioning. In that case, one can argue the
mechanism should be opted in at neither the Job nor the LocalQueue level.

A further option is to opt in to waiting for pods ready at the ResourceFlavor
level, to allow concurrent pod scheduling if the underlying resources don't require
provisioning. Here, one concern is that it would make the implementation
more involved: ResourceFlavors are assigned during workload admission, so
it is not admission that would be blocked, but the unsuspending of the Job
itself. This could in turn complicate Kueue's Job Controller, which is
responsible for unsuspending Jobs.

**Reasons for discarding/deferring**

The support for all-or-nothing scheduling is likely to evolve in the future,
allowing it to be enabled at more granular levels of the API. However, it remains
unclear which level would best satisfy user needs long-term. Thus, we want
to keep the API commitments small for now.