# KEP-349: All-or-nothing semantics for job resource assignment

<!--
This is the title of your KEP. Keep it short, simple, and descriptive. A good
title can help communicate what the KEP is and should be considered as part of
any review.
-->

<!--
A table of contents is helpful for quickly jumping to sections of a KEP and for
highlighting any additional information provided beyond the standard KEP
template.

Ensure the TOC is wrapped with
  <code>&lt;!-- toc --&gt;&lt;!-- /toc --&gt;</code>
tags, and then generate with `hack/update-toc.sh`.
-->

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories (Optional)](#user-stories-optional)
    - [Story 1](#story-1)
  - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Kueue Configuration API](#kueue-configuration-api)
  - [PodsReady workload condition](#podsready-workload-condition)
  - [Waiting for PodsReady condition](#waiting-for-podsready-condition)
  - [Timeout on reaching the PodsReady condition](#timeout-on-reaching-the-podsready-condition)
  - [Test Plan](#test-plan)
    - [Prerequisite testing updates](#prerequisite-testing-updates)
    - [Unit Tests](#unit-tests)
    - [Integration tests](#integration-tests)
  - [Graduation Criteria](#graduation-criteria)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
  - [Delay job start instead of workload admission](#delay-job-start-instead-of-workload-admission)
  - [Pod Resource Reservation](#pod-resource-reservation)
  - [More granular configuration to enable the mechanism](#more-granular-configuration-to-enable-the-mechanism)
<!-- /toc -->

## Summary

This proposal introduces an opt-in mechanism to ensure that a job gets its
physical resources assigned once it is unsuspended by Kueue.

<!--
This section is incredibly important for producing high-quality, user-focused
documentation such as release notes or a development roadmap. It should be
possible to collect this information before implementation begins, in order to
avoid requiring implementors to split their attention between writing release
notes and implementing the feature itself. KEP editors and SIG Docs
should help to ensure that the tone and content of the `Summary` section is
useful for a wide audience.

A good summary is probably at least a paragraph in length.

Both in this section and below, follow the guidelines of the [documentation
style guide]. In particular, wrap lines to a reasonable length, to make it
easier for reviewers to cite specific portions, and to minimize diff churn on
updates.

[documentation style guide]: https://github.com/kubernetes/community/blob/master/contributors/guide/style-guide.md
-->

## Motivation

Some jobs need all pods to be running at the same time to make progress, for
example, when they require pod-to-pod communication. In that case, a pair of
large jobs may deadlock if the resources actually provisioned fall short of
the configured cluster quota. The same pair of jobs could run to
completion if their pods were scheduled sequentially.

<!--
This section is for explicitly listing the motivation, goals, and non-goals of
this KEP. Describe why the change is important and the benefits to users. The
motivation section can optionally provide links to [experience reports] to
demonstrate the interest in a KEP within the wider Kubernetes community.
[experience reports]: https://github.com/golang/go/wiki/ExperienceReports
-->

### Goals

- a mechanism to ensure that a job gets its physical resources assigned when
  it is unsuspended by Kueue
- a timeout on a Job getting its physical resources assigned, measured from
  the time it is unsuspended by Kueue

<!--
List the specific goals of the KEP. What is it trying to achieve? How will we
know that this has succeeded?
-->

### Non-Goals

- guarantee that two jobs would not schedule pods concurrently. Example
  scenarios in which two jobs may still concurrently schedule their pods:
  - when succeeded pods are replaced with new ones because the job's parallelism is less than its completions;
  - when a failed pod gets replaced

<!--
What is out of scope for this KEP? Listing non-goals helps to focus discussion
and make progress.
-->

## Proposal

We introduce a mechanism to ensure jobs get their physical resources
assigned by avoiding concurrent scheduling of their pods. More precisely, we
block admission of new workloads until the first batch of pods for the
unsuspended job is scheduled. This behavior is opt-in at the level of
the Kueue configuration.

<!--
This is where we get down to the specifics of what the proposal actually is.
This should have enough detail that reviewers can understand exactly what
you're proposing, but should not include things like API designs or
implementation. What is the desired outcome and how do we measure success?
The "Design Details" section below is for the real
nitty-gritty.
-->

### User Stories (Optional)

<!--
Detail the things that people will be able to do if this KEP is implemented.
Include as much detail as possible so that people can understand the "how" of
the system. The goal here is to make this feel real for users without getting
bogged down.
-->

#### Story 1

As a Kueue administrator I want to ensure that two or more Jobs, which require
all pods to be running at the same time, would not deadlock when scheduling
their pods. This could happen when node provisioning fails to match
the configured cluster queue quota and the Jobs don't specify priorities
(or specify the same priority).

My use case can be supported by enabling `waitForPodsReady` in the Kueue
configuration.

### Notes/Constraints/Caveats (Optional)

<!--
What are the caveats to the proposal?
What are some important details that didn't come across above?
Go in to as much detail as necessary here.
This might be a good place to talk about core concepts and how they relate.
-->

### Risks and Mitigations

If a workload fails to schedule its pods, it could block admission of other
workloads indefinitely.

To mitigate this issue we introduce a timeout for a workload to reach the
`PodsReady` condition, measured from its job start (see:
[Timeout on reaching the PodsReady condition](#timeout-on-reaching-the-podsready-condition)).

<!--
What are the risks of this proposal, and how do we mitigate? Think broadly.
For example, consider both security and how this will impact the larger
Kubernetes ecosystem.

How will security be reviewed, and by whom?

How will UX be reviewed, and by whom?

Consider including folks who also work outside the SIG or subproject.
-->

## Design Details

<!--
This section should contain enough information that the specifics of your
change are understandable. This may include API specs (though not always
required) or even code snippets. If there's any ambiguity about HOW your
proposal will be implemented, this is the place to discuss them.
-->

### Kueue Configuration API

We extend the global Kueue Configuration API with a new field,
`waitForPodsReady`, to opt in to and configure the new behavior.

```golang
// Configuration is the Schema for the kueueconfigurations API
type Configuration struct {
	...
	// WaitForPodsReady is configuration for waitForPodsReady
	WaitForPodsReady *WaitForPodsReady `json:"waitForPodsReady,omitempty"`
}

type WaitForPodsReady struct {
	// Enable when true, indicates that each admitted workload
	// blocks admission of other workloads in the cluster, until it is in the
	// `PodsReady` condition. If false, all workloads start as soon as they are
	// admitted and do not block admission of other workloads. The PodsReady
	// condition is only added if this setting is enabled. If unspecified,
	// it defaults to false.
	Enable *bool `json:"enable,omitempty"`

	// TimeoutSeconds defines the optional maximum time, in seconds and
	// relative to job.status.startTime, that an admitted workload may take
	// to reach the PodsReady condition.
	// After exceeding the timeout the corresponding job gets suspended again
	// and moved to the ClusterQueue's inadmissibleWorkloads list. The timeout is
	// enforced only if waitForPodsReady.enable=true. If unspecified, it defaults to 5min.
	// +optional
	TimeoutSeconds *int64 `json:"timeoutSeconds,omitempty"`
}
```

### PodsReady workload condition

We introduce a new workload condition, called `PodsReady`, to indicate
if the workload's startup requirements are satisfied. More precisely, we add
the condition when `job.status.ready + job.status.succeeded` is greater than or
equal to `job.spec.parallelism`.

Note that we don't take failed pods into account when verifying if the
`PodsReady` condition should be added.
However, a buggy admitted workload is eventually
eliminated, because the corresponding job fails once it exceeds its
`.spec.backoffLimit`.

The `PodsReady` condition is added to the workload by Kueue's Job
Controller in reaction to a status update of the corresponding Job. Note that
verifying if the condition should be added does not require an extra API call, as
Kueue's Job Controller already fetches the latest Job object at the
beginning of the `Reconcile` function.

This condition is added only when `waitForPodsReady` is enabled in the
Kueue configuration.

### Waiting for PodsReady condition

When the mechanism is enabled, for each admitted workload Kueue's scheduler
blocks admission of queued workloads until the workload has the `PodsReady`
condition. Kueue's scheduler verifies the workload state by a lookup in the
cache of admitted workloads.

Note that because the mechanism is enabled for all workloads, when a workload
gets admitted, all other admitted workloads are already in the `PodsReady`
condition, so the corresponding job is unsuspended without further waiting.

### Timeout on reaching the PodsReady condition

We introduce a timeout, defined in the `waitForPodsReady.timeoutSeconds` field,
on reaching the `PodsReady` condition, measured from the time the job
is unsuspended (the time of unsuspending a job is marked by the Job's
`job.status.startTime` field). When the timeout is exceeded, Kueue's Job
Controller suspends the Job corresponding to the workload and puts the workload
into the ClusterQueue's `inadmissibleWorkloads` list. The timeout is enforced
only when `waitForPodsReady` is enabled.

### Test Plan

<!--
**Note:** *Not required until targeted at a release.*
The goal is to ensure that we don't accept enhancements with inadequate testing.

All code is expected to have adequate tests (eventually with coverage
expectations).
Please adhere to the [Kubernetes testing guidelines][testing-guidelines]
when drafting this test plan.

[testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md
-->

[x] I understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

##### Prerequisite testing updates

<!--
Based on reviewers feedback describe what additional tests need to be added prior
implementing this enhancement to ensure the enhancements have also solid foundations.
-->

We consider the unit test coverage of `pkg/scheduler` and `pkg/cache` to
be sufficient as a prerequisite for development.

There is no unit test coverage for the `pkg/controller/workload/job` package,
but it is thoroughly tested at the integration level. Some unit tests on the
path of creating a workload based on a job, which will be modified in this work,
might be added depending on the reviewers' feedback.

#### Unit Tests

<!--
In principle every added code should have complete unit test coverage, so providing
the exact set of tests will not bring additional value.
However, if complete unit test coverage is not possible, explain the reason of it
together with explanation why this is acceptable.
-->

<!--
Additionally, try to enumerate the core package you will be touching
to implement this enhancement and provide the current unit coverage for those
in the form of:
- <package>: <date> - <current test coverage>

This can inform certain test coverage improvements that we want to do before
extending the production code to implement this enhancement.
-->

- `pkg/scheduler`: `25 Nov 2022` - `91.0%`
- `pkg/cache`: `25 Nov 2022` - `83.1%`
- `pkg/controller/workload/job`: `25 Nov 2022` - `0%`

#### Integration tests

The following scenarios will be covered with integration tests when `waitForPodsReady` is enabled:
- no workloads are admitted if there is already an admitted workload which is not in the `PodsReady` condition
- a workload gets admitted if all other admitted workloads are in the `PodsReady` condition
- a workload which exceeds the `waitForPodsReady.timeoutSeconds` timeout is suspended and put into the `inadmissibleWorkloads` list

<!--
Describe what tests will be added to ensure proper quality of the enhancement.

After the implementation PR is merged, add the names of the tests here.
-->

### Graduation Criteria

<!--

Clearly define what it means for the feature to be implemented and
considered stable.

If the feature you are introducing has high complexity, consider adding graduation
milestones with these graduation criteria:
- [Maturity levels (`alpha`, `beta`, `stable`)][maturity-levels]
- [Feature gate][feature gate] lifecycle
- [Deprecation policy][deprecation-policy]

[feature gate]: https://git.k8s.io/community/contributors/devel/sig-architecture/feature-gates.md
[maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions
[deprecation-policy]: https://kubernetes.io/docs/reference/using-api/deprecation-policy/
-->

N/A

## Implementation History

<!--
Major milestones in the lifecycle of a KEP should be tracked in this section.
Major milestones might include:
- the `Summary` and `Motivation` sections being merged, signaling SIG acceptance
- the `Proposal` section being merged, signaling agreement on a proposed design
- the date implementation started
- the first Kubernetes release where an initial version of the KEP was available
- the version of Kubernetes where the KEP graduated to general availability
- when the KEP was retired or superseded
-->

## Drawbacks

Delaying workload admission until all pods are scheduled may decrease
throughput significantly, especially if there is enough resource capacity
that could otherwise be used to start multiple jobs at the same time.

## Alternatives

<!--
What other approaches did you consider, and why did you rule them out? These do
not need to be as detailed as the proposal, but should include enough
information to express the idea and why it was not acceptable.
-->

#### Delay job start instead of workload admission

When a workload is nominated, its admission is blocked (rejected) until all the
already admitted workloads are in the `PodsReady` condition. Instead, we could
admit the workload, but delay its job start until the condition is satisfied.

**Reasons for discarding/deferring**

It would leak the implementation details of Kueue scheduling to the Kueue job
controller.

#### Pod Resource Reservation

Pod Resource Reservation (https://docs.google.com/document/d/1sbFUA_9qWtorJkcukNULr12FKX6lMvISiINxAURHNFo/edit#)
is another mechanism, currently under discussion, that could ensure all pods get
the resources assigned.

**Reasons for discarding/deferring**

The mechanism is in an early design phase and requires changes to core Kubernetes,
meaning that it is at least 8 months away from being available by default in Kubernetes
(two release cycles, for Alpha and Beta versions).
While this might be a viable
long-term solution, we aim for a solution which can be adopted by users much
earlier. Additionally, in this work we aim to introduce APIs which will be easy
to adapt in the future to use a different underlying mechanism.

#### More granular configuration to enable the mechanism

Allowing users to opt in to this feature at more granular levels of the
Kueue API (Job level, LocalQueue, ClusterQueue, ResourceFlavor) would increase
admission throughput.

One considered option is to enable the feature per Job with a Job annotation;
however, it would increase the surface of the Job API.

Another possibility is to use LocalQueue for defaulting of the opt-in setting
for workloads submitted to the local queue. However, as with the Job level,
the mechanism may not be necessary when the Job is admitted to a resource
flavor which does not require node provisioning. In that case, one can argue the
mechanism should be opted in at neither the Job nor the LocalQueue level.

A further option is to opt in to waiting for pods ready at the ResourceFlavor
level, to allow concurrent pod scheduling if the underlying resources don't require
provisioning. Here, one concern is that it would make the implementation
more involved: ResourceFlavors are assigned during workload admission, so
it is not admission that would be blocked, but the unsuspending of the Job itself.
This could in turn complicate Kueue's Job Controller, which is responsible for Job
unsuspending.

**Reasons for discarding/deferring**

The support for all-or-nothing scheduling is likely to evolve in the future,
allowing it to be enabled at more granular levels of the API; however, it remains
unclear which level would best satisfy user needs long-term. Thus, we want
to keep the API commitments small for now.