# KEP-349: All-or-nothing semantics for job resource assignment

<!--
This is the title of your KEP. Keep it short, simple, and descriptive. A good
title can help communicate what the KEP is and should be considered as part of
any review.
-->

<!--
A table of contents is helpful for quickly jumping to sections of a KEP and for
highlighting any additional information provided beyond the standard KEP
template.

Ensure the TOC is wrapped with
  <code>&lt;!-- toc --&gt;&lt;!-- /toc --&gt;</code>
tags, and then generate with `hack/update-toc.sh`.
-->

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories (Optional)](#user-stories-optional)
    - [Story 1](#story-1)
  - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Kueue Configuration API](#kueue-configuration-api)
  - [PodsReady workload condition](#podsready-workload-condition)
  - [Waiting for PodsReady condition](#waiting-for-podsready-condition)
  - [Timeout on reaching the PodsReady condition](#timeout-on-reaching-the-podsready-condition)
  - [Test Plan](#test-plan)
      - [Prerequisite testing updates](#prerequisite-testing-updates)
    - [Unit Tests](#unit-tests)
    - [Integration tests](#integration-tests)
  - [Graduation Criteria](#graduation-criteria)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
    - [Delay job start instead of workload admission](#delay-job-start-instead-of-workload-admission)
    - [Pod Resource Reservation](#pod-resource-reservation)
    - [More granular configuration to enable the mechanism](#more-granular-configuration-to-enable-the-mechanism)
<!-- /toc -->

## Summary

This proposal introduces an opt-in mechanism to ensure that a job gets its
physical resources assigned once it is unsuspended by Kueue.

<!--
This section is incredibly important for producing high-quality, user-focused
documentation such as release notes or a development roadmap. It should be
possible to collect this information before implementation begins, in order to
avoid requiring implementors to split their attention between writing release
notes and implementing the feature itself. KEP editors and SIG Docs
should help to ensure that the tone and content of the `Summary` section is
useful for a wide audience.

A good summary is probably at least a paragraph in length.

Both in this section and below, follow the guidelines of the [documentation
style guide]. In particular, wrap lines to a reasonable length, to make it
easier for reviewers to cite specific portions, and to minimize diff churn on
updates.

[documentation style guide]: https://github.com/kubernetes/community/blob/master/contributors/guide/style-guide.md
-->

## Motivation

Some jobs need all pods to be running at the same time to make progress, for
example, when they require pod-to-pod communication. In that case, a pair of
large jobs may deadlock if resource provisioning issues prevent the cluster
from matching the configured quota: each job starts only some of its pods and
waits indefinitely for the rest. The same pair of jobs could run to
completion if their pods were scheduled sequentially.

<!--
This section is for explicitly listing the motivation, goals, and non-goals of
this KEP.  Describe why the change is important and the benefits to users. The
motivation section can optionally provide links to [experience reports] to
demonstrate the interest in a KEP within the wider Kubernetes community.

[experience reports]: https://github.com/golang/go/wiki/ExperienceReports
-->

### Goals

- a mechanism to ensure that a job gets assigned physical resources when
unsuspended by Kueue
- a timeout on a Job getting its physical resources assigned, measured from the
time it is unsuspended by Kueue

<!--
List the specific goals of the KEP. What is it trying to achieve? How will we
know that this has succeeded?
-->

### Non-Goals

- guarantee that two jobs would not schedule pods concurrently. Example
scenarios in which two jobs may still concurrently schedule their pods:
  - when succeeded pods are replaced with new ones because the job's parallelism is less than its completions;
  - when a failed pod gets replaced

<!--
What is out of scope for this KEP? Listing non-goals helps to focus discussion
and make progress.
-->

## Proposal

We introduce a mechanism to ensure jobs get their physical resources
assigned by avoiding concurrent scheduling of their pods. More precisely, we
block admission of new workloads until the first batch of pods for the
unsuspended job is scheduled. This behavior is opt-in at the level of
the Kueue configuration.

<!--
This is where we get down to the specifics of what the proposal actually is.
This should have enough detail that reviewers can understand exactly what
you're proposing, but should not include things like API designs or
implementation. What is the desired outcome and how do we measure success?
The "Design Details" section below is for the real
nitty-gritty.
-->

### User Stories (Optional)

<!--
Detail the things that people will be able to do if this KEP is implemented.
Include as much detail as possible so that people can understand the "how" of
the system. The goal here is to make this feel real for users without getting
bogged down.
-->

#### Story 1

As a Kueue administrator I want to ensure that two or more Jobs, which require
all pods to be running at the same time, would not deadlock when scheduling
their pods. This could happen when node provisioning issues prevent the cluster
from matching the configured cluster queue quota and the Jobs don't specify
priorities (or specify the same priority).

My use case can be supported by enabling `waitForPodsReady` in the Kueue
configuration.

### Notes/Constraints/Caveats (Optional)

<!--
What are the caveats to the proposal?
What are some important details that didn't come across above?
Go in to as much detail as necessary here.
This might be a good place to talk about core concepts and how they relate.
-->

### Risks and Mitigations

If a workload fails to schedule its pods, it could block admission of other
workloads indefinitely.

To mitigate this issue, we introduce a timeout on a workload reaching the
`PodsReady` condition, measured from the start of its job (see:
[Timeout on reaching the PodsReady condition](#timeout-on-reaching-the-podsready-condition)).

<!--
What are the risks of this proposal, and how do we mitigate? Think broadly.
For example, consider both security and how this will impact the larger
Kubernetes ecosystem.

How will security be reviewed, and by whom?

How will UX be reviewed, and by whom?

Consider including folks who also work outside the SIG or subproject.
-->

## Design Details

<!--
This section should contain enough information that the specifics of your
change are understandable. This may include API specs (though not always
required) or even code snippets. If there's any ambiguity about HOW your
proposal will be implemented, this is the place to discuss them.
-->

### Kueue Configuration API

We extend the global Kueue Configuration API with a new field,
`waitForPodsReady`, to opt in to and configure the new behavior.

```golang
// Configuration is the Schema for the kueueconfigurations API
type Configuration struct {
	// ... (existing fields elided)

	// WaitForPodsReady is configuration for waitForPodsReady
	WaitForPodsReady *WaitForPodsReady `json:"waitForPodsReady,omitempty"`
}

type WaitForPodsReady struct {
	// Enable, when true, indicates that each admitted workload
	// blocks admission of other workloads in the cluster until it is in the
	// `PodsReady` condition. If false, all workloads start as soon as they are
	// admitted and do not block admission of other workloads. The PodsReady
	// condition is only added if this setting is enabled. If unspecified,
	// it defaults to false.
	Enable *bool `json:"enable,omitempty"`

	// TimeoutSeconds defines the optional time duration, in seconds and
	// relative to job.status.startTime, that an admitted workload can take
	// to reach the PodsReady condition.
	// After exceeding the timeout, the corresponding job gets suspended again
	// and moved to the ClusterQueue's inadmissibleWorkloads list. The timeout is
	// enforced only if waitForPodsReady.enable=true. If unspecified, it defaults to 5min.
	// +optional
	TimeoutSeconds *int64 `json:"timeoutSeconds,omitempty"`
}
```

### PodsReady workload condition

We introduce a new workload condition, called `PodsReady`, to indicate
whether the workload's startup requirements are satisfied. More precisely, we add
the condition when `job.status.ready + job.status.succeeded` is greater than or
equal to `job.spec.parallelism`.

Note that we don't take failed pods into account when verifying whether the
`PodsReady` condition should be added. However, a buggy admitted workload is
eventually eliminated, because the corresponding job fails once it exceeds the
`.spec.backoffLimit` limit.
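
The check described above can be sketched as a small predicate. The `jobCounts` type is a simplified, hypothetical stand-in for the relevant `batch/v1` Job fields, and treating an unset parallelism as 1 and an unset ready count as 0 are assumptions of this sketch rather than statements about Kueue's implementation.

```go
package main

import "fmt"

// jobCounts holds only the Job fields the condition depends on.
// It is a simplified stand-in for the batch/v1 Job type.
type jobCounts struct {
	Parallelism *int32 // job.spec.parallelism
	Ready       *int32 // job.status.ready
	Succeeded   int32  // job.status.succeeded
}

// podsReady reports whether ready + succeeded >= parallelism.
// Failed pods are deliberately not part of the calculation.
func podsReady(j jobCounts) bool {
	parallelism := int32(1) // assumption: an unset parallelism is treated as 1
	if j.Parallelism != nil {
		parallelism = *j.Parallelism
	}
	ready := int32(0) // assumption: an unset ready count is treated as 0
	if j.Ready != nil {
		ready = *j.Ready
	}
	return ready+j.Succeeded >= parallelism
}

// ptr returns a pointer to v, for building example values.
func ptr(v int32) *int32 { return &v }

func main() {
	// 2 ready + 0 succeeded < 3: the condition is not added yet.
	fmt.Println(podsReady(jobCounts{Parallelism: ptr(3), Ready: ptr(2)}))
	// 2 ready + 1 succeeded >= 3: the condition is added.
	fmt.Println(podsReady(jobCounts{Parallelism: ptr(3), Ready: ptr(2), Succeeded: 1}))
}
```

Counting succeeded pods alongside ready ones means a job whose first pods finish quickly still reaches the condition.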

The `PodsReady` condition is added to the workload by Kueue's Job
Controller in reaction to a status update of the corresponding Job. Note that
verifying whether the condition should be added does not require an extra API
call, as Kueue's Job Controller already fetches the latest Job object at the
beginning of the `Reconcile` function.

This condition is added only when `waitForPodsReady` is enabled in the
Kueue configuration.

### Waiting for PodsReady condition

When the mechanism is enabled, for each admitted workload, Kueue's scheduler
blocks admission of queued workloads until that workload has the `PodsReady`
condition. Kueue's scheduler verifies the workload state by a lookup in the
cache of admitted workloads.

Note that because the mechanism is enabled for all workloads, when a workload
gets admitted, all other admitted workloads are already in the `PodsReady`
condition, so the corresponding job is unsuspended without further waiting.
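
A minimal sketch of this admission gate follows. The `admittedWorkload` type and the `canAdmit` and `blockingWorkloads` functions are hypothetical names chosen for illustration; Kueue's scheduler and cache expose different types, and this only models the rule that admission waits for every admitted workload to have `PodsReady`.

```go
package main

import "fmt"

// admittedWorkload is a hypothetical cache entry for an admitted workload.
type admittedWorkload struct {
	Name      string
	PodsReady bool
}

// blockingWorkloads returns the names of admitted workloads that do not
// yet have the PodsReady condition.
func blockingWorkloads(admitted []admittedWorkload) []string {
	var blocking []string
	for _, w := range admitted {
		if !w.PodsReady {
			blocking = append(blocking, w.Name)
		}
	}
	return blocking
}

// canAdmit reports whether a queued workload may be admitted. When the
// mechanism is disabled, admission is never blocked by this check.
func canAdmit(waitForPodsReady bool, admitted []admittedWorkload) bool {
	if !waitForPodsReady {
		return true
	}
	return len(blockingWorkloads(admitted)) == 0
}

func main() {
	admitted := []admittedWorkload{
		{Name: "job-a", PodsReady: true},
		{Name: "job-b", PodsReady: false},
	}
	// job-b is still waiting for its pods, so admission is blocked.
	fmt.Println(canAdmit(true, admitted), blockingWorkloads(admitted))

	admitted[1].PodsReady = true
	// All admitted workloads are PodsReady, so the next one can be admitted.
	fmt.Println(canAdmit(true, admitted))
}
```

Because the check only reads cached state, it adds no API calls to the scheduling cycle.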

### Timeout on reaching the PodsReady condition

We introduce a timeout, defined in the `waitForPodsReady.timeoutSeconds` field,
on reaching the `PodsReady` condition, measured from the time the job
is unsuspended (marked by the Job's `job.status.startTime` field). When the
timeout is exceeded, Kueue's Job Controller suspends the Job corresponding to
the workload and puts the workload into the
ClusterQueue's `inadmissibleWorkloads` list. The timeout is enforced only when
`waitForPodsReady` is enabled.

### Test Plan

<!--
**Note:** *Not required until targeted at a release.*
The goal is to ensure that we don't accept enhancements with inadequate testing.

All code is expected to have adequate tests (eventually with coverage
expectations). Please adhere to the [Kubernetes testing guidelines][testing-guidelines]
when drafting this test plan.

[testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md
-->

[x] I understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

##### Prerequisite testing updates

<!--
Based on reviewers feedback describe what additional tests need to be added prior
implementing this enhancement to ensure the enhancements have also solid foundations.
-->

We consider the unit test coverage of `pkg/scheduler` and `pkg/cache` to
be sufficient as a prerequisite for development.

There is no unit test coverage for the `pkg/controller/workload/job` package,
but it is thoroughly tested at the integration level. Some unit tests on the
path of creating a workload based on a job, which will be modified in this work,
might be added depending on the reviewers' feedback.

#### Unit Tests

<!--
In principle every added code should have complete unit test coverage, so providing
the exact set of tests will not bring additional value.
However, if complete unit test coverage is not possible, explain the reason of it
together with explanation why this is acceptable.
-->

<!--
Additionally, try to enumerate the core package you will be touching
to implement this enhancement and provide the current unit coverage for those
in the form of:
- <package>: <date> - <current test coverage>

This can inform certain test coverage improvements that we want to do before
extending the production code to implement this enhancement.
-->

- `pkg/scheduler`: `25 Nov 2022` - `91.0%`
- `pkg/cache`: `25 Nov 2022` - `83.1%`
- `pkg/controller/workload/job`: `25 Nov 2022` - `0%`

#### Integration tests

The following scenarios will be covered with integration tests when `waitForPodsReady` is enabled:

- no workloads are admitted if there is already an admitted workload which is not in the `PodsReady` condition
- a workload gets admitted if all other admitted workloads are in the `PodsReady` condition
- a workload which exceeds the `waitForPodsReady.timeoutSeconds` timeout is suspended and put into the `inadmissibleWorkloads` list

<!--
Describe what tests will be added to ensure proper quality of the enhancement.

After the implementation PR is merged, add the names of the tests here.
-->

### Graduation Criteria

<!--

Clearly define what it means for the feature to be implemented and
considered stable.

If the feature you are introducing has high complexity, consider adding graduation
milestones with these graduation criteria:
- [Maturity levels (`alpha`, `beta`, `stable`)][maturity-levels]
- [Feature gate][feature gate] lifecycle
- [Deprecation policy][deprecation-policy]

[feature gate]: https://git.k8s.io/community/contributors/devel/sig-architecture/feature-gates.md
[maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions
[deprecation-policy]: https://kubernetes.io/docs/reference/using-api/deprecation-policy/
-->

N/A

## Implementation History

<!--
Major milestones in the lifecycle of a KEP should be tracked in this section.
Major milestones might include:
- the `Summary` and `Motivation` sections being merged, signaling SIG acceptance
- the `Proposal` section being merged, signaling agreement on a proposed design
- the date implementation started
- the first Kubernetes release where an initial version of the KEP was available
- the version of Kubernetes where the KEP graduated to general availability
- when the KEP was retired or superseded
-->

## Drawbacks

Delaying workload admission until all pods are scheduled may decrease
throughput significantly, especially if there is enough resource capacity
that could otherwise be used to start multiple jobs at the same time.

## Alternatives

<!--
What other approaches did you consider, and why did you rule them out? These do
not need to be as detailed as the proposal, but should include enough
information to express the idea and why it was not acceptable.
-->

#### Delay job start instead of workload admission

When a workload is nominated, its admission is blocked (rejected) until all the
already admitted workloads are in the `PodsReady` condition. Instead, we could
admit the workload, but delay its job start until the condition is satisfied.

**Reasons for discarding/deferring**

It would leak the implementation details of Kueue scheduling to the Kueue job
controller.

#### Pod Resource Reservation

[Pod Resource Reservation](https://docs.google.com/document/d/1sbFUA_9qWtorJkcukNULr12FKX6lMvISiINxAURHNFo/edit#)
is another mechanism, currently under discussion, that could ensure all pods get
their resources assigned.

**Reasons for discarding/deferring**

The mechanism is in an early design phase and requires changes to core Kubernetes,
meaning that it is at least 8 months away from being available by default in
Kubernetes (two release cycles, for Alpha and Beta versions). While this might
be a viable long-term solution, we aim for a solution which can be adopted by
users much earlier. Additionally, in this work we aim to introduce APIs which
will be easy to adapt in the future to use a different underlying mechanism.

#### More granular configuration to enable the mechanism

Allowing users to opt in to this feature at more granular levels of the
Kueue API (Job level, LocalQueue, ClusterQueue, ResourceFlavor) would increase
admission throughput.

One considered option is to enable the feature per Job with a Job annotation;
however, it would increase the surface of the Job API.

Another possibility is to use LocalQueue for defaulting of the opt-in setting
for workloads submitted to the local queue. As with the Job level,
the mechanism may not be necessary when the Job is admitted to a resource
flavor which does not require node provisioning. In that case, one can argue the
mechanism should be opted in at neither the Job nor the LocalQueue level.

A further option is to opt in to waiting for pods ready at the ResourceFlavor
level to allow concurrent pod scheduling if the underlying resources don't require
provisioning. Here, one concern is that it would make the implementation
more involved, as ResourceFlavors are assigned during workload admission, so
it is not admission that would be blocked, but the unsuspending of the Job
itself. This could in turn complicate Kueue's Job Controller, which is
responsible for unsuspending Jobs.

**Reasons for discarding/deferring**

The support for all-or-nothing scheduling is likely to evolve in the future,
allowing it to be enabled at more granular levels of the API; however, it remains
unclear which level would best satisfy user needs long-term. Thus, we want
to keep the API commitments small for now.