
# KEP-420: Allow partial admission of PodSets

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories](#user-stories)
- [Design Details](#design-details)
  - [Workload API](#workload-api)
  - [Scheduler / Flavorassignment](#scheduler--flavorassignment)
  - [Jobframework](#jobframework)
  - [batch/Job controller](#batchjob-controller)
  - [kubeflow/MPIJob controller](#kubeflowmpijob-controller)
  - [Test Plan](#test-plan)
    - [Unit Tests](#unit-tests)
    - [Integration tests](#integration-tests)
  - [Graduation Criteria](#graduation-criteria)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
<!-- /toc -->

## Summary

Add an optional way to allow partial admission of a Workload when full admission is not possible.

## Motivation

In practice, not all Workloads require the parallel execution of the full `count` of a `PodSet`; for such cases, having a way to reserve only part of the quota helps prevent starvation.

For example, if a batch/Job has parallelism x and there is only quota available for y < x pods, the job could still be admitted if it can work with a lower parallelism.

### Goals

Provide an opt-in way for Workloads to accept admission with a lower count of pods when the full count is not available.

### Non-Goals

Since this is an opt-in feature, the parent job is expected to accept the partial admission parameters provided by Kueue.

Kueue will not take any measures to ensure that the parent job respects the assigned quota.

## Proposal

Change the way the flavor assigner works to support decrementing the pod counts in order to find a better fit for the current workload.
In case a partial fit is chosen, the jobframework reconciler should provide the admitted pod counts to the parent job before unsuspending it, in a similar fashion to how the node selectors are provided. In case the job gets suspended, the original pod counts should be restored in order to allow a potential future admission with the original counts.

### User Stories

Kueue issue [420](https://github.com/kubernetes-sigs/kueue/issues/420) provides details on the initial feature request and its applicability to `batch/Job`.

## Design Details

### Workload API

```go
type PodSet struct {
    // .......

    // count is the number of pods for the spec.
    // +kubebuilder:validation:Minimum=1
    Count int32 `json:"count"`

    // minCount is the minimum number of pods for the spec acceptable
    // in case of partial admission.
    //
    // If not provided, partial admission for the current PodSet is not
    // enabled.
    // +optional
    MinCount *int32 `json:"minCount,omitempty"`
}
```

### Scheduler / Flavorassignment

In case the workload proposed for the current scheduling cycle does not fit, with or without preemption, in the currently available quota and any of its PodSets allows partial admission, try to find a lower counts combination that fits the available quota, with or without borrowing.

The search should be optimized (binary search) and preserve the proportion of pods lost across the variable count PodSets.

The accepted number of pods in each PodSet is recorded in `workload.Status.Admission.PodSetAssignments[*].ResourceUsage.Count`.

In order to evaluate the potential success of the preemption, the preemption process should be split into two steps (a rough sketch follows this list):
- Target selection, which selects the workloads within the cohort that need to be evicted; if the list is empty, the assignment is treated as `NoFit`. This step takes place during nomination.
- Preemption issuing, which performs the actual eviction of the previously selected targets.
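
Below is a minimal, self-contained Go sketch of this two-step split. The `Preemptor`, `SelectTargets`, and `IssuePreemptions` names are hypothetical stand-ins, not the actual kueue preemptor API; the sketch only illustrates the control flow, not a real target-selection policy.

```go
package main

import "fmt"

type Workload struct{ Name string }

type Preemptor struct{}

// SelectTargets stands in for the nomination-time step: pick the workloads in
// the cohort that would have to be evicted to make the assignment fit.
// An empty result means the assignment has to be treated as NoFit.
func (p *Preemptor) SelectTargets(wl Workload, cohort []Workload) []Workload {
	// Placeholder policy: select every candidate in the cohort.
	return cohort
}

// IssuePreemptions stands in for the second step: evict the targets that were
// selected during nomination, once the (possibly reduced) assignment is final.
func (p *Preemptor) IssuePreemptions(targets []Workload) {
	for _, t := range targets {
		fmt.Println("evicting", t.Name)
	}
}

func main() {
	p := &Preemptor{}
	wl := Workload{Name: "sample"}
	targets := p.SelectTargets(wl, []Workload{{Name: "low-priority"}})
	if len(targets) == 0 {
		fmt.Println("NoFit: retry with lower PodSet counts or reject")
		return
	}
	p.IssuePreemptions(targets)
}
```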

**Partial admission with multiple variable count PodSets**

If multiple PodSets within a workload have variable counts, the way the counts should be decreased is highly dependent on the needs of the job framework; currently the following is proposed:

Starting with the PodSets:

```go
[]podSet{{Count: 1}, {Count: 4, MinCount: 2}, {Count: 20, MinCount: 10}}
```

and only being able to admit 19 pods, the assignment should end up with:

```go
[]podSet{{Count: 1}, {Count: 3}, {Count: 15}}
```

Both PodSets that accept partial admission lose 50% of their reducible range.
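
As an illustration of the proportional reduction described above, here is a minimal, runnable Go sketch; the `podSet` struct and the `shrinkToFit` helper are simplified stand-ins (the real flavorassigner would combine this with the binary search mentioned earlier). It reproduces the 25 → 19 example:

```go
package main

import (
	"fmt"
	"math"
)

// podSet mirrors only the fields relevant to this sketch.
type podSet struct {
	Count    int32
	MinCount *int32 // nil means the PodSet does not allow partial admission
}

func ptr(v int32) *int32 { return &v }

// shrinkToFit reduces the variable count PodSets so that the total number of
// pods does not exceed budget, spreading the loss proportionally to each
// PodSet's reducible range (Count - MinCount). It returns false if even the
// minimum counts do not fit.
func shrinkToFit(podSets []podSet, budget int32) ([]podSet, bool) {
	var total, reducible int32
	for _, ps := range podSets {
		total += ps.Count
		if ps.MinCount != nil {
			reducible += ps.Count - *ps.MinCount
		}
	}
	if total <= budget {
		return podSets, true // everything fits unchanged
	}
	deficit := total - budget
	if deficit > reducible {
		return nil, false // NoFit even with the minimum counts
	}
	fraction := float64(deficit) / float64(reducible)
	out := make([]podSet, len(podSets))
	for i, ps := range podSets {
		out[i] = podSet{Count: ps.Count}
		if ps.MinCount != nil {
			loss := int32(math.Ceil(float64(ps.Count-*ps.MinCount) * fraction))
			out[i].Count = ps.Count - loss
		}
	}
	return out, true
}

func main() {
	podSets := []podSet{{Count: 1}, {Count: 4, MinCount: ptr(2)}, {Count: 20, MinCount: ptr(10)}}
	reduced, ok := shrinkToFit(podSets, 19)
	fmt.Println(ok) // true
	for _, ps := range reduced {
		fmt.Print(ps.Count, " ") // 1 3 15
	}
	fmt.Println()
}
```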

The presented solution does not take into account which of the variable count PodSets is causing the `NoFit` status and can potentially decrease the count of PodSets that would otherwise fit. If this behaviour is not suitable for a specific job framework, the integration layer of that framework can limit the number of variable count PodSets to 1.

### Jobframework

```diff
type GenericJob interface {
    // ...

-    // RunWithNodeAffinity will inject the node affinity extracting from workload to job and unsuspend the job.
+    // RunWithPodSetsInfo will inject the node affinity and PodSet counts extracted from the workload into the job and unsuspend the job.
-    RunWithNodeAffinity(nodeSelectors []PodSetNodeSelector)
+    RunWithPodSetsInfo(nodeSelectors []PodSetNodeSelector, podSetCounts []int32)
-    // RestoreNodeAffinity will restore the original node affinity of job.
+    // RestorePodSetsInfo will restore the original node affinity and PodSet counts of the job.
-    RestoreNodeAffinity(nodeSelectors []PodSetNodeSelector)
+    RestorePodSetsInfo(nodeSelectors []PodSetNodeSelector, podSetCounts []int32)

    // ...
}
```
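
As a rough illustration, the jobframework reconciler could derive the `podSetCounts` argument from the Workload admission along these lines; the struct definitions below are simplified stand-ins mirroring the `PodSetAssignments[*].ResourceUsage.Count` path mentioned earlier, not the real kueue API types:

```go
package main

import "fmt"

// Simplified stand-ins for the relevant Workload admission fields.
type ResourceUsage struct {
	Count int32
}

type PodSetAssignment struct {
	ResourceUsage ResourceUsage
}

type Admission struct {
	PodSetAssignments []PodSetAssignment
}

// extractPodSetCounts builds the podSetCounts slice that the reconciler would
// pass to RunWithPodSetsInfo together with the node selectors.
func extractPodSetCounts(adm Admission) []int32 {
	counts := make([]int32, len(adm.PodSetAssignments))
	for i, psa := range adm.PodSetAssignments {
		counts[i] = psa.ResourceUsage.Count
	}
	return counts
}

func main() {
	adm := Admission{PodSetAssignments: []PodSetAssignment{
		{ResourceUsage: ResourceUsage{Count: 1}},
		{ResourceUsage: ResourceUsage{Count: 3}},
		{ResourceUsage: ResourceUsage{Count: 15}},
	}}
	fmt.Println(extractPodSetCounts(adm)) // [1 3 15]
}
```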

### batch/Job controller

Besides adapting `RunWithPodSetsInfo` and `RestorePodSetsInfo`, it should also:

- rework `PodSets()` to populate `MinCount` if the job is marked to support partial admission (see the sketch after this list).
  * jobs supporting partial admission should have a dedicated annotation, e.g. `kueue.x-k8s.io/job-min-parallelism`, indicating the minimum `parallelism` acceptable by the job in case of partial admission.
  * jobs which need the `completions` count kept in sync with `parallelism` should indicate this in a second annotation, `kueue.x-k8s.io/job-completions-equal-parallelism`.
- rework `EquivalentToWorkload` to account for potential differences in the `PodSets`' `Parallelism`.
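
A hedged sketch of how the reworked `PodSets()` could derive `MinCount` from the proposed annotation; the helper name and validation rules below are illustrative only, not the actual controller code:

```go
package main

import (
	"fmt"
	"strconv"
)

const (
	// Annotation names proposed in this KEP.
	minParallelismAnnotation              = "kueue.x-k8s.io/job-min-parallelism"
	completionsEqualParallelismAnnotation = "kueue.x-k8s.io/job-completions-equal-parallelism"
)

// minCountFromAnnotations returns the MinCount to set on the job's PodSet, or
// nil if the job did not opt in to partial admission. parallelism is the
// job's spec.parallelism.
func minCountFromAnnotations(annotations map[string]string, parallelism int32) *int32 {
	v, found := annotations[minParallelismAnnotation]
	if !found {
		return nil // partial admission not requested
	}
	minVal, err := strconv.ParseInt(v, 10, 32)
	if err != nil || int32(minVal) < 1 || int32(minVal) >= parallelism {
		return nil // ignore invalid or pointless values
	}
	m := int32(minVal)
	return &m
}

func main() {
	ann := map[string]string{minParallelismAnnotation: "5"}
	if mc := minCountFromAnnotations(ann, 10); mc != nil {
		fmt.Println("MinCount:", *mc) // MinCount: 5
	}
}
```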

### kubeflow/MPIJob controller

In the case of MPIJob, `j.Spec.RunPolicy.SchedulingPolicy.MinAvailable` can be used to provide a `MinCount` for the `Worker` PodSet, while `j.Spec.MPIReplicaSpecs[kubeflow.MPIReplicaTypeWorker].Replicas` is updated before unsuspending the job and restored after suspending it.

Whether an MPIJob supports partial admission or not can be deduced from `MinAvailable`, without the need for a dedicated annotation.
Additional research is needed into the potential usage of multiple variable count PodSets.
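
A possible mapping from `MinAvailable` to the Worker PodSet's `MinCount` is sketched below, using simplified stand-in types rather than the real mpi-operator API:

```go
package main

import "fmt"

// Simplified stand-ins for the MPIJob fields referenced above.
type SchedulingPolicy struct{ MinAvailable *int32 }
type RunPolicy struct{ SchedulingPolicy *SchedulingPolicy }
type ReplicaSpec struct{ Replicas *int32 }
type MPIJobSpec struct {
	RunPolicy       RunPolicy
	MPIReplicaSpecs map[string]*ReplicaSpec
}

const MPIReplicaTypeWorker = "Worker"

func ptr(v int32) *int32 { return &v }

// workerMinCount derives the Worker PodSet's MinCount from MinAvailable;
// nil means partial admission is not enabled for this MPIJob.
func workerMinCount(spec MPIJobSpec) *int32 {
	sp := spec.RunPolicy.SchedulingPolicy
	workers := spec.MPIReplicaSpecs[MPIReplicaTypeWorker]
	if sp == nil || sp.MinAvailable == nil || workers == nil || workers.Replicas == nil {
		return nil
	}
	if *sp.MinAvailable >= *workers.Replicas {
		return nil // nothing to reduce
	}
	return sp.MinAvailable
}

func main() {
	spec := MPIJobSpec{
		RunPolicy:       RunPolicy{SchedulingPolicy: &SchedulingPolicy{MinAvailable: ptr(4)}},
		MPIReplicaSpecs: map[string]*ReplicaSpec{MPIReplicaTypeWorker: {Replicas: ptr(10)}},
	}
	fmt.Println(*workerMinCount(spec)) // 4
}
```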

### Test Plan

No regressions in the existing tests should be observed.

[X] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

#### Unit Tests

The changes in the flavorassignment should be covered by unit tests.

#### Integration tests

The `scheduler` and `controllers/job` integration tests should be extended to cover the new capabilities.

### Graduation Criteria


## Implementation History


## Drawbacks


## Alternatives