# KEP-420: Allow partial admission of PodSets

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories](#user-stories)
- [Design Details](#design-details)
  - [Workload API](#workload-api)
  - [Scheduler / Flavorassignment](#scheduler--flavorassignment)
  - [Jobframework](#jobframework)
  - [batch/Job controller](#batchjob-controller)
  - [kubeflow/MPIJob controller](#kubeflowmpijob-controller)
  - [Test Plan](#test-plan)
    - [Unit Tests](#unit-tests)
    - [Integration tests](#integration-tests)
  - [Graduation Criteria](#graduation-criteria)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
<!-- /toc -->

## Summary

Add an opt-in way to partially admit a Workload when full admission is not possible.

## Motivation

In practice, not all Workloads require all `count` pods of a `PodSet` to run in parallel. For such cases, a way to reserve only part of the requested quota can prevent starvation.

For example, if a batch/Job has a parallelism of x and quota is available only for y < x, the job could still be admitted if it can work with a lower parallelism.

### Goals

Provide an opt-in way for Workloads to accept admission with a lower pod count if the full count is not available.

### Non-Goals

Since this is an opt-in feature, the parent job should accept the partial admission parameters provided by Kueue.

Kueue will not take any measures to ensure that the parent job respects the assigned quota.

## Proposal

Change the way the flavor assigner works to support decrementing the pod count in order to find a better fit for the current workload.
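
To make "decrementing the pod count" concrete, the following is a minimal Go sketch of a proportional reduction driven by a binary search, using the PodSet example given under Design Details. It is not Kueue's implementation: the names `podSet`, `reduceCounts`, `ptr` and the `fits` callback are hypothetical, and the quota check is simplified to a total pod count, which is an assumption made only for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// podSet is a hypothetical, trimmed-down stand-in for the Workload PodSet.
type podSet struct {
	Count    int32
	MinCount *int32 // nil means partial admission is not enabled
}

// ptr is a small helper for building *int32 literals.
func ptr(v int32) *int32 { return &v }

// reduceCounts is an illustrative helper, not part of Kueue. It returns the
// largest proportionally reduced counts accepted by fits. It assumes fits is
// monotonic: if a set of counts fits, any smaller set fits as well.
func reduceCounts(podSets []podSet, fits func([]int32) bool) ([]int32, bool) {
	// deltas[i] is the range (Count - MinCount) PodSet i is allowed to lose.
	deltas := make([]int32, len(podSets))
	var total int32
	for i, ps := range podSets {
		if ps.MinCount != nil && *ps.MinCount < ps.Count {
			deltas[i] = ps.Count - *ps.MinCount
			total += deltas[i]
		}
	}

	// countsAt maps a step in [0, total] to counts: step 0 keeps the full
	// counts, step == total brings every variable PodSet down to MinCount.
	countsAt := func(step int32) []int32 {
		counts := make([]int32, len(podSets))
		for i, ps := range podSets {
			var reduction int32
			if total > 0 {
				reduction = int32(int64(deltas[i]) * int64(step) / int64(total))
			}
			counts[i] = ps.Count - reduction
		}
		return counts
	}

	// Binary search for the smallest step whose counts fit.
	steps := int(total) + 1
	idx := sort.Search(steps, func(step int) bool {
		return fits(countsAt(int32(step)))
	})
	if idx == steps {
		return nil, false // not even the minimum counts fit
	}
	return countsAt(int32(idx)), true
}

func main() {
	podSets := []podSet{
		{Count: 1},
		{Count: 4, MinCount: ptr(2)},
		{Count: 20, MinCount: ptr(10)},
	}
	// Assumed quota check: at most 19 pods in total.
	fitsQuota := func(counts []int32) bool {
		var sum int32
		for _, c := range counts {
			sum += c
		}
		return sum <= 19
	}
	counts, ok := reduceCounts(podSets, fitsQuota)
	fmt.Println(counts, ok) // prints: [1 3 15] true
}
```

Searching over a single step value keeps the number of `fits` evaluations logarithmic in the total reducible range and makes every variable count PodSet give up roughly the same fraction of its range.
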
In case a partial fit is chosen, the jobframework reconciler should provide the admitted pod counts to the parent job before unsuspending it, in the same way the node selectors are provided. In case the job gets suspended, the original pod counts should be restored to allow a potential future admission with the original pod counts.

### User Stories

Kueue issue [420](https://github.com/kubernetes-sigs/kueue/issues/420) describes the initial feature and its applicability to `batch/Job`.

## Design Details

### Workload API

```go
type PodSet struct {
	// .......

	// count is the number of pods for the spec.
	// +kubebuilder:validation:Minimum=1
	Count int32 `json:"count"`

	// minCount is the minimum number of pods for the spec acceptable
	// in case of partial admission.
	//
	// If not provided, partial admission for the current PodSet is not
	// enabled.
	// +optional
	MinCount *int32 `json:"minCount,omitempty"`
}
```

### Scheduler / Flavorassignment

If the workload proposed for the current scheduling cycle does not fit in the currently available quota, with or without preemption, and any of its PodSets allows partial admission, try to find a combination of lower counts that fits the available quota, with or without borrowing.

The search should be optimized (binary search) and should preserve the proportion of pods lost across the variable count PodSets.

The accepted number of pods in each PodSet is recorded in `workload.Status.Admission.PodSetAssignments[*].ResourceUsage.Count`.

In order to evaluate the potential success of the preemption, the preemption process should be split into:
- Target selection, which should select the workloads within the cohort that will need to be evicted. If the list is empty, the assignment will be treated as `NoFit`. This step will take place during nomination.
- Preemption issuing, which will do the actual eviction of the previously selected targets.

**Partial admission with multiple variable count PodSets**

If multiple PodSets within a workload have variable counts, the way the counts should be decreased depends heavily on the needs of the job framework. Currently the following is proposed:

Starting with the PodSets:

```go
[]podSet{{Count: 1}, {Count: 4, MinCount: 2}, {Count: 20, MinCount: 10}}
```

And only being able to admit 19 pods, it should end up with:

```go
[]podSet{{Count: 1}, {Count: 3}, {Count: 15}}
```

With both PodSets that accept partial admission losing 50% of their range (`Count - MinCount`): 1 of 2 pods and 5 of 10 pods, respectively.

The presented solution does not take into account which of the variable count PodSets causes the `NoFit` status, so it can potentially decrease the count of PodSets that would otherwise fit. If this behaviour is not suitable for a specific job framework, the integration layer of the framework can limit the number of variable count PodSets to 1.

### Jobframework

```diff
 type GenericJob interface {
     // ...

-    // RunWithNodeAffinity will inject the node affinity extracting from workload to job and unsuspend the job.
+    // RunWithPodSetsInfo will inject the node affinity and PodSet counts extracted from workload to the job and unsuspend the job.
-    RunWithNodeAffinity(nodeSelectors []PodSetNodeSelector)
+    RunWithPodSetsInfo(nodeSelectors []PodSetNodeSelector, podSetCounts []int32)
-    // RestoreNodeAffinity will restore the original node affinity of job.
+    // RestorePodSetsInfo will restore the original node affinity and PodSet counts of the job.
-    RestoreNodeAffinity(nodeSelectors []PodSetNodeSelector)
+    RestorePodSetsInfo(nodeSelectors []PodSetNodeSelector, podSetCounts []int32)

     // ...
 }
```

### batch/Job controller

Besides adapting `RunWithPodSetsInfo` and `RestorePodSetsInfo`, it should also:

- rework `PodSets()` to populate `MinCount` if the job is marked to support partial admission.
  * jobs supporting partial admission should have a dedicated annotation, e.g. `kueue.x-k8s.io/job-min-parallelism`, indicating the minimum `parallelism` acceptable by the job in case of partial admission.
  * jobs which need the `completions` count kept in sync with `parallelism` should indicate this in a second annotation, `kueue.x-k8s.io/job-completions-equal-parallelism`.
- rework `EquivalentToWorkload` to account for potential differences in `PodSets` spec `Parallelism`.

### kubeflow/MPIJob controller

For MPIJob, `j.Spec.RunPolicy.SchedulingPolicy.MinAvailable` can be used to provide the `MinCount` for the `Worker` PodSet, with `j.Spec.MPIReplicaSpecs[kubeflow.MPIReplicaTypeWorker].Replicas` being updated before unsuspending the job and after suspending it.

Whether an MPIJob supports partial admission can be deduced from `MinAvailable`, without the need for a dedicated annotation.
Additional research is needed into the potential usage of multiple variable count PodSets.

### Test Plan

No regressions in the current tests should be observed.

[X] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

#### Unit Tests

The changes in the flavorassignment should be covered by unit tests.

#### Integration tests

The `scheduler` and `controllers/job` integration tests should be extended to cover the new capabilities.

### Graduation Criteria


## Implementation History


## Drawbacks


## Alternatives