# KEP-78: Dynamically reclaim resources of Pods of a Workload

## Table of Contents

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [Pod Completion Accounting](#pod-completion-accounting)
    - [Reference design for (<code>batch/Job</code>)](#reference-design-for-)
      - [To consider](#to-consider)
  - [API](#api)
- [Implementation](#implementation)
  - [Workload](#workload)
    - [API](#api-1)
    - [<code>pkg/workload</code>](#)
  - [Jobframework](#jobframework)
  - [Batch/Job](#batchjob)
- [Testing Plan](#testing-plan)
  - [NonRegression](#nonregression)
  - [Unit Tests](#unit-tests)
  - [Integration tests](#integration-tests)
- [Implementation History](#implementation-history)
<!-- /toc -->

## Summary

This proposal allows Kueue to reclaim the resources of a successfully completed Pod of a Workload even before the whole Workload completes its execution. Freeing the unused resources of a Workload earlier may allow more pending Workloads to be admitted. This will allow Kueue to utilize the existing resources of the cluster much more efficiently.

In the remainder of this document:
1. *Job* refers to any kind of job supported by Kueue, including `batch/job`, `MPIJob`, `RayJob`, etc.
2. *Pods of a Workload* means the same as *Pods of a Job* and vice-versa.


## Motivation

Currently, the quota assigned to a Job is reclaimed by Kueue only when the whole Job finishes.
For Jobs with multiple Pods, the resources will be reclaimed only after the last Pod of the Job finishes.
This is not efficient, as Jobs might have different needs during execution. For instance, the needs of a `batch/job` whose `parallelism` is equal to `completions` will decrease with every Pod that finishes its execution.

### Goals

- Utilize the unused resources of the successfully completed Pods of a running Job.

### Non-Goals

- Preempting a Workload or Pods of a Workload to free resources for a pending Workload.
- Partially admitting a Workload.
- Monitoring the execution of a Job's Pods.

## Proposal

Reclaim the resources of the succeeded Pods of a running Job as soon as each Pod completes its execution.

We propose to add a new field `.status.ReclaimablePods` to the Workload API. `.status.ReclaimablePods` is a list that holds, for each PodSet, the count of Pods whose resources are no longer needed and could be reclaimed by Kueue.

### Pod Completion Accounting

Since actively monitoring the execution of a Job's pods is outside the scope of the core Kueue implementation, the number of pods whose resources are no longer needed should be reported, per PodSet, by each framework-specific `GenericJob` implementation.

For this purpose, the `GenericJob` interface should be extended with an additional method that reports this information:

```go
type GenericJob interface {
	// ...

	// Get reclaimable pods.
	ReclaimablePods() []ReclaimablePod
}
```

#### Reference design for (`batch/Job`)

Given a job defined with **P** `parallelism` and **C** `completions`, and **n** completed pod executions,
the expected reclaimablePods should be:

```go
[]ReclaimablePod{
	{
		Name:  "main",
		Count: P - min(P, C-n),
	},
}
```

##### To consider
According to [kubernetes/enhancements](https://github.com/kubernetes/enhancements), the algorithm presented above might need to be reworked in order to account for:
- [KEP-3939](https://github.com/kubernetes/enhancements/pull/3940), which adds a new field `terminating` to account for terminating pods. Depending on `spec.RecreatePodsWhen`, the new field needs to be taken into account when reclaimablePods are computed.
- [KEP-3850](https://github.com/kubernetes/enhancements/pull/3967), which adds the ability for an index to fail. If an index fails, the resources previously reserved for it are no longer needed.

### API

A new field `ReclaimablePods` is added to the `.status` of the Workload API.

```go
// WorkloadStatus defines the observed state of Workload.
type WorkloadStatus struct {

	// ...


	// reclaimablePods keeps track of the number of pods within a podset for which
	// the resource reservation is no longer needed.
	// +optional
	ReclaimablePods []ReclaimablePod `json:"reclaimablePods,omitempty"`
}

type ReclaimablePod struct {
	// name is the PodSet name.
	Name string `json:"name"`

	// count is the number of pods for which the requested resources are no longer needed.
	Count int32 `json:"count"`
}
```

## Implementation

### Workload
#### API

- Add the new field in the workload's status.
- Validate the data in `status.ReclaimablePods`:
  1. The names must be found in the `PodSets`.
  2. The count should never exceed the `PodSets` count.
  3. The count should not decrease if the workload is admitted.


#### `pkg/workload`

Rework the way `Info.TotalRequests` is computed in order to take the `ReclaimablePods` into account.

### Jobframework

Adapt the `GenericJob` interface, and ensure that the `ReclaimablePods` information provided is synced with its associated workload status.

### Batch/Job

Adapt its `GenericJob` implementation to the new interface.


## Testing Plan

### NonRegression
The new implementation should not impact any of the existing unit, integration or e2e tests. A workload that has no `ReclaimablePods` populated should behave the same as it does prior to this implementation.

### Unit Tests

All of Kueue's core components must be covered by unit tests.
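
As a rough illustration of the kind of logic such unit tests could exercise, the sketch below implements the `batch/Job` reference formula (`P - min(P, C-n)`) and the three validation rules listed for `status.ReclaimablePods`. It is not Kueue's actual code: the helper names `reclaimableCount` and `validateReclaimablePods` are hypothetical, PodSets are reduced to a name-to-count map, and treating a missing entry as a count of 0 is an assumption made here for simplicity.

```go
package main

import "fmt"

// ReclaimablePod mirrors the type proposed in the API section (JSON tags omitted).
type ReclaimablePod struct {
	Name  string
	Count int32
}

// reclaimableCount is a hypothetical helper implementing the reference
// formula Count = P - min(P, C-n) for a batch/Job.
func reclaimableCount(parallelism, completions, succeeded int32) int32 {
	remaining := completions - succeeded
	if remaining < 0 {
		remaining = 0 // defensive clamp; not part of the KEP formula
	}
	if remaining > parallelism {
		remaining = parallelism // min(P, C-n)
	}
	return parallelism - remaining
}

// validateReclaimablePods is a hypothetical check of the three rules:
// known PodSet name, count not above the PodSet count, and no decrease
// while the workload is admitted. podSets maps PodSet name to pod count.
func validateReclaimablePods(podSets map[string]int32, old, updated []ReclaimablePod, admitted bool) error {
	next := make(map[string]int32, len(updated))
	for _, rp := range updated {
		total, found := podSets[rp.Name]
		if !found {
			return fmt.Errorf("reclaimablePods[%q]: no PodSet with this name", rp.Name)
		}
		if rp.Count > total {
			return fmt.Errorf("reclaimablePods[%q]: count %d exceeds PodSet count %d", rp.Name, rp.Count, total)
		}
		next[rp.Name] = rp.Count
	}
	if admitted {
		// Assumption: an entry missing from the update counts as 0.
		for _, rp := range old {
			if next[rp.Name] < rp.Count {
				return fmt.Errorf("reclaimablePods[%q]: count cannot decrease from %d to %d", rp.Name, rp.Count, next[rp.Name])
			}
		}
	}
	return nil
}

func main() {
	// A Job with parallelism 2 and completions 5.
	for n := int32(0); n <= 5; n++ {
		fmt.Printf("succeeded=%d reclaimable=%d\n", n, reclaimableCount(2, 5, n))
	}

	podSets := map[string]int32{"main": 2}
	old := []ReclaimablePod{{Name: "main", Count: 1}}
	fmt.Println(validateReclaimablePods(podSets, old, []ReclaimablePod{{Name: "main", Count: 2}}, true))
	fmt.Println(validateReclaimablePods(podSets, old, []ReclaimablePod{{Name: "main", Count: 0}}, true))
}
```

With `parallelism` lower than `completions`, nothing becomes reclaimable until fewer than `parallelism` completions remain, which is the case a table-driven test would most likely want to cover alongside the `parallelism == completions` case.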

### Integration tests
* Scheduler
  - Checking if a Workload gets admitted when an admitted Workload releases part of its assigned resources.

* Kueue Job Controller (Optional)
  - Checking that the resources owned by a Job are released to the cache and clusterQueue when a Pod of the Job succeeds.

## Implementation History

Dynamically Reclaiming Resources is tracked as part of [enhancement#78](https://github.com/kubernetes-sigs/kueue/issues/78).