<!-- sigs.k8s.io/kueue@v0.6.2/keps/1282-pods-ready-requeue-strategy/README.md -->

# KEP-1282: Pods Ready Requeue Strategy

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories (Optional)](#user-stories-optional)
    - [Story 1](#story-1)
    - [Story 2](#story-2)
    - [Story 3](#story-3)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [API Changes](#api-changes)
    - [KueueConfig](#kueueconfig)
    - [Workload](#workload)
  - [Changes to Queue Sorting](#changes-to-queue-sorting)
    - [Existing Sorting](#existing-sorting)
    - [Proposed Sorting](#proposed-sorting)
  - [Exponential Backoff Mechanism](#exponential-backoff-mechanism)
    - [Evaluation](#evaluation)
  - [Test Plan](#test-plan)
    - [Prerequisite testing updates](#prerequisite-testing-updates)
    - [Unit Tests](#unit-tests)
    - [Integration tests](#integration-tests)
  - [Graduation Criteria](#graduation-criteria)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
  - [Create "FrontOfQueue" and "BackOfQueue"](#create-frontofqueue-and-backofqueue)
  - [Configure at the ClusterQueue level](#configure-at-the-clusterqueue-level)
  - [Make knob to be possible to set timeout until the workload is deactivated](#make-knob-to-be-possible-to-set-timeout-until-the-workload-is-deactivated)
    - [Evaluation](#evaluation-1)
<!-- /toc -->

## Summary

Introduce new options that allow administrators to configure how Workloads are placed back in the queue after being evicted due to readiness checks.

## Motivation

### Goals

* Allow administrators to configure requeuing behavior to ensure fair resource sharing after workloads fail to start running once they have been admitted.
### Non-Goals

* Providing options for how to sort requeued workloads after priority-based evictions (no user stories).

## Proposal

Make queue placement after pod-readiness eviction configurable at the level of the Kueue configuration.

### User Stories (Optional)

#### Story 1

Consider the following scenario:

* A ClusterQueue has 2 ResourceFlavors.
* Kueue has admitted a workload on ResourceFlavor #2.
* There is a stock-out on the machine type needed to schedule this workload and the cluster autoscaler is unable to provision the necessary Nodes.
* The workload gets evicted by Kueue because an administrator has configured the `waitForPodsReady` setting.
* While the workload was pending, capacity freed up on ResourceFlavor #1.

In this case, the administrator would like the evicted workload to be requeued as soon as possible on the newly available capacity.

#### Story 2

In the story 1 scenario, when we set `waitForPodsReady.requeuingStrategy.timestamp=Creation`,
the workload can repeatedly, or even endlessly, be placed at the front of the queue after eviction for the following reasons:

1. The workload doesn't have proper configuration, such as image pull credentials or a PVC name.
2. The cluster can satisfy the flavorQuotas, but no single node has the resources that each podSet requests.
3. If there are multiple resource flavors that match the workload (for example, flavors 1 & 2)
and the workload was running on flavor 2, it's likely that the workload will be readmitted
on the same flavor indefinitely.

Specifically, the second reason will often occur if the available quota is fragmented across multiple nodes,
such that the workload can't be scheduled on any single node even though there is enough quota in the cluster.
For example, given that a workload requesting 2 GPUs is submitted to a cluster that
has 2 worker nodes with 4 GPUs each, of which 3 are used (which means 1 GPU is free on each node),
the workload will be repeatedly evicted because of the lack of resources on each node, even though the cluster has enough capacity.

In this case, to avoid rapid repetition of the admission and eviction cycle,
the administrator would like to use an exponential backoff mechanism and add a maximum number of retries.

#### Story 3

In the story 2 scenario, after an evicted workload reaches the maximum retry criterion
and no longer backs off, we want to easily requeue the workload without recreating the job.
This is possible if the Workload is deactivated (`.spec.active`=`false`) as opposed to deleting it.

### Risks and Mitigations

<!--
What are the risks of this proposal, and how do we mitigate? Think broadly.
For example, consider both security and how this will impact the larger
Kubernetes ecosystem.

How will security be reviewed, and by whom?

How will UX be reviewed, and by whom?

Consider including folks who also work outside the SIG or subproject.
-->

## Design Details

### API Changes

#### KueueConfig

Add fields to the KueueConfig to allow administrators to specify which timestamp to consider during queue sorting (under the pre-existing `waitForPodsReady` block).

Possible settings:

* `Eviction` (Back of queue)
* `Creation` (Front of queue)

```go
type WaitForPodsReady struct {
	...
	// RequeuingStrategy defines the strategy for requeuing a Workload.
	// +optional
	RequeuingStrategy *RequeuingStrategy `json:"requeuingStrategy,omitempty"`
}

type RequeuingStrategy struct {
	// Timestamp defines the timestamp used for requeuing a Workload
	// that was evicted due to Pod readiness.
	// The possible values are:
	//
	// - `Eviction` (default): the timestamp of the Workload's `Evicted` condition with the `PodsReadyTimeout` reason.
	// - `Creation`: the Workload's `.metadata.creationTimestamp`.
	//
	// +optional
	Timestamp *RequeuingTimestamp `json:"timestamp,omitempty"`

	// BackoffLimitCount defines the maximum number of re-queuing retries.
	// Once the number is reached, the workload is deactivated (`.spec.active`=`false`).
	// When it is null, the workload will be repeatedly and endlessly re-queued.
	//
	// Every backoff duration is about "1.41284738^(n-1)+Rand" where the "n" represents the "workloadStatus.requeueState.count",
	// and the "Rand" represents the random jitter. During this time, the workload is taken as inadmissible and
	// other workloads will have a chance to be admitted.
	// For example, when "waitForPodsReady.timeout" is the default, the workload deactivation time is as follows:
	//   {backoffLimitCount, workloadDeactivationSeconds}
	//   ~= {1, 601}, {2, 902}, ..., {5, 1811}, ..., {10, 3374}, ..., {20, 8730}, ..., {30, 86400(=24 hours)}, ...
	//
	// Defaults to null.
	// +optional
	BackoffLimitCount *int32 `json:"backoffLimitCount,omitempty"`
}

type RequeuingTimestamp string

const (
	// CreationTimestamp timestamp (from Workload .metadata.creationTimestamp).
	CreationTimestamp RequeuingTimestamp = "Creation"

	// EvictionTimestamp timestamp (from Workload .status.conditions).
	EvictionTimestamp RequeuingTimestamp = "Eviction"
)
```

#### Workload

Add a new field, "requeueState", to the Workload to allow recording the following items:

1. the number of times a workload has been re-queued
2. when the workload was re-queued or will be re-queued

```go
type WorkloadStatus struct {
	...
	// requeueState holds the re-queue state
	// when a workload meets Eviction with PodsReadyTimeout reason.
	//
	// +optional
	RequeueState *RequeueState `json:"requeueState,omitempty"`
}

type RequeueState struct {
	// count records the number of times a workload has been re-queued.
	// When a deactivated (`.spec.active`=`false`) workload is reactivated (`.spec.active`=`true`),
	// this count is reset to null.
	//
	// +optional
	// +kubebuilder:validation:Minimum=0
	Count *int32 `json:"count,omitempty"`

	// requeueAt records the time when a workload will be re-queued.
	// When a deactivated (`.spec.active`=`false`) workload is reactivated (`.spec.active`=`true`),
	// this time is reset to null.
	//
	// +optional
	RequeueAt *metav1.Time `json:"requeueAt,omitempty"`
}
```

### Changes to Queue Sorting

#### Existing Sorting

Currently, workloads within a ClusterQueue are sorted based on 1. Priority and 2. Timestamp of eviction, if evicted, otherwise time of creation.

#### Proposed Sorting

The `pkg/workload` package could be modified to include a conditional (`if evictionReason == kueue.WorkloadEvictedByPodsReadyTimeout`)
that controls which timestamp to return based on the configured ordering strategy.
The same sorting logic would also be used when sorting the heads of queues.

Update the `apis/config/<version>` package to include `Creation` and `Eviction` constants.

### Exponential Backoff Mechanism

When the kueueConfig `backoffLimitCount` is set and workloads are evicted by waitForPodsReady,
the queueManager holds the evicted workloads as inadmissible workloads for the exponential backoff duration.
During this time, other workloads will have a chance to be admitted.
The queueManager calculates an exponential backoff duration with [the Step function](https://pkg.go.dev/k8s.io/apimachinery/pkg/util/wait@v0.29.1#Backoff.Step)
according to $1.41284738^{(n-1)}+Rand$, where $n$ represents `workloadStatus.requeueState.count` and $Rand$ represents the random jitter.

Considering the `.waitForPodsReady.timeout` (default: 300 seconds),
an evicted workload with the `PodsReadyTimeout` reason keeps being re-queued
for the following period, where $t$ represents `.waitForPodsReady.timeout`:

$$t(n+1) + \sum_{k=1}^{n}(1.41284738^{(k-1)} + Rand)$$

Given that `backoffLimitCount` equals `30` and `waitForPodsReady.timeout` equals `300` (the default),
the result equals 24 hours (+ $Rand$ seconds).

#### Evaluation

When a workload eviction is issued with the `PodsReadyTimeout` condition,
the workload controller increments `.status.requeueState.count` by 1 and
sets `.status.requeueState.requeueAt` to the time when the workload will be re-queued.

If a workload's `.status.requeueState.count` reaches the kueueConfig `.waitForPodsReady.requeuingStrategy.backoffLimitCount`,
the workload controller doesn't modify `.status.requeueState` and instead deactivates the workload by setting `.spec.active` to false.

After that, the jobframework reconciler adds an `Evicted` condition with the `WorkloadInactive` reason to the workload.
Finally, the jobframework reconciler stops the job in the next reconcile.

Additionally, when a workload deactivated by eviction is re-activated, the `requeueState` is reset to null.
If a workload is deactivated in other ways, such as by user operations, the `requeueState` is not reset.
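To make the arithmetic above concrete, the following is a stdlib-only sketch of the total re-queue period before deactivation, ignoring the random jitter term. It is not the actual Kueue code (which derives the per-retry delay from `wait.Backoff.Step` with jitter); the function name `totalSeconds` is illustrative.

```go
package main

import (
	"fmt"
	"math"
)

// base is the growth factor used in this KEP's backoff formula.
const base = 1.41284738

// totalSeconds approximates (jitter omitted) how long a workload keeps
// being re-queued before deactivation:
//
//	t*(n+1) + sum_{k=1}^{n} base^(k-1)
//
// where t is waitForPodsReady.timeout in seconds and n is backoffLimitCount.
func totalSeconds(t float64, n int) float64 {
	total := t * float64(n+1)
	for k := 1; k <= n; k++ {
		total += math.Pow(base, float64(k-1))
	}
	return total
}

func main() {
	// With the default timeout (300s) and backoffLimitCount=30, the
	// workload is deactivated after roughly 24 hours, matching the
	// table in the BackoffLimitCount field comment.
	fmt.Printf("%.0f\n", totalSeconds(300, 30))
}
```

This also reproduces the smaller entries of the `{backoffLimitCount, workloadDeactivationSeconds}` table, e.g. `totalSeconds(300, 1)` is 601.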
### Test Plan

[X] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

##### Prerequisite testing updates

<!--
Based on reviewers feedback describe what additional tests need to be added prior
implementing this enhancement to ensure the enhancements have also solid foundations.
-->

#### Unit Tests

Most of the test coverage should probably live inside of `pkg/queue`. Additional test cases should be added that test different requeuing configurations.

- `pkg/queue`: `Nov 2 2023` - `33.9%`

#### Integration tests

- Add an integration test that matches user story 1.
- Add an integration test to detect flapping associated with preempted workloads being readmitted before the preemptor workload when `requeuingTimestamp: Creation` is set.

### Graduation Criteria

<!--

Clearly define what it means for the feature to be implemented and
considered stable.
If the feature you are introducing has high complexity, consider adding graduation
milestones with these graduation criteria:
- [Maturity levels (`alpha`, `beta`, `stable`)][maturity-levels]
- [Feature gate][feature gate] lifecycle
- [Deprecation policy][deprecation-policy]

[feature gate]: https://git.k8s.io/community/contributors/devel/sig-architecture/feature-gates.md
[maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions
[deprecation-policy]: https://kubernetes.io/docs/reference/using-api/deprecation-policy/
-->

## Implementation History

- Jan 18th: Implemented the re-queue strategy for workloads evicted due to pods-ready (story 1) [#1311](https://github.com/kubernetes-sigs/kueue/pulls/1311)
- Feb 12th: Implemented the re-queuing backoff mechanism triggered by eviction with the PodsReadyTimeout reason (stories 2 and 3) [#1709](https://github.com/kubernetes-sigs/kueue/pulls/1709)

<!--
Major milestones in the lifecycle of a KEP should be tracked in this section.
Major milestones might include:
- the `Summary` and `Motivation` sections being merged, signaling SIG acceptance
- the `Proposal` section being merged, signaling agreement on a proposed design
- the date implementation started
- the first Kubernetes release where an initial version of the KEP was available
- the version of Kubernetes where the KEP graduated to general availability
- when the KEP was retired or superseded
-->

## Drawbacks

* When used with `StrictFIFO`, the `requeuingStrategy.timestamp: Creation` (front of queue) policy could lead to a blocked queue. This was called out in the issue that set the hardcoded [back-of-queue behavior](https://github.com/kubernetes-sigs/kueue/issues/599).
This could be mitigated by recommending administrators select `BestEffortFIFO` when using this setting.
* Pods that never become ready due to invalid images will constantly be requeued to the front of the queue when the creation timestamp is used. [See Kubernetes issue](https://github.com/kubernetes/kubernetes/issues/122300).

## Alternatives

### Create "FrontOfQueue" and "BackOfQueue"

The same concepts could be exposed to users as `FrontOfQueue` or `BackOfQueue` settings instead of `Creation` and `Eviction` timestamps.
These terms would imply that the workload would be prioritized over higher priority workloads in the queue.
This is probably not desired (it would likely lead to rapid preemption upon admission when priority-based preemption is enabled).

### Configure at the ClusterQueue level

These concepts could be configured in the ClusterQueue resource. This alternative would increase flexibility.
Without a clear need for this level of granularity, it might be better to set these options at the controller level, where the `waitForPodsReady` settings already exist.
Furthermore, configuring these settings at the ClusterQueue level introduces the question of which timestamp to use when sorting the heads of all ClusterQueues.

### Make knob to be possible to set timeout until the workload is deactivated

With the `backoffLimitCount` knob alone, it is difficult to estimate how many hours jobs will actually be retried (requeued).
So, it might be useful to add a knob that sets a timeout after which the workload is deactivated.
For the first iteration, we don't add this knob since `backoffLimitCount` alone is enough for the current stories.

```go
type RequeuingStrategy struct {
	...
	// backoffLimitTimeout defines the time for a workload that
	// has once been admitted to reach the PodsReady=true condition.
	// When the time is reached, the workload is deactivated.
	//
	// Defaults to null.
	// +optional
	BackoffLimitTimeout *int32 `json:"backoffLimitTimeout,omitempty"`
}
```

#### Evaluation

When a workload's duration $currentTime - queueOrderingTimestamp$ reaches the kueueConfig `waitForPodsReady.requeuingStrategy.backoffLimitTimeout`,
the workload controller or the queueManager sets `.spec.active` to false.
After that, the jobframework reconciler deactivates the workload.

The workload controller sets `.spec.active` to false only after the workload reconciler has checked that the workload is not finished.
In addition, when the kueue scheduler gets the head workloads from the clusterQueues,
the queueManager sets `.spec.active` to false on any workloads it finds exceeding `backoffLimitTimeout`.