
# KEP-1282: Pods Ready Requeue Strategy

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories (Optional)](#user-stories-optional)
    - [Story 1](#story-1)
    - [Story 2](#story-2)
    - [Story 3](#story-3)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [API Changes](#api-changes)
    - [KueueConfig](#kueueconfig)
    - [Workload](#workload)
  - [Changes to Queue Sorting](#changes-to-queue-sorting)
    - [Existing Sorting](#existing-sorting)
    - [Proposed Sorting](#proposed-sorting)
  - [Exponential Backoff Mechanism](#exponential-backoff-mechanism)
    - [Evaluation](#evaluation)
  - [Test Plan](#test-plan)
      - [Prerequisite testing updates](#prerequisite-testing-updates)
    - [Unit Tests](#unit-tests)
    - [Integration tests](#integration-tests)
  - [Graduation Criteria](#graduation-criteria)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
  - [Create &quot;FrontOfQueue&quot; and &quot;BackOfQueue&quot;](#create-frontofqueue-and-backofqueue)
  - [Configure at the ClusterQueue level](#configure-at-the-clusterqueue-level)
  - [Add a knob to set a timeout until the workload is deactivated](#add-a-knob-to-set-a-timeout-until-the-workload-is-deactivated)
    - [Evaluation](#evaluation-1)
<!-- /toc -->

## Summary

Introduce new options that allow administrators to configure how Workloads are placed back in the queue after being evicted due to readiness checks.

## Motivation

### Goals

* Allowing administrators to configure requeuing behavior to ensure fair resource sharing when workloads fail to start running after they have been admitted.

### Non-Goals

* Providing options for how to sort requeued workloads after priority-based evictions (no user stories).

## Proposal

Make queue placement after pod-readiness evictions configurable in the Kueue configuration.

### User Stories (Optional)

#### Story 1

Consider the following scenario:

* A ClusterQueue has 2 ResourceFlavors.
* Kueue has admitted a workload on ResourceFlavor #2.
* There is a stock-out on the machine type needed to schedule this workload and the cluster autoscaler is unable to provision the necessary Nodes.
* The workload gets evicted by Kueue because an administrator has configured the `waitForPodsReady` setting.
* While the workload was pending, capacity freed up on ResourceFlavor #1.

In this case, the administrator would like the evicted workload to be requeued as soon as possible on the newly available capacity.

#### Story 2

In the Story 1 scenario, when we set `waitForPodsReady.requeuingStrategy.timestamp=Creation`,
the workload can repeatedly, or even endlessly, be put at the front of the queue after eviction for the following reasons:

1. The workload doesn't have a proper configuration, such as image pull credentials or a PVC name.
2. The cluster can meet the flavorQuotas, but no single node has the resources that each podSet requests.
3. If there are multiple resource flavors that match the workload (for example, flavors 1 & 2)
   and the workload was running on flavor 2, it's likely that the workload will be readmitted
   on the same flavor indefinitely.

Specifically, the second reason will often occur if the available quota is fragmented across multiple nodes,
such that the workload can't be scheduled on any single node even though there is enough quota in the cluster.

For example, given a workload requesting 2 GPUs submitted to a cluster that
has 2 worker nodes with 4 GPUs each, of which 3 GPUs per node are used (which means 1 GPU is free on each node),
the workload will be repeatedly evicted because of the lack of resources on each node, even though the cluster has enough capacity.

In this case, to avoid rapid repetition of the admission and eviction cycle,
the administrator would like to use an exponential backoff mechanism and add a maximum number of retries.

#### Story 3

In the Story 2 scenario, after the evicted workload reaches the maximum number of retries
and will no longer be requeued, we want an easy way to requeue the workload without recreating the job.
This is possible if the Workload is deactivated (`.spec.active`=`false`) as opposed to deleted.

### Risks and Mitigations

<!--
What are the risks of this proposal, and how do we mitigate? Think broadly.
For example, consider both security and how this will impact the larger
Kubernetes ecosystem.

How will security be reviewed, and by whom?

How will UX be reviewed, and by whom?

Consider including folks who also work outside the SIG or subproject.
-->

## Design Details

### API Changes

#### KueueConfig

Add fields to the KueueConfig to allow administrators to specify what timestamp to consider during queue sorting (under the pre-existing waitForPodsReady block).

Possible settings:

* `Eviction` (Back of queue)
* `Creation` (Front of queue)

```go
type WaitForPodsReady struct {
	...
	// RequeuingStrategy defines the strategy for requeuing a Workload.
	// +optional
	RequeuingStrategy *RequeuingStrategy `json:"requeuingStrategy,omitempty"`
}

type RequeuingStrategy struct {
	// Timestamp defines the timestamp used for requeuing a Workload
	// that was evicted due to Pod readiness. The possible values are:
	//
	// - `Eviction` (default) indicates from Workload `Evicted` condition with `PodsReadyTimeout` reason.
	// - `Creation` indicates from Workload .metadata.creationTimestamp.
	//
	// +optional
	Timestamp *RequeuingTimestamp `json:"timestamp,omitempty"`

	// BackoffLimitCount defines the maximum number of re-queuing retries.
	// Once the number is reached, the workload is deactivated (`.spec.active`=`false`).
	// When it is null, workloads will be re-queued repeatedly and endlessly.
	//
	// Every backoff duration is about "1.41284738^(n-1)+Rand" seconds, where "n" represents the "workloadStatus.requeueState.count"
	// and "Rand" represents the random jitter. During this time, the workload is treated as inadmissible, and
	// other workloads will have a chance to be admitted.
	// For example, when the "waitForPodsReady.timeout" is the default, the workload deactivation time is as follows:
	//   {backoffLimitCount, workloadDeactivationSeconds}
	//     ~= {1, 601}, {2, 902}, ...,{5, 1811}, ...,{10, 3374}, ...,{20, 8730}, ...,{30, 86400(=24 hours)}, ...
	//
	// Defaults to null.
	// +optional
	BackoffLimitCount *int32 `json:"backoffLimitCount,omitempty"`
}

type RequeuingTimestamp string

const (
	// CreationTimestamp indicates the Workload creation time (from .metadata.creationTimestamp).
	CreationTimestamp RequeuingTimestamp = "Creation"

	// EvictionTimestamp indicates the Workload eviction time (from the `Evicted` condition in .status.conditions).
	EvictionTimestamp RequeuingTimestamp = "Eviction"
)
```
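
For illustration, here is a minimal, hedged sketch of how an administrator's choice would map onto these fields, assuming the types above and `k8s.io/utils/ptr` are in scope (this snippet is not part of the proposal):

```go
podsReady := WaitForPodsReady{
	RequeuingStrategy: &RequeuingStrategy{
		// Sort requeued workloads by their eviction time (back of queue).
		Timestamp: ptr.To(EvictionTimestamp),
		// Deactivate a workload after 10 re-queuing retries.
		BackoffLimitCount: ptr.To[int32](10),
	},
}
```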

#### Workload

Add a new field, "requeueState", to the Workload to allow recording the following items:

1. the number of times a workload is re-queued
2. when the workload was re-queued or will be re-queued

```go
type WorkloadStatus struct {
	...
	// requeueState holds the re-queue state
	// when a workload is evicted with the PodsReadyTimeout reason.
	//
	// +optional
	RequeueState *RequeueState `json:"requeueState,omitempty"`
}

type RequeueState struct {
	// count records the number of times a workload has been re-queued.
	// When a deactivated (`.spec.active`=`false`) workload is reactivated (`.spec.active`=`true`),
	// this count is reset to null.
	//
	// +optional
	// +kubebuilder:validation:Minimum=0
	Count *int32 `json:"count,omitempty"`

	// requeueAt records the time when a workload will be re-queued.
	// When a deactivated (`.spec.active`=`false`) workload is reactivated (`.spec.active`=`true`),
	// this time is reset to null.
	//
	// +optional
	RequeueAt *metav1.Time `json:"requeueAt,omitempty"`
}
```
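
To make the reset semantics concrete, here is a minimal sketch with a hypothetical helper name (not the actual controller code):

```go
// resetRequeueStateOnReactivation illustrates the reset rule above: when a
// deactivated workload (.spec.active=false) is reactivated
// (.spec.active=true), count and requeueAt are reset to null.
func resetRequeueStateOnReactivation(wasActive, isActive bool, status *WorkloadStatus) {
	if !wasActive && isActive {
		status.RequeueState = nil
	}
}
```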

### Changes to Queue Sorting

#### Existing Sorting

Currently, workloads within a ClusterQueue are sorted based on 1. priority and 2. the timestamp of eviction, if evicted, otherwise the time of creation.

#### Proposed Sorting

The `pkg/workload` package could be modified to include a conditional (`if evictionReason == kueue.WorkloadEvictedByPodsReadyTimeout`)
that controls which timestamp to return based on the configured ordering strategy.
The same sorting logic would also be used when sorting the heads of queues.

Update the `apis/config/<version>` package to include `Creation` and `Eviction` constants.

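A hedged sketch of that conditional (the helper name and exact import paths are assumptions, not the final implementation):

```go
import (
	apimeta "k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	config "sigs.k8s.io/kueue/apis/config/v1beta1"
	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// queueOrderTimestamp returns the timestamp used for sorting: the eviction
// time when the strategy is `Eviction` and the workload was evicted with the
// PodsReadyTimeout reason, otherwise the creation timestamp.
func queueOrderTimestamp(wl *kueue.Workload, strategy config.RequeuingTimestamp) *metav1.Time {
	if strategy == config.EvictionTimestamp {
		cond := apimeta.FindStatusCondition(wl.Status.Conditions, kueue.WorkloadEvicted)
		if cond != nil && cond.Status == metav1.ConditionTrue && cond.Reason == kueue.WorkloadEvictedByPodsReadyTimeout {
			return &cond.LastTransitionTime
		}
	}
	return &wl.CreationTimestamp
}
```
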
### Exponential Backoff Mechanism

When the kueueConfig `backoffLimitCount` is set and workloads are evicted by waitForPodsReady,
the queueManager holds the evicted workloads as inadmissible workloads for an exponential backoff duration.
During this time, other workloads will have a chance to be admitted.

The queueManager calculates an exponential backoff duration by [the Step function](https://pkg.go.dev/k8s.io/apimachinery/pkg/util/wait@v0.29.1#Backoff.Step)
according to $1.41284738^{(n-1)}+Rand$ seconds, where $n$ represents the `workloadStatus.requeueState.count` and $Rand$ represents the random jitter.

Considering the `.waitForPodsReady.timeout` (default: 300 seconds),
an evicted workload with the `PodsReadyTimeout` reason keeps being re-queued
for the following total period, where $t$ represents `.waitForPodsReady.timeout`:

$$t(n+1) + \sum_{k=1}^{n}(1.41284738^{(k-1)} + Rand)$$

Given that `backoffLimitCount` equals `30` and `waitForPodsReady.timeout` equals `300` (the default),
the result equals 24 hours (+ $Rand$ seconds).

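To make the arithmetic concrete, here is a minimal, hedged sketch of computing this duration with `wait.Backoff`; the base duration, jitter value, and helper name are assumptions, not the actual Kueue implementation:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// requeueDelay sketches how the backoff duration for the n-th requeue could be
// derived, with n = .status.requeueState.count. Each call to Step consumes one
// step and returns a jittered duration, so the n-th call yields roughly
// 1.41284738^(n-1) seconds plus jitter.
func requeueDelay(count int32) time.Duration {
	b := wait.Backoff{
		Duration: 1 * time.Second, // assumed base: 1.41284738^0 = 1s on the first retry
		Factor:   1.41284738,
		Jitter:   0.0001, // the "Rand" term in the formula above
		Steps:    int(count),
	}
	var delay time.Duration
	for i := int32(0); i < count; i++ {
		delay = b.Step()
	}
	return delay
}

func main() {
	for _, n := range []int32{1, 5, 10, 20, 30} {
		fmt.Printf("retry %2d: backoff ~ %v\n", n, requeueDelay(n))
	}
}
```
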
#### Evaluation

When a workload eviction is issued with the `PodsReadyTimeout` condition,
the workload controller increments `.status.requeueState.count` by 1 and
sets `.status.requeueState.requeueAt` to the time when the workload will be re-queued.

If a workload's `.status.requeueState.count` reaches the kueueConfig `.waitForPodsReady.requeuingStrategy.backoffLimitCount`,
the workload controller doesn't modify `.status.requeueState` and deactivates the workload by setting `.spec.active` to false.

After that, the jobframework reconciler adds an `Evicted` condition with the `WorkloadInactive` reason to the workload.
Finally, the jobframework reconciler stops the job in the next reconcile.

Additionally, when a workload that was deactivated by eviction is re-activated, the `requeueState` is reset to null.
If a workload is deactivated in other ways, such as by user operations, the `requeueState` is not reset.

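Putting this flow together, here is a minimal sketch with simplified local types and hypothetical helper names (not the actual controller code); `requeueDelay` refers to the backoff sketch above:

```go
// requeueState is a simplified stand-in for the Workload .status.requeueState field.
type requeueState struct {
	count     int32
	requeueAt time.Time
}

// onPodsReadyTimeoutEviction increments the requeue count and schedules the
// next requeue, or deactivates the workload (sets .spec.active to false) once
// the count has reached backoffLimitCount, leaving requeueState unmodified.
func onPodsReadyTimeoutEviction(rs *requeueState, active *bool, backoffLimitCount *int32, now time.Time) {
	if backoffLimitCount != nil && rs.count >= *backoffLimitCount {
		*active = false
		return
	}
	rs.count++
	rs.requeueAt = now.Add(requeueDelay(rs.count))
}
```
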
### Test Plan

[X] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

##### Prerequisite testing updates

<!--
Based on reviewers feedback describe what additional tests need to be added prior
implementing this enhancement to ensure the enhancements have also solid foundations.
-->

#### Unit Tests

Most of the test coverage should probably live inside `pkg/queue`. Additional test cases should be added that exercise the different requeuing configurations.

- `pkg/queue`: `Nov 2 2023` - `33.9%`

#### Integration tests

- Add an integration test that matches user story 1.
- Add an integration test to detect flapping, i.e., preempted workloads being readmitted before the preemptor workload, when `requeuingTimestamp: Creation` is set.

### Graduation Criteria

<!--

Clearly define what it means for the feature to be implemented and
considered stable.

If the feature you are introducing has high complexity, consider adding graduation
milestones with these graduation criteria:
- [Maturity levels (`alpha`, `beta`, `stable`)][maturity-levels]
- [Feature gate][feature gate] lifecycle
- [Deprecation policy][deprecation-policy]

[feature gate]: https://git.k8s.io/community/contributors/devel/sig-architecture/feature-gates.md
[maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions
[deprecation-policy]: https://kubernetes.io/docs/reference/using-api/deprecation-policy/
-->

## Implementation History

- Jan 18th: Implemented the re-queue strategy for workloads evicted due to pods-ready (story 1) [#1311](https://github.com/kubernetes-sigs/kueue/pull/1311)
- Feb 12th: Implemented the re-queueing backoff mechanism triggered by eviction with the PodsReadyTimeout reason (stories 2 and 3) [#1709](https://github.com/kubernetes-sigs/kueue/pull/1709)

<!--
Major milestones in the lifecycle of a KEP should be tracked in this section.
Major milestones might include:
- the `Summary` and `Motivation` sections being merged, signaling SIG acceptance
- the `Proposal` section being merged, signaling agreement on a proposed design
- the date implementation started
- the first Kubernetes release where an initial version of the KEP was available
- the version of Kubernetes where the KEP graduated to general availability
- when the KEP was retired or superseded
-->

## Drawbacks

* When used with `StrictFIFO`, the `requeuingStrategy.timestamp: Creation` (front of queue) policy could lead to a blocked queue. This was called out in the issue that set the hardcoded [back-of-queue behavior](https://github.com/kubernetes-sigs/kueue/issues/599).
  This could be mitigated by recommending that administrators select `BestEffortFIFO` when using this setting.
* Pods that never become ready due to invalid images will constantly be requeued to the front of the queue when the creation timestamp is used. [See Kubernetes issue](https://github.com/kubernetes/kubernetes/issues/122300).

## Alternatives

### Create "FrontOfQueue" and "BackOfQueue"

The same concepts could be exposed to users as `FrontOfQueue` and `BackOfQueue` settings instead of `Creation` and `Eviction` timestamps.
These terms would imply that the workload is prioritized over higher-priority workloads in the queue.
This is probably not desired (it would likely lead to rapid preemption upon admission when priority-based preemption is enabled).

### Configure at the ClusterQueue level

These concepts could be configured in the ClusterQueue resource. This alternative would increase flexibility.
Without a clear need for this level of granularity, it might be better to set these options at the controller level where `waitForPodsReady` settings already exist.
Furthermore, configuring these settings at the ClusterQueue level introduces the question of what timestamp to use when sorting the heads of all ClusterQueues.

### Add a knob to set a timeout until the workload is deactivated

With the `backoffLimitCount` knob alone, it is difficult to estimate how long jobs will actually be retried (requeued).
So, it might be useful to add a knob that sets a timeout until the workload is deactivated.
For the first iteration, we don't add this knob, since `backoffLimitCount` alone is enough for the current stories.

```go
type RequeuingStrategy struct {
	...
	// backoffLimitTimeout defines the time for a workload that
	// has once been admitted to reach the PodsReady=true condition.
	// When the time is reached, the workload is deactivated.
	//
	// Defaults to null.
	// +optional
	BackoffLimitTimeout *int32 `json:"backoffLimitTimeout,omitempty"`
}
```

#### Evaluation

When a workload's elapsed time, $currentTime - queueOrderingTimestamp$, reaches the kueueConfig `waitForPodsReady.requeuingStrategy.backoffLimitTimeout`,
the workload controller and the queueManager set `.spec.active` to false.
After that, the jobframework reconciler deactivates the workload.

Before the jobframework reconciler deactivates a workload,
the workload controller sets `.spec.active` to false after the workload reconciler checks whether the workload is finished.
In addition, when the kueue scheduler gets the head workloads from the clusterQueues,
the queueManager sets `.spec.active` to false on any workloads that exceed `backoffLimitTimeout`.
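
A minimal sketch of that check (helper name assumed, and assuming `backoffLimitTimeout` is expressed in seconds; this alternative is not implemented):

```go
// exceededBackoffLimitTimeout reports whether a workload has been waiting
// longer than the configured backoffLimitTimeout since its queue ordering
// timestamp, in which case it would be deactivated (.spec.active=false).
func exceededBackoffLimitTimeout(queueOrderingTimestamp time.Time, backoffLimitTimeout *int32, now time.Time) bool {
	if backoffLimitTimeout == nil {
		return false // null means no timeout-based deactivation
	}
	return now.Sub(queueOrderingTimestamp) >= time.Duration(*backoffLimitTimeout)*time.Second
}
```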