     1  # KEP-78:  Dynamically reclaim resources of Pods of a Workload
     2  
     3  ## Table of Contents
     4  
     5  <!-- toc -->
     6  - [Summary](#summary)
     7  - [Motivation](#motivation)
     8    - [Goals](#goals)
     9    - [Non-Goals](#non-goals)
    10  - [Proposal](#proposal)
    11    - [Pod Completion Accounting](#pod-completion-accounting)
    12      - [Reference design for (<code>batch/Job</code>)](#reference-design-for-)
    13        - [To consider](#to-consider)
    14    - [API](#api)
    15  - [Implementation](#implementation)
    16    - [Workload](#workload)
    17      - [API](#api-1)
    18      - [<code>pkg/workload</code>](#)
    19    - [Jobframework](#jobframework)
    20    - [Batch/Job](#batchjob)
    21  - [Testing Plan](#testing-plan)
    22    - [NonRegression](#nonregression)
    23    - [Unit Tests](#unit-tests)
    24    - [Integration tests](#integration-tests)
    25  - [Implementation History](#implementation-history)
    26  <!-- /toc -->
    27  
    28  ## Summary
    29  
    30  This proposal allows Kueue to reclaim the resources of a successfully completed Pod of a Workload even before the whole Workload completes its execution. Freeing the unused resources of a Workload earlier may allow more pending Workloads to be admitted. This will allow Kueue to utilize the existing resources of the cluster much more efficiently.
    31  
    32  In the remainder of this document:
    33  1. *Job* refers to any kind of job supported by Kueue, including `batch/Job`, `MPIJob`, `RayJob`, etc.
    34  2. *Pods of a Workload* means the same as *Pods of a Job* and vice-versa.
    35  
    36  
    37  ## Motivation
    38  
    39  Currently, the quota assigned to a Job is reclaimed by Kueue only when the whole Job finishes.
    40  For Jobs with multiple Pods, the resources will be reclaimed only after the last Pod of the Job finishes.
    41  This is inefficient because a Job's needs can change during execution. For instance, the needs of a `batch/Job` whose `parallelism` equals its `completions` decrease with every Pod that finishes its execution.
    42  
    43  ### Goals
    44  
    45  - Utilize the unused resources of the successfully completed Pods of a running Job.
    46  
    47  ### Non-Goals
    48  
    49  - Preempting a Workload or Pods of a Workload to free resources for a pending Workload.
    50  - Partially admitting a Workload.
    51  - Monitoring the execution of a Job's Pods.
    52  
    53  ## Proposal
    54  
    55  Reclaim the resources of the succeeded Pods of a running Job as soon as each Pod completes its execution.
    56  
    57  We propose to add a new field `.status.ReclaimablePods` to the Workload API. `.status.ReclaimablePods` is a list that holds, for each PodSet, the count of Pods whose resources are no longer needed and can be reclaimed by Kueue.
    58  
    59  ### Pod Completion Accounting
    60  
    61  Since actively monitoring the execution of a Job's Pods is outside the scope of the core Kueue implementation, the number of Pods whose resources are no longer needed should be reported, for each PodSet, by each framework-specific `GenericJob` implementation.
    62  
    63  For this purpose, the `GenericJob` interface should be extended with an additional method that reports this information:
    64  
    65  ```go
    66  type GenericJob interface {
    67      // ...
    68  
    69      // Get reclaimable pods.
    70      ReclaimablePods() []ReclaimablePod
    71  }
    72  ```
    73  
    74  #### Reference design for (`batch/Job`)
    75  
    76  For a Job defined with `parallelism` **P** and `completions` **C**, of which **n** Pod executions have completed successfully,
    77  the expected `reclaimablePods` is:
    78  
    79  ```go
    80  []ReclaimablePod{
    81      {
    82          Name: "main",
    83          Count: P - min(P,(C-n)),
    84      },
    85  }
    86  ```
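
The formula above can be exercised with a small, self-contained sketch. The helper name `reclaimableForBatchJob` and the clamping of the remaining completions at zero are assumptions made for illustration; they are not part of Kueue or of this proposal's API.

```go
package main

import "fmt"

// ReclaimablePod mirrors the type proposed in the API section of this KEP.
type ReclaimablePod struct {
	Name  string
	Count int32
}

// reclaimableForBatchJob is a hypothetical helper (not part of Kueue) that
// evaluates Count = P - min(P, C-n) for the single "main" PodSet.
func reclaimableForBatchJob(parallelism, completions, succeeded int32) []ReclaimablePod {
	// Completions still to be run. Clamping at zero is an assumption that
	// guards against succeeded > completions.
	remaining := completions - succeeded
	if remaining < 0 {
		remaining = 0
	}
	// Pods that may still need to run concurrently: min(P, C-n).
	stillNeeded := parallelism
	if remaining < stillNeeded {
		stillNeeded = remaining
	}
	return []ReclaimablePod{{Name: "main", Count: parallelism - stillNeeded}}
}

func main() {
	// Example Job: parallelism = 3, completions = 5.
	for n := int32(0); n <= 5; n++ {
		fmt.Printf("succeeded=%d reclaimable=%v\n", n, reclaimableForBatchJob(3, 5, n))
	}
}
```

With `parallelism` 3 and `completions` 5, nothing is reclaimable until fewer than 3 completions remain; after the third success the count becomes 1, and it reaches 3 once all 5 completions have succeeded.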
    87  
    88  ##### To consider
    89  According to [kubernetes/enhancements](https://github.com/kubernetes/enhancements), the algorithm presented above might need to be reworked in order to account for:
    90  - [KEP-3939](https://github.com/kubernetes/enhancements/pull/3940), which adds a new field `terminating` to account for terminating pods. Depending on `spec.RecreatePodsWhen`, the new field needs to be taken into account when reclaimablePods are computed.
    91  - [KEP-3850](https://github.com/kubernetes/enhancements/pull/3967) which adds the ability for an index to fail. If an index fails, the resources previously reserved for it are no longer needed.
    92  
    93  ### API
    94  
    95  A new field `ReclaimablePods` is added to the `.status` of Workload API.
    96  
    97  ```go
    98  // WorkloadStatus defines the observed state of Workload.
    99  type WorkloadStatus struct {
   100  
   101      // ...
   102  
   103  
   104      // reclaimablePods keeps track of the number of pods within a podset for which
   105      // the resource reservation is no longer needed.
   106      // +optional
   107      ReclaimablePods []ReclaimablePod `json:"reclaimablePods,omitempty"`
   108  }
   109  
   110  type ReclaimablePod struct {
   111      // name is the PodSet name.
   112      Name string `json:"name"`
   113  
   114      // count is the number of pods for which the requested resources are no longer needed.
   115      Count int32 `json:"count"`
   116  }
   117  ```
   118  
   119  ## Implementation
   120  
   121  ### Workload
   122  #### API
   123  
   124  - Add the new field in the workload's status.
   125  - Validate the data in `status.ReclaimablePods`:
   126    1. The names must be found in the `PodSets`.
   127    2. The count should never exceed the count of the corresponding PodSet.
   128    3. The count should not decrease if the workload is admitted.
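
The three validation rules above could be sketched as follows. This is a simplified illustration, not Kueue's webhook code: the function `validateReclaimablePods`, the trimmed-down `PodSet` type, and the plain `[]error` return value are assumptions, as is the choice to treat an entry missing from the update as a count of zero.

```go
package main

import "fmt"

// PodSet is a trimmed-down stand-in for the Workload PodSet (assumption).
type PodSet struct {
	Name  string
	Count int32
}

// ReclaimablePod mirrors the type proposed in the API section.
type ReclaimablePod struct {
	Name  string
	Count int32
}

// validateReclaimablePods is a hypothetical helper applying the three rules.
func validateReclaimablePods(podSets []PodSet, old, updated []ReclaimablePod, admitted bool) []error {
	var errs []error
	limits := make(map[string]int32, len(podSets))
	for _, ps := range podSets {
		limits[ps.Name] = ps.Count
	}
	newCounts := make(map[string]int32, len(updated))
	for _, rp := range updated {
		limit, found := limits[rp.Name]
		if !found {
			// Rule 1: the name must refer to an existing PodSet.
			errs = append(errs, fmt.Errorf("reclaimablePods[%q]: no PodSet with this name", rp.Name))
			continue
		}
		if rp.Count > limit {
			// Rule 2: the count must not exceed the PodSet count.
			errs = append(errs, fmt.Errorf("reclaimablePods[%q]: count %d exceeds PodSet count %d", rp.Name, rp.Count, limit))
		}
		newCounts[rp.Name] = rp.Count
	}
	if admitted {
		for _, rp := range old {
			// Rule 3: counts cannot decrease while admitted. A missing
			// entry is treated as zero (assumption).
			if newCounts[rp.Name] < rp.Count {
				errs = append(errs, fmt.Errorf("reclaimablePods[%q]: count cannot decrease while admitted", rp.Name))
			}
		}
	}
	return errs
}

func main() {
	podSets := []PodSet{{Name: "main", Count: 3}}
	old := []ReclaimablePod{{Name: "main", Count: 2}}
	updated := []ReclaimablePod{{Name: "main", Count: 1}}
	for _, err := range validateReclaimablePods(podSets, old, updated, true) {
		fmt.Println(err)
	}
}
```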
   129  
   130  
   131  #### `pkg/workload`
   132  
   133  Rework the way `Info.TotalRequests` is computed in order to take the `ReclaimablePods` into account.
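
One way to picture this rework is to subtract the reclaimable count from each PodSet's count before scaling the per-Pod requests. The sketch below uses simplified stand-in types: `PodSet`, `PodSetResources`, the `totalRequests` function, and the integer resource quantities are all assumptions, and the real `Info.TotalRequests` structure is not reproduced here.

```go
package main

import "fmt"

// PodSet is a simplified stand-in (assumption): PerPod maps a resource name
// to the per-Pod request, e.g. CPU in millicores.
type PodSet struct {
	Name   string
	Count  int32
	PerPod map[string]int64
}

// ReclaimablePod mirrors the type proposed in the API section.
type ReclaimablePod struct {
	Name  string
	Count int32
}

// PodSetResources is a hypothetical result type holding the effective Pod
// count and the total requests for one PodSet.
type PodSetResources struct {
	Name     string
	Count    int32
	Requests map[string]int64
}

// totalRequests scales per-Pod requests by (PodSet count - reclaimable count).
func totalRequests(podSets []PodSet, reclaimable []ReclaimablePod) []PodSetResources {
	reclaimed := make(map[string]int32, len(reclaimable))
	for _, rp := range reclaimable {
		reclaimed[rp.Name] = rp.Count
	}
	out := make([]PodSetResources, 0, len(podSets))
	for _, ps := range podSets {
		count := ps.Count - reclaimed[ps.Name]
		if count < 0 {
			// Validation should prevent this; clamp defensively (assumption).
			count = 0
		}
		reqs := make(map[string]int64, len(ps.PerPod))
		for name, quantity := range ps.PerPod {
			reqs[name] = quantity * int64(count)
		}
		out = append(out, PodSetResources{Name: ps.Name, Count: count, Requests: reqs})
	}
	return out
}

func main() {
	podSets := []PodSet{{Name: "main", Count: 4, PerPod: map[string]int64{"cpu": 500}}}
	reclaimable := []ReclaimablePod{{Name: "main", Count: 1}}
	fmt.Println(totalRequests(podSets, reclaimable))
}
```

A Workload with no `ReclaimablePods` populated yields the same totals as before, since the subtracted count is zero for every PodSet.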
   134  
   135  ### Jobframework
   136  
   137  Adapt the `GenericJob` interface, and ensure that the `ReclaimablePods` information it provides is synced with its associated workload status.
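
A minimal sketch of such a sync step is shown below, operating on in-memory values only. The `Workload` struct, the `staticJob` type, and the helpers `reclaimablePodsEqual` and `syncReclaimablePods` are hypothetical; a real controller would persist the change through a status update against the API server, which is omitted here. PodSet names are assumed to be unique within a list.

```go
package main

import "fmt"

// ReclaimablePod mirrors the type proposed in the API section.
type ReclaimablePod struct {
	Name  string
	Count int32
}

// GenericJob shows only the method added by this proposal.
type GenericJob interface {
	ReclaimablePods() []ReclaimablePod
}

// Workload is a minimal stand-in for the Workload object (assumption).
type Workload struct {
	ReclaimablePods []ReclaimablePod
}

// reclaimablePodsEqual compares two lists regardless of order, assuming
// PodSet names are unique within each list.
func reclaimablePodsEqual(a, b []ReclaimablePod) bool {
	if len(a) != len(b) {
		return false
	}
	counts := make(map[string]int32, len(a))
	for _, rp := range a {
		counts[rp.Name] = rp.Count
	}
	for _, rp := range b {
		if c, ok := counts[rp.Name]; !ok || c != rp.Count {
			return false
		}
	}
	return true
}

// syncReclaimablePods copies the job's report into the workload status and
// returns true when the status changed and would need to be persisted.
func syncReclaimablePods(job GenericJob, wl *Workload) bool {
	reported := job.ReclaimablePods()
	if reclaimablePodsEqual(reported, wl.ReclaimablePods) {
		return false
	}
	wl.ReclaimablePods = reported
	return true
}

// staticJob is a hypothetical GenericJob returning a fixed report.
type staticJob []ReclaimablePod

func (j staticJob) ReclaimablePods() []ReclaimablePod { return j }

func main() {
	wl := &Workload{}
	job := staticJob{{Name: "main", Count: 1}}
	fmt.Println(syncReclaimablePods(job, wl)) // first sync changes the status
	fmt.Println(syncReclaimablePods(job, wl)) // second sync is a no-op
}
```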
   138  
   139  ### Batch/Job
   140  
   141  Adapt its `GenericJob` implementation to the new interface.
   142  
   143  
   144  ## Testing Plan
   145  
   146  ### NonRegression
   147  The new implementation should not impact any of the existing unit, integration or e2e tests. A workload that has no `ReclaimablePods` populated should behave the same as it does prior to this implementation.
   148  
   149  ### Unit Tests
   150  
   151  All of Kueue's core components must be covered by unit tests.
   152  
   153  ### Integration tests
   154  * Scheduler
   155    - Checking if a Workload gets admitted when an admitted Workload releases a part of its assigned resources.
   156  
   157  * Kueue Job Controller (Optional)
   158    - Checking that the resources owned by a Job are released to the cache and clusterQueue when a Pod of the Job succeeds.
   159  
   160  ## Implementation History
   161  
   162  Dynamically reclaiming resources is tracked as part of [enhancement#78](https://github.com/kubernetes-sigs/kueue/issues/78).