# KEP-1224: Introducing lendingLimit to help reserve guaranteed resources

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories (Optional)](#user-stories-optional)
    - [Story 1](#story-1)
  - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Kueue LendingLimit API](#kueue-lendinglimit-api)
    - [Note](#note)
  - [Test Plan](#test-plan)
      - [Prerequisite testing updates](#prerequisite-testing-updates)
    - [Unit Tests](#unit-tests)
    - [Integration tests](#integration-tests)
  - [Graduation Criteria](#graduation-criteria)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
<!-- /toc -->

## Summary

Under the current implementation, a ClusterQueue's resources can be borrowed completely by other ClusterQueues in the same cohort. This improves resource utilization to some extent, but users sometimes want to reserve some resources for private use only.

This proposal provides a guarantee mechanism to solve this problem: users can reserve a portion of resource quota that will never be borrowed by other ClusterQueues in the same cohort.

## Motivation

Sometimes we want to keep some resources for guaranteed usage, so that when new jobs enter the queue, they can be admitted immediately.

Under the current implementation, `BorrowingLimit` defines the maximum amount of quota that a ClusterQueue is allowed to borrow. But borrowing may cause another ClusterQueue in the same cohort to run out of resources.

Even if `Preemption` is configured, reclaiming the borrowed resources takes time and incurs unnecessary cost.

So, both to serve resource requests and for security reasons, we need a reservation design: `LendingLimit`. It declares the quota that a ClusterQueue is allowed to lend, and thereby reserves the remaining resources so that they will never be borrowed.

### Goals

- Implement `LendingLimit`, so that users can reserve guaranteed resources by setting `LendingLimit`.

### Non-Goals

- Replace `BorrowingLimit` to some extent in the future.

## Proposal

This proposal defines `LendingLimit`. A `ClusterQueue` can lend at most the specified quota to other ClusterQueues in the same cohort.

![Semantics](LendingLimit.png "Semantics of lendingLimit")

### User Stories (Optional)

#### Story 1

To ensure full utilization of resources, we generally set `BorrowingLimit` to the maximum. But this may cause a `ClusterQueue` to have all of its resources borrowed, so that new incoming jobs are slow to start because preemption is slow. This can be worse in a competitive cluster, where jobs borrow resources and are reclaimed over and over.

So we want to reserve some resources for a `ClusterQueue`, so that incoming jobs in that `ClusterQueue` can be admitted immediately.

### Notes/Constraints/Caveats (Optional)

With both `BorrowingLimit` and `LendingLimit` configured, a ClusterQueue may not be able to borrow up to its borrowing limit, because part of the quota in the cohort is reserved through lending limits.

### Risks and Mitigations

None.

## Design Details

### Kueue LendingLimit API

Modify the `ResourceQuota` API object:

```go
type ResourceQuota struct {
	// [...]

	// lendingLimit is the maximum amount of unused quota for the [flavor, resource]
	// combination that this ClusterQueue can lend to other ClusterQueues in the same cohort.
	// In total, at a given time, ClusterQueue reserves for its exclusive use
	// a quantity of quota equal to nominalQuota - lendingLimit.
	// If null, it means that there is no lending limit.
	// If not null, it must be non-negative.
	// lendingLimit must be null if spec.cohort is empty.
	// +optional
	LendingLimit *resource.Quantity `json:"lendingLimit,omitempty"`
}
```

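As a rough illustration of the rules in the field comment, the following sketch validates a lending limit and computes the reserved quota. It is not the Kueue implementation: the `quota` type, `validateLendingLimit`, and `reservedQuota` are hypothetical names, and plain integers stand in for `resource.Quantity` to keep the example self-contained. The upper bound (`lendingLimit <= nominalQuota`) is taken from the integration test plan rather than from the field comment.

```go
package main

import (
	"errors"
	"fmt"
)

// quota is a simplified, hypothetical stand-in for ResourceQuota that uses
// plain integers instead of resource.Quantity. A nil LendingLimit means
// "no lending limit".
type quota struct {
	NominalQuota int64
	LendingLimit *int64
}

// validateLendingLimit applies the rules from the field comment, plus the
// upper bound exercised in the integration tests (lendingLimit <= nominalQuota).
func validateLendingLimit(q quota, cohort string) error {
	if q.LendingLimit == nil {
		return nil
	}
	if cohort == "" {
		return errors.New("lendingLimit must be null if cohort is empty")
	}
	if *q.LendingLimit < 0 {
		return errors.New("lendingLimit must be non-negative")
	}
	if *q.LendingLimit > q.NominalQuota {
		return errors.New("lendingLimit must not exceed nominalQuota")
	}
	return nil
}

// reservedQuota returns nominalQuota - lendingLimit, the amount kept for the
// ClusterQueue's exclusive use. With no lending limit, nothing is reserved.
func reservedQuota(q quota) int64 {
	if q.LendingLimit == nil {
		return 0
	}
	return q.NominalQuota - *q.LendingLimit
}

func main() {
	limit := int64(4)
	q := quota{NominalQuota: 10, LendingLimit: &limit}
	fmt.Println(validateLendingLimit(q, "cohort-a"), reservedQuota(q))
}
```

With a nominal quota of 10 and a lending limit of 4, the ClusterQueue in this sketch keeps 6 units for its exclusive use.
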
#### Note

We considered adding the following status field, but discarded it. Because unused resources from multiple CQs compose a single pool of shareable resources, we cannot precisely calculate this value.

So there is no notion of A borrowing from B: A borrows from the pooled unused resources of B, C, and any other CQs in the cohort.

```go
type ResourceUsage struct {
	// [...]

	// Lended is a quantity of quota that is lent to other ClusterQueues in the cohort.
	Lended resource.Quantity `json:"lended,omitempty"`
}
```

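To illustrate why a per-ClusterQueue lent amount is not well defined, the following sketch enumerates every way a borrowed amount could be attributed to two lenders. The `splits` function and the numbers are hypothetical and exist only for this example; they do not correspond to anything in Kueue.

```go
package main

import "fmt"

// splits lists every way to attribute `borrowed` units to two lenders whose
// unused quota is unusedB and unusedC. Each entry is {fromB, fromC}.
func splits(borrowed, unusedB, unusedC int64) [][2]int64 {
	var out [][2]int64
	for fromB := int64(0); fromB <= unusedB && fromB <= borrowed; fromB++ {
		fromC := borrowed - fromB
		if fromC <= unusedC {
			out = append(out, [2]int64{fromB, fromC})
		}
	}
	return out
}

func main() {
	// cq-a borrows 6 units from a pool formed by 5 unused units in cq-b
	// and 5 unused units in cq-c.
	for _, s := range splits(6, 5, 5) {
		fmt.Printf("from cq-b: %d, from cq-c: %d\n", s[0], s[1])
	}
}
```

Several attributions are equally valid in this example, so no single value of `Lended` for cq-b or cq-c can be singled out.
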
### Test Plan

[x] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

##### Prerequisite testing updates

None.

#### Unit Tests

- `pkg/cache`: `2023-11-15` - `86.5%`
- `pkg/controller/core/`: `2023-11-15` - `16.5%`
- `pkg/metrics/`: `2023-11-15` - `45.7%`
- `pkg/scheduler/`: `2023-11-15` - `80.6%`
- `pkg/scheduler/flavorassigner`: `2023-11-15` - `80.5%`
- `pkg/scheduler/preemption`: `2023-11-15` - `94.0%`

#### Integration tests

<!--
Describe what tests will be added to ensure proper quality of the enhancement.

After the implementation PR is merged, add the names of the tests here.
-->

- No new workloads can be admitted when `LendingLimit` is greater than `NominalQuota` or less than `0`.
- In a cohort with 2 ClusterQueues cq-a, cq-b and a single ResourceFlavor:
  - When cq-b's LendingLimit is set:
    - When cq-a's BorrowingLimit is unset, cq-a can borrow as much as `cq-b's LendingLimit`.
    - When cq-a's BorrowingLimit is set, cq-a can borrow as much as `min(cq-b's LendingLimit, cq-a's BorrowingLimit)`.
- In a cohort with 3 ClusterQueues cq-a, cq-b, cq-c and a single ResourceFlavor:
  - When cq-b's LendingLimit is set and cq-c's LendingLimit is unset:
    - When cq-a's BorrowingLimit is unset, cq-a can borrow as much as `(cq-b's LendingLimit + cq-c's NominalQuota)`.
    - When cq-a's BorrowingLimit is set, cq-a can borrow as much as `min((cq-b's LendingLimit + cq-c's NominalQuota), cq-a's BorrowingLimit)`.
  - When cq-b's and cq-c's LendingLimit are both set:
    - When cq-a's BorrowingLimit is unset, cq-a can borrow as much as `(cq-b's LendingLimit + cq-c's LendingLimit)`.
    - When cq-a's BorrowingLimit is set, cq-a can borrow as much as `min((cq-b's LendingLimit + cq-c's LendingLimit), cq-a's BorrowingLimit)`.
- In a cohort with 2 ClusterQueues cq-a, cq-b and 2 ResourceFlavors rf-a, rf-b:
  - In cq-b, when rf-a's LendingLimit is set, and cq-a's FlavorFungibility is set to `whenCanBorrow: Borrow`:
    - In cq-a, when rf-a's BorrowingLimit is unset, cq-a can borrow as much as `rf-a's LendingLimit`.
    - In cq-a, when rf-a's BorrowingLimit is set, cq-a can borrow as much as `min(rf-a's LendingLimit, rf-a's BorrowingLimit)`.

We do not consider the case **where cq-a's FlavorFungibility is set to `whenCanBorrow: TryNextFlavor`**, since borrowing will not happen in that case.

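The single-flavor scenarios above all follow one formula, which can be sketched as below. This is a simplified model and not the Kueue scheduler: the `cq` type, `lendable`, `maxBorrowable`, and `ptr` are hypothetical names, quantities are plain integers, every lender is assumed to be idle, and each lending limit is assumed not to exceed the nominal quota.

```go
package main

import "fmt"

// cq is a simplified, hypothetical model of one ClusterQueue's quota for a
// single [flavor, resource] pair. Nil limits mean "unset".
type cq struct {
	NominalQuota   int64
	LendingLimit   *int64
	BorrowingLimit *int64
}

// lendable returns how much of an idle ClusterQueue's quota others may borrow.
func lendable(c cq) int64 {
	if c.LendingLimit == nil {
		return c.NominalQuota
	}
	return *c.LendingLimit
}

// maxBorrowable returns the most that borrower can borrow when every lender
// is idle: the sum of lendable quota, capped by the borrowing limit if set.
func maxBorrowable(borrower cq, lenders ...cq) int64 {
	var pool int64
	for _, l := range lenders {
		pool += lendable(l)
	}
	if borrower.BorrowingLimit != nil && *borrower.BorrowingLimit < pool {
		return *borrower.BorrowingLimit
	}
	return pool
}

// ptr returns a pointer to v, for setting optional limits inline.
func ptr(v int64) *int64 { return &v }

func main() {
	a := cq{NominalQuota: 10}
	b := cq{NominalQuota: 10, LendingLimit: ptr(4)}
	c := cq{NominalQuota: 10}
	fmt.Println(maxBorrowable(a, b))    // cq-b's LendingLimit: 4
	fmt.Println(maxBorrowable(a, b, c)) // 4 + cq-c's NominalQuota: 14
	a.BorrowingLimit = ptr(5)
	fmt.Println(maxBorrowable(a, b, c)) // min(14, 5): 5
}
```

The last call also shows the opposite direction of the caveat noted earlier: with a borrowing limit of 20 instead of 5, cq-a could still borrow only 14, because cq-b reserves 6 of its 10 units.
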
### Graduation Criteria

## Implementation History

## Drawbacks

## Alternatives

- `GuaranteedQuota`, which defines the quota to reserve, is functionally similar to `LendingLimit`, but we chose `LendingLimit` to align with `BorrowingLimit`.