# KEP-1224: Introducing lendingLimit to help reserve guaranteed resources

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories (Optional)](#user-stories-optional)
    - [Story 1](#story-1)
  - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Kueue LendingLimit API](#kueue-lendinglimit-api)
    - [Note](#note)
  - [Test Plan](#test-plan)
      - [Prerequisite testing updates](#prerequisite-testing-updates)
    - [Unit Tests](#unit-tests)
    - [Integration tests](#integration-tests)
  - [Graduation Criteria](#graduation-criteria)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
<!-- /toc -->

## Summary

Under the current implementation, a ClusterQueue's resources can be borrowed completely by other ClusterQueues in the same cohort. This improves resource utilization to some extent, but sometimes users want to reserve some resources for their own use only.

This proposal provides a guarantee mechanism to solve this problem: users can reserve an amount of resource quota that will never be borrowed by other ClusterQueues in the same cohort.

## Motivation

Sometimes we want to keep some resources for guaranteed usage, so that when new jobs enter the queue, they can be admitted immediately.

Under the current implementation, we use `BorrowingLimit` to define the maximum amount of quota that a ClusterQueue is allowed to borrow. But this may cause another ClusterQueue in the same cohort to run out of resources.

Even if we configure `Preemption`, reclaiming the resources still takes time and incurs unnecessary cost.

So, for resource-request and security reasons, we need a reservation mechanism: `LendingLimit`, which declares the quota that may be lent and thereby reserves the remaining resources so that they are never borrowed.

### Goals

- Implement `LendingLimit`, so that users can reserve guaranteed resources by setting the `LendingLimit`.

### Non-Goals

- Replace `BorrowingLimit` to some extent in the future.

## Proposal

This proposal defines `LendingLimit`. A `ClusterQueue` can lend at most the specified quota to other ClusterQueues in the same cohort.

### User Stories (Optional)

#### Story 1

To ensure full utilization of resources, we generally set `BorrowingLimit` to the maximum, but this may cause a `ClusterQueue` to run out of all its resources, and new incoming jobs are then slow to be admitted because preemption is slow. This can be worse in a competitive cluster, where jobs borrow resources and have them reclaimed over and over.

So we want to reserve some resources for a `ClusterQueue`, so that any incoming jobs in the `ClusterQueue` can be admitted immediately.

### Notes/Constraints/Caveats (Optional)

With both BorrowingLimit and LendingLimit configured, a ClusterQueue may not be able to borrow up to its borrowing limit, because the quota reserved through lending limits is not available for borrowing.

### Risks and Mitigations

None.

## Design Details

### Kueue LendingLimit API

Modify the ResourceQuota API object:

```go
type ResourceQuota struct {
	// [...]

	// lendingLimit is the maximum amount of unused quota for the [flavor, resource]
	// combination that this ClusterQueue can lend to other ClusterQueues in the same cohort.
	// In total, at a given time, ClusterQueue reserves for its exclusive use
	// a quantity of quota equal to nominalQuota - lendingLimit.
	// If null, it means that there is no lending limit.
	// If not null, it must be non-negative.
	// lendingLimit must be null if spec.cohort is empty.
	// +optional
	LendingLimit *resource.Quantity `json:"lendingLimit,omitempty"`
}
```

#### Note

We considered adding the status field shown below, but discarded it. Because unused resources from multiple CQs compose a single pool of shareable resources, we cannot precisely calculate this value.

So there is no concept of A borrowing from B: A borrows from the pooled unused resources of B, C and any other CQs in the cohort.

```go
type ResourceUsage struct {
	// [...]

	// Lended is a quantity of quota that is lent to other ClusterQueues in the cohort.
	Lended resource.Quantity `json:"lended,omitempty"`
}
```

### Test Plan

[x] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

##### Prerequisite testing updates

None.

#### Unit Tests

- `pkg/cache`: `2023-11-15` - `86.5%`
- `pkg/controller/core/`: `2023-11-15` - `16.5%`
- `pkg/metrics/`: `2023-11-15` - `45.7%`
- `pkg/scheduler/`: `2023-11-15` - `80.6%`
- `pkg/scheduler/flavorassigner`: `2023-11-15` - `80.5%`
- `pkg/scheduler/preemption`: `2023-11-15` - `94.0%`

#### Integration tests

<!--
Describe what tests will be added to ensure proper quality of the enhancement.

After the implementation PR is merged, add the names of the tests here.
-->

- No new workloads can be admitted when the `LendingLimit` is greater than `NominalQuota` or less than `0`.
- In a cohort with 2 ClusterQueues a, b and a single ResourceFlavor:
  - When cq-b's LendingLimit is set:
    - When cq-a's BorrowingLimit is unset, cq-a can borrow as much as `cq-b's LendingLimit`.
    - When cq-a's BorrowingLimit is set, cq-a can borrow as much as `min(cq-b's LendingLimit, cq-a's BorrowingLimit)`.
- In a cohort with 3 ClusterQueues a, b, c and a single ResourceFlavor:
  - When cq-b's LendingLimit is set and cq-c's LendingLimit is unset:
    - When cq-a's BorrowingLimit is unset, cq-a can borrow as much as `(cq-b's LendingLimit + cq-c's NominalQuota)`.
    - When cq-a's BorrowingLimit is set, cq-a can borrow as much as `min((cq-b's LendingLimit + cq-c's NominalQuota), cq-a's BorrowingLimit)`.
  - When cq-b's and cq-c's LendingLimit are both set:
    - When cq-a's BorrowingLimit is unset, cq-a can borrow as much as `(cq-b's LendingLimit + cq-c's LendingLimit)`.
    - When cq-a's BorrowingLimit is set, cq-a can borrow as much as `min((cq-b's LendingLimit + cq-c's LendingLimit), cq-a's BorrowingLimit)`.
- In a cohort with 2 ClusterQueues cq-a, cq-b and 2 ResourceFlavors rf-a, rf-b:
  - In cq-b, when rf-a's LendingLimit is set, and cq-a's FlavorFungibility is set to `whenCanBorrow: Borrow`:
    - In cq-a, when rf-a's BorrowingLimit is unset, cq-a can borrow as much as `rf-a's LendingLimit`.
    - In cq-a, when rf-a's BorrowingLimit is set, cq-a can borrow as much as `min(rf-a's LendingLimit, rf-a's BorrowingLimit)`.

We will not consider the case **when cq-a's FlavorFungibility is set to `whenCanBorrow: TryNextFlavor`**, since borrowing will not happen in that case.

### Graduation Criteria

## Implementation History

## Drawbacks

## Alternatives

- `GuaranteedQuota`, which defines the quota for reservation, is functionally similar to `LendingLimit`, but to align with `BorrowingLimit`, we chose the latter.
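
For illustration only, the borrowing arithmetic listed under Integration tests can be sketched in Go as follows. This is not Kueue's implementation: the `cqQuota` type and the `lendable`, `maxBorrow` and `ptr` helpers are hypothetical names, plain `int64` values stand in for `resource.Quantity`, and every lending ClusterQueue is assumed to be idle and to have passed validation (`0 <= lendingLimit <= nominalQuota`).

```go
package main

import "fmt"

// cqQuota is a hypothetical, simplified view of one ClusterQueue's quota
// for a single [flavor, resource] combination.
type cqQuota struct {
	nominal        int64
	lendingLimit   *int64 // nil means there is no lending limit
	borrowingLimit *int64 // nil means there is no borrowing limit
}

// ptr returns a pointer to v, for filling the optional fields.
func ptr(v int64) *int64 { return &v }

// lendable returns how much quota an idle ClusterQueue offers to the cohort:
// its lendingLimit if set, otherwise its whole nominal quota.
func lendable(q cqQuota) int64 {
	if q.lendingLimit == nil {
		return q.nominal
	}
	return *q.lendingLimit
}

// maxBorrow returns the most that borrower can borrow from the other
// ClusterQueues in the cohort, assuming all of them are idle.
func maxBorrow(borrower cqQuota, others []cqQuota) int64 {
	var pool int64
	for _, q := range others {
		pool += lendable(q)
	}
	if borrower.borrowingLimit != nil && *borrower.borrowingLimit < pool {
		return *borrower.borrowingLimit
	}
	return pool
}

func main() {
	cqB := cqQuota{nominal: 10, lendingLimit: ptr(4)} // reserves 10 - 4 = 6
	cqC := cqQuota{nominal: 6}                        // lends everything

	// cq-a's BorrowingLimit unset: cq-b's LendingLimit + cq-c's NominalQuota.
	cqA := cqQuota{nominal: 5}
	fmt.Println(maxBorrow(cqA, []cqQuota{cqB, cqC})) // prints 10

	// cq-a's BorrowingLimit set: min of the pool and the BorrowingLimit.
	cqA.borrowingLimit = ptr(7)
	fmt.Println(maxBorrow(cqA, []cqQuota{cqB, cqC})) // prints 7
}
```

In this sketch, cq-b keeps `nominalQuota - lendingLimit = 6` for its exclusive use, which matches the reservation described in the `lendingLimit` field comment.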