# KEP-993: Two Stage Admission Process

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories (Optional)](#user-stories-optional)
    - [Story 1](#story-1)
    - [Story 2](#story-2)
  - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Test Plan](#test-plan)
      - [Prerequisite testing updates](#prerequisite-testing-updates)
    - [Unit Tests](#unit-tests)
    - [Integration tests](#integration-tests)
  - [Graduation Criteria](#graduation-criteria)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
<!-- /toc -->

## Summary

This KEP defines the extension point to plug in arbitrary, additional checks
for workloads before they are admitted for execution. These external checks may
be implemented inside or outside of Kueue and may provide functionality like
budgeting, capacity provisioning or some elements of multicluster dispatching.

## Motivation

With the growth of Kueue, not all desired elements may land inside of the core
Kueue. Some users may not even want to publish their custom admission logic in
a public repository. At the same time, we don’t want to force users to
fork Kueue. Thus, a pluggable mechanism for workload admission needs to
be established.

### Goals

Define a mechanism that allows external controllers:

* to give a green light to admit a workload.
* to temporarily hold workload admission.
* to stop a previously admitted workload and move it back to the
  suspended state.

### Non-Goals

Provide a specific design or implementation for any external controller.

## Proposal

Extend the ClusterQueue definition with a list of external controller names that
need to give a green light to admit a workload. Until all of the controllers that
identify with these names have given a green light, the workload will not
be admitted.

Each workload that goes to such a queue will get additional conditions,
in a dedicated field in the status, each reflecting the go/no-go
decision of one of the external controllers. Once
a controller is confident that the workload can be admitted, it flips the
status of the corresponding condition from "Unknown" to "True".

If a controller changes its mind about a workload, it can switch the condition
back to "False" and the workload will be immediately suspended.

### User Stories (Optional)

#### Story 1

I want to use the [Cluster Autoscaler ProvisioningRequest API](https://github.com/kubernetes/autoscaler/pull/5848)
to ensure that the resources can be provided in full before starting the workload.

#### Story 2

I want to base workload admission on budget. I have the actual consumption metrics
in my Prometheus/Datadog/Stackdriver/whatever monitoring system, and a weekly
budget in a CRD, per namespace. I’m happy to modify some small existing sample
controller, but I don’t want to fork the whole of Kueue to adjust some minor details.

### Notes/Constraints/Caveats (Optional)

Each of the go/no-go decisions is best-effort and can be reversed at any time.

### Risks and Mitigations

## Design Details

TL;DR:
* extend ClusterQueueSpec with the list of checks.
* add extra conditions to Workload status.
* introduce a new CRD, AdmissionCheck.

Details:

We will extend the ClusterQueueSpec definition by adding a reference to
AdmissionChecks that the queue needs to perform before admitting a
workload.
The design of AdmissionChecks is similar to IngressClass:
it allows passing additional parameters, specific to the particular checks
that need to be performed.

```
import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// ClusterQueueSpec defines the desired state of ClusterQueue.
type ClusterQueueSpec struct {
	// [...]

	// List of AdmissionCheck names that need to be passed
	// to admit a workload. The order of the checks doesn't matter,
	// all of them will be started at the same time.
	AdmissionChecks []string `json:"admissionChecks"`
}

// WorkloadStatus defines the observed state of Workload.
type WorkloadStatus struct {
	// [...]

	// Status of admission checks, if there are any specified
	// for the ClusterQueue in which the workload is queued.
	AdmissionChecks []metav1.Condition `json:"admissionChecksConditions,omitempty"`
}

// Condition names used in WorkloadStatus.
const (
	AdmissionPrecheck string = "AdmissionPrecheck"
)

// Cluster-scoped top-level object defining a check that
// can be referenced in ClusterQueue.
type AdmissionCheck struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              AdmissionCheckSpec `json:"spec,omitempty"`
}

// Condition Reasons used by AdmissionChecks.
const (
	// The check cannot pass at this moment, back off (possibly
	// allowing others to try, unblocking quota) and retry. Can be set only
	// when Condition status is set to "False".
	Retry string = "Retry"
	// The check will not pass in the near future. It is not worth
	// retrying. Can be set together with Condition status "False".
	Reject string = "Reject"
	// The check might pass if there were fewer workloads. Proceed
	// with any pending workload preemption. Should be set
	// when condition status is still "Unknown".
	PreemptionRequired string = "PreemptionRequired"
)

// AdmissionCheckSpec defines the desired state of AdmissionCheck.
type AdmissionCheckSpec struct {
	// Name of the controller which will actually perform
	// the checks. This is the name that the controller identifies with,
	// not a K8S pod or deployment name. Cannot be empty.
	ControllerName string `json:"controllerName"`

	// How long to keep the workload suspended
	// after a failed check (after it transitioned to False).
	// After that the check state goes to "Unknown".
	// The default is 15 min.
	RetryDelayMinutes *int64 `json:"retryDelayMinutes,omitempty"`

	// A reference to the additional parameters for the check.
	Parameters *AdmissionCheckParametersReference `json:"parameters,omitempty"`

	// preemptionPolicy determines when to issue preemptions for the Workload,
	// if necessary, in relationship to the status of the admission check.
	// The possible values are:
	// - `Anytime`: No need to wait for this check to pass before issuing preemptions.
	//   Preemptions might be blocked on the preemptionPolicy of other AdmissionChecks.
	// - `AfterCheckPassedOrOnDemand`: Wait for this check to pass before issuing preemptions,
	//   unless this or other checks request preemptions through the Workload's admissionChecks.
	// Defaults to `Anytime`.
	PreemptionPolicy *PreemptionPolicy `json:"preemptionPolicy,omitempty"`
}

// PreemptionPolicy is the type of the preemptionPolicy values below.
type PreemptionPolicy string

const (
	Anytime                    PreemptionPolicy = "Anytime"
	AfterCheckPassedOrOnDemand PreemptionPolicy = "AfterCheckPassedOrOnDemand"
)

// Points to where the parameters for the checks are.
// As ClusterQueues are cluster-scoped, this should be
// a dedicated CRD specific to the controller.
type AdmissionCheckParametersReference struct {
	// APIGroup is the group for the resource being referenced.
	APIGroup string `json:"apiGroup"`

	// Kind is the type of the resource being referenced.
	Kind string `json:"kind"`

	// Name is the name of the resource being referenced.
	Name string `json:"name"`
}
```

For every workload that is put into a ClusterQueue that has AdmissionChecks
configured, Kueue will add:

* "QuotaReserved" set to `False` in Conditions.
* "<checkName>" set to `Unknown` in AdmissionCheckConditions, for each of the AdmissionChecks.

Kueue will perform the very same checks that it does today
before admitting a workload. However, once the basic checks
pass AND there are some AdmissionChecks configured AND the
workload is not in the on-hold retry state from some check, it will:

1. Fill the Admission field in the workload with the desired flavor assignment.
2. Not do any preemptions yet (unless BookCapacity is set to true).
3. Set "QuotaReserved" to true.

Kueue will only pass as many pods into "QuotaReserved" as would
fit in the quota together, assuming that the necessary preemptions will happen.
This slightly violates the BestEffort logic of queues: if there are 1000 tasks
in a queue that don’t pass quota and the 1001st task passes, the best-effort
queue would let it in. A BestEffort queue with 1000 tasks that don’t pass
AdmissionChecks would not let the 1001st task in (it will not reach the
checks). Without this limitation, a large number of tasks could be switched
back and forth between "QuotaReserved" and the suspended state.

Preemptions might happen at this point or later, depending on the `preemptionPolicy` of each
AdmissionCheck. Preemption can happen immediately if all AdmissionChecks have a `preemptionPolicy`
of `Anytime`.
Otherwise, it will happen as soon as:
- An AdmissionCheck requests preemptions using the reason `PreemptionRequired` in the check
  status posted to the Workload's `.status.admissionChecks`, or
- All AdmissionChecks with a `preemptionPolicy` of `AfterCheckPassedOrOnDemand` are reported as
  `True` in the Workload's `.status.admissionChecks`.

Note that all Workloads that have the `QuotaReserved` condition are candidates for preemption. If
any of these Workloads needs to be preempted:
1. The `QuotaReserved` condition is set to `False`.
2. `.status.admissionChecks` is cleared.
3. Controllers for the admission checks should stop any operations started for this Workload.

Once all admission checks are satisfied (set to `True`), Kueue will recheck that the Admission settings
are still valid (check quota/preemptions/reclamation) and admit the workload (doing any pending or
newly identified preemptions, if needed), setting the `Admitted`
condition to True.

If any check is switched to "False" with reason "Reject" or "Retry", the workload is either
completely rejected or switched back to the suspended state (with a retry delay), respectively.
Setting the reason to `PreemptionRequired` (while keeping the condition as "Unknown")
means that some other, external quota is reached
and Kueue should try to preempt some workloads before the check can
succeed.

The controller implementing a particular check should:

* Watch all AdmissionCheck objects to know which ones it should handle.
* Watch all controller-specific parameter objects, potentially referenced from AdmissionCheck.
* Watch all workloads and process those that have an AdmissionCheck for this
  particular controller and are past AdmissionPrecheck.
* After approving the workload, keep monitoring the check; if it starts failing,
  fail the check, which causes the workload to move back to the suspended state.
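
The rules above for when preemptions may be issued, and for how the individual check
states combine into a decision about the workload, can be summarized in a minimal sketch.
It is not Kueue's implementation: `CheckState`, `Outcome`, `Evaluate` and `CanPreempt` are
hypothetical names, the check names `budget` and `provisioning` are made up, and plain
strings stand in for `metav1.Condition` and `PreemptionPolicy` to keep the example
self-contained. Treating a "False" condition with an unexpected reason as a retry is also
an assumption.

```go
package main

import "fmt"

// CheckState is a simplified, hypothetical view of one admission check: the
// condition posted on the Workload joined with the preemptionPolicy taken
// from the matching AdmissionCheck object.
type CheckState struct {
	Name   string
	Status string // "True", "False" or "Unknown"
	Reason string // "Retry", "Reject", "PreemptionRequired" or empty
	Policy string // "Anytime" or "AfterCheckPassedOrOnDemand"
}

// Outcome is a hypothetical summary of what happens to the Workload next.
type Outcome string

const (
	Wait   Outcome = "Wait"   // at least one check is still pending
	Admit  Outcome = "Admit"  // every check is "True"
	Retry  Outcome = "Retry"  // back to suspended, retried after the delay
	Reject Outcome = "Reject" // rejected, no retry
)

// Evaluate folds all check states into one outcome. A single "Reject" wins
// over everything else; any other "False" check is treated as a retry
// (an assumption of this sketch for reasons other than "Retry").
func Evaluate(checks []CheckState) Outcome {
	retry, pending := false, false
	for _, c := range checks {
		switch {
		case c.Status == "False" && c.Reason == "Reject":
			return Reject
		case c.Status == "False":
			retry = true
		case c.Status != "True":
			pending = true
		}
	}
	switch {
	case retry:
		return Retry
	case pending:
		return Wait
	default:
		return Admit
	}
}

// CanPreempt reports whether preemptions may be issued: either some check
// asked for them with the "PreemptionRequired" reason, or no check with the
// "AfterCheckPassedOrOnDemand" policy is still waiting to pass.
func CanPreempt(checks []CheckState) bool {
	gated := false
	for _, c := range checks {
		if c.Reason == "PreemptionRequired" {
			return true // on-demand request from a check
		}
		if c.Policy == "AfterCheckPassedOrOnDemand" && c.Status != "True" {
			gated = true
		}
	}
	return !gated
}

func main() {
	checks := []CheckState{
		{Name: "budget", Status: "True", Policy: "Anytime"},
		{Name: "provisioning", Status: "Unknown", Policy: "AfterCheckPassedOrOnDemand"},
	}
	fmt.Println(Evaluate(checks), CanPreempt(checks)) // Wait false

	// The pending check asks for preemptions on demand.
	checks[1].Reason = "PreemptionRequired"
	fmt.Println(Evaluate(checks), CanPreempt(checks)) // Wait true

	// The pending check passes; the workload can now be admitted.
	checks[1] = CheckState{Name: "provisioning", Status: "True", Policy: "AfterCheckPassedOrOnDemand"}
	fmt.Println(Evaluate(checks), CanPreempt(checks)) // Admit true
}
```

With no checks configured, `Evaluate` returns `Admit` and `CanPreempt` returns `true`,
which corresponds to today's behavior of a ClusterQueue without AdmissionChecks.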

### Test Plan

[x] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

##### Prerequisite testing updates

#### Unit Tests

#### Integration tests

The tests should cover:

* Transition from `Suspended` to `Prechecked`, with `AdmissionChecks` added.
* Transition from `Prechecked`, with `AdmissionCheck` passed, to `Admitted`.
* Transition from `Prechecked` back to `Suspended` because of a quota change.
* Transition from `Prechecked`, with `AdmissionCheck` passed, back to `Suspended` because of different resource flavors available.
* Transition from `Prechecked` to `Suspended` due to Retry, with 1 and 2 AdmissionChecks.
* Transition from `Prechecked` to `Rejected`, with 1 and 2 AdmissionChecks (only one is required to reject a workload).
* Transition from `Prechecked` to `Admitted` with 0 AdmissionChecks.

### Graduation Criteria

## Implementation History

## Drawbacks

## Alternatives