# KEP-993: Two Stage Admission Process

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories (Optional)](#user-stories-optional)
    - [Story 1](#story-1)
    - [Story 2](#story-2)
  - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Test Plan](#test-plan)
      - [Prerequisite testing updates](#prerequisite-testing-updates)
    - [Unit Tests](#unit-tests)
    - [Integration tests](#integration-tests)
  - [Graduation Criteria](#graduation-criteria)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
<!-- /toc -->

## Summary

This KEP defines an extension point for plugging in arbitrary additional checks
that workloads must pass before they are admitted for execution. These external checks may
be implemented inside or outside of Kueue and may provide functionality like
budgeting, capacity provisioning or some elements of multicluster dispatching.

## Motivation

As Kueue grows, not all desired elements may land inside core
Kueue. Some users may not even want to publish their custom admission logic in
a public repository. At the same time, we don’t want to force users to
fork Kueue. Thus, a pluggable mechanism for workload admission needs to
be established.

### Goals

Define a mechanism that allows external controllers:

* to give the green light to admit a workload.
* to temporarily hold workload admission.
* to stop a previously admitted workload and move it back to the
  suspended state.

### Non-Goals

Provide a specific design or implementation for any external controller.

## Proposal

Extend the ClusterQueue definition with a list of external controller names that
need to give the green light to admit a workload. The workload will not be
admitted until every controller identified by one of these names has given
the green light.

Each workload submitted to such a queue gets additional conditions,
in a dedicated field of its status, each reflecting the go/no-go
decision of one of the external controllers. Once
a controller is confident that the workload can be admitted, it flips the
status of the corresponding condition from "Unknown" to "True".

If a controller changes its mind about a workload, it can switch the condition
back to "False" and the workload will be immediately suspended.
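
The condition flips described above can be sketched as follows. This is a minimal illustration, not Kueue's implementation: `Condition` is a hypothetical, trimmed-down stand-in for `metav1.Condition`, and `setCheck` and `allPassed` are helper names invented for this example.

```go
package main

// ConditionStatus and Condition are hypothetical stand-ins for the metav1
// types; only the fields needed for this sketch are modelled.
type ConditionStatus string

const (
	StatusTrue    ConditionStatus = "True"
	StatusFalse   ConditionStatus = "False"
	StatusUnknown ConditionStatus = "Unknown"
)

type Condition struct {
	Type   string // name of the admission check
	Status ConditionStatus
	Reason string
}

// setCheck records a controller's decision for one check, updating the
// existing entry if present and appending a new one otherwise.
func setCheck(conds []Condition, check string, status ConditionStatus, reason string) []Condition {
	for i := range conds {
		if conds[i].Type == check {
			conds[i].Status = status
			conds[i].Reason = reason
			return conds
		}
	}
	return append(conds, Condition{Type: check, Status: status, Reason: reason})
}

// allPassed reports whether every listed check has been flipped to "True".
func allPassed(conds []Condition) bool {
	for _, c := range conds {
		if c.Status != StatusTrue {
			return false
		}
	}
	return true
}
```

A controller that later changes its mind would call `setCheck` again with `StatusFalse`, after which `allPassed` returns false for that workload.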

### User Stories (Optional)

#### Story 1

I want to use the [Cluster Autoscaler ProvisioningRequest API](https://github.com/kubernetes/autoscaler/pull/5848)
to ensure that the resources can be provided in full before starting the workload.

#### Story 2

I want to base workload admission on budget. I have the actual consumption metrics
in my Prometheus/Datadog/Stackdriver/whatever monitoring system, and a weekly
budget, per namespace, in a CRD. I’m happy to modify some small existing sample
controller, but I don’t want to fork the whole of Kueue to adjust some minor details.

### Notes/Constraints/Caveats (Optional)

Each of the go/no-go decisions is best-effort and can be reversed at any time.

### Risks and Mitigations

## Design Details

TL;DR:
* extend ClusterQueueSpec with the list of checks.
* add extra conditions to Workload status.
* introduce a new CRD, AdmissionCheck.

Details:

We will extend the ClusterQueueSpec definition by adding a reference to the
AdmissionChecks that the queue needs to perform before admitting a
workload. The design of AdmissionChecks is similar to IngressClass:
it allows passing additional parameters, specific to the particular checks
that need to be performed.

```
// ClusterQueueSpec defines the desired state of ClusterQueue.
type ClusterQueueSpec struct {
[...]
	// List of AdmissionCheck names that need to be passed
	// to admit a workload. The order of the checks doesn't matter;
	// all of them will be started at the same time.
	AdmissionChecks []string `json:"admissionChecks"`
}

// WorkloadStatus defines the observed state of Workload.
type WorkloadStatus struct {
[...]
	// Status of admission checks, if there are any specified
	// for the ClusterQueue in which the workload is queued.
	AdmissionChecks []metav1.Condition `json:"admissionChecksConditions,omitempty"`
}

// Condition names used in WorkloadStatus
const (
	AdmissionPrecheck string = "AdmissionPrecheck"
)

// Cluster-scoped top-level object defining a check that
// can be referenced in ClusterQueue.
type AdmissionCheck struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec AdmissionCheckSpec `json:"spec,omitempty"`
}

// Condition Reasons used by AdmissionChecks.
const (
	// The check cannot pass at this moment, back off (possibly
	// allowing others to try, unblock quota) and retry. Can be set only
	// when Condition status is set to "False".
	Retry string = "Retry"
	// The check will not pass in the near future. It is not worth
	// retrying. Can be set together with Condition status "False".
	Reject string = "Reject"
	// The check might pass if there were fewer workloads. Proceed
	// with any pending workload preemption. Should be set
	// when condition status is still "Unknown".
	PreemptionRequired string = "PreemptionRequired"
)

// AdmissionCheckSpec defines the desired state of AdmissionCheck.
type AdmissionCheckSpec struct {
	// Name of the controller which will actually perform
	// the checks. This is the name the controller identifies itself with,
	// not a K8S pod or deployment name. Cannot be empty.
	ControllerName string `json:"controllerName"`

	// How long to keep the workload suspended
	// after a failed check (after it transitioned to False).
	// After that the check state goes to "Unknown".
	// The default is 15 min.
	RetryDelayMinutes *int64 `json:"retryDelayMinutes,omitempty"`

	// A reference to the additional parameters for the check.
	Parameters *AdmissionCheckParametersReference `json:"parameters,omitempty"`

	// preemptionPolicy determines when to issue preemptions for the Workload,
	// if necessary, in relationship to the status of the admission check.
	// The possible values are:
	// - `Anytime`: No need to wait for this check to pass before issuing preemptions.
	//   Preemptions might be blocked on the preemptionPolicy of other AdmissionChecks.
	// - `AfterCheckPassedOrOnDemand`: Wait for this check to pass before issuing preemptions,
	//   unless this or other checks request preemptions through the Workload's admissionChecks.
	// Defaults to `Anytime`.
	PreemptionPolicy *PreemptionPolicy `json:"preemptionPolicy,omitempty"`
}

// PreemptionPolicy is the type of the preemptionPolicy values listed above.
type PreemptionPolicy string

const (
	Anytime                    PreemptionPolicy = "Anytime"
	AfterCheckPassedOrOnDemand PreemptionPolicy = "AfterCheckPassedOrOnDemand"
)

// Points to where the parameters for the checks are.
// As ClusterQueues are cluster-scoped, this should be
// a dedicated CRD specific to the controller.
type AdmissionCheckParametersReference struct {
	// APIGroup is the group for the resource being referenced.
	APIGroup string `json:"apiGroup"`

	// Kind is the type of the resource being referenced.
	Kind string `json:"kind"`

	// Name is the name of the resource being referenced.
	Name string `json:"name"`
}
```
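
The defaults documented in `AdmissionCheckSpec` (a 15 minute retry delay and the `Anytime` preemption policy) and the non-empty `controllerName` requirement could be applied roughly as below. This is only a sketch: the struct repeats just the relevant fields without JSON tags, and `withDefaults` and `validate` are hypothetical helpers, not functions defined by this KEP.

```go
package main

import "errors"

type PreemptionPolicy string

const (
	Anytime                    PreemptionPolicy = "Anytime"
	AfterCheckPassedOrOnDemand PreemptionPolicy = "AfterCheckPassedOrOnDemand"
)

// AdmissionCheckSpec repeats only the fields relevant to defaulting and
// validation; Parameters and the JSON tags are omitted for brevity.
type AdmissionCheckSpec struct {
	ControllerName    string
	RetryDelayMinutes *int64
	PreemptionPolicy  *PreemptionPolicy
}

// withDefaults (hypothetical helper) returns a copy of spec in which unset
// optional fields carry the defaults stated in the type comments.
func withDefaults(spec AdmissionCheckSpec) AdmissionCheckSpec {
	if spec.RetryDelayMinutes == nil {
		delay := int64(15)
		spec.RetryDelayMinutes = &delay
	}
	if spec.PreemptionPolicy == nil {
		policy := Anytime
		spec.PreemptionPolicy = &policy
	}
	return spec
}

// validate (hypothetical helper) enforces that ControllerName is not empty.
func validate(spec AdmissionCheckSpec) error {
	if spec.ControllerName == "" {
		return errors.New("controllerName cannot be empty")
	}
	return nil
}
```

Keeping the optional fields as pointers makes "unset" distinguishable from an explicit value, so an explicitly configured delay or policy is never overwritten by the default.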

For every workload placed in a ClusterQueue that has AdmissionChecks
configured, Kueue will add:

* "QuotaReserved" set to `False` in Conditions.
* `<checkName>` set to `Unknown` in AdmissionCheckConditions, for each of the AdmissionChecks.
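
The initial status that Kueue writes for such a workload can be pictured with the sketch below. The types are simplified, hypothetical stand-ins for the real Workload status (statuses are plain strings), and `initialStatus` is an illustrative name.

```go
package main

// Condition is a simplified stand-in for metav1.Condition.
type Condition struct {
	Type   string
	Status string // "True", "False" or "Unknown"
}

// WorkloadStatus models only the two condition lists discussed here.
type WorkloadStatus struct {
	Conditions      []Condition // general workload conditions
	AdmissionChecks []Condition // one entry per configured admission check
}

// initialStatus builds the status for a workload entering a ClusterQueue
// that lists the given admission checks.
func initialStatus(checkNames []string) WorkloadStatus {
	st := WorkloadStatus{
		Conditions: []Condition{{Type: "QuotaReserved", Status: "False"}},
	}
	for _, name := range checkNames {
		st.AdmissionChecks = append(st.AdmissionChecks, Condition{Type: name, Status: "Unknown"})
	}
	return st
}
```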

Kueue will perform the very same checks that it does today
before admitting a workload. However, once the basic checks
pass AND there are some AdmissionChecks configured AND the
workload is not held in a retry delay by some check, it will:

1. Fill the Admission field in the workload with the desired flavor assignment.
2. Not do any preemptions yet (unless BookCapacity is set to true).
3. Set "QuotaReserved" to `True`.

Kueue will only pass as many pods into "QuotaReserved" as would
fit in the quota together, assuming that the necessary preemptions will happen.
This slightly violates the BestEffort logic of queues: if there are 1000 tasks
in a queue that don’t pass quota and the 1001st task passes, a best-effort
queue would let it in. A BestEffort queue with 1000 tasks that don’t pass
AdmissionChecks would not let the 1001st task in (it will not reach the
checks). Without this limitation, a large number of tasks could be switched
back and forth between "QuotaReserved" and the suspended state.

Preemptions might happen at this point or later, depending on the `preemptionPolicy` of each
AdmissionCheck. Preemption can happen immediately if all AdmissionChecks have a `preemptionPolicy`
of `Anytime`. Otherwise, it will happen as soon as:
- An AdmissionCheck requests preemptions using the reason `PreemptionRequired` in the check
  status posted to the Workload's `.status.admissionChecks`, or
- All AdmissionChecks with a `preemptionPolicy` of `AfterCheckPassedOrOnDemand` are reported as
  `True` in the Workload's `.status.admissionChecks`.
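
The preemption timing rule above reduces to a small predicate. The following is a sketch under simplifying assumptions: `CheckState` is a hypothetical type that joins each check's configured policy with the status and reason posted on the Workload, and `canPreempt` is an illustrative name.

```go
package main

type PreemptionPolicy string

const (
	Anytime                    PreemptionPolicy = "Anytime"
	AfterCheckPassedOrOnDemand PreemptionPolicy = "AfterCheckPassedOrOnDemand"
)

// CheckState is a hypothetical view of one admission check for a workload.
type CheckState struct {
	Policy PreemptionPolicy
	Status string // "True", "False" or "Unknown"
	Reason string
}

// canPreempt reports whether preemptions may be issued for the workload.
func canPreempt(checks []CheckState) bool {
	allGatedPassed := true
	for _, c := range checks {
		if c.Reason == "PreemptionRequired" {
			// Any check may request preemptions on demand.
			return true
		}
		if c.Policy == AfterCheckPassedOrOnDemand && c.Status != "True" {
			allGatedPassed = false
		}
	}
	// With only Anytime checks nothing gates preemption, so this stays true.
	return allGatedPassed
}
```

Checks with the `Anytime` policy never block preemption in this sketch; only an unpassed `AfterCheckPassedOrOnDemand` check does, and an on-demand request from any check overrides it.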

Note that all Workloads that have a condition `QuotaReserved` are candidates for preemption. If
any of these Workloads needs to be preempted:
1. The `QuotaReserved` condition is set to `False`.
2. `.status.admissionChecks` is cleared.
3. Controllers for the admission checks should stop any operations started for this Workload.

Once all admission checks are satisfied (set to `True`), Kueue will recheck that the Admission settings
are still valid (check quota/preemptions/reclamation) and admit the workload (doing any pending or
newly identified preemptions, if needed), setting the `Admitted`
condition to `True`.

If any check is switched to "False" with reason "Reject" or "Retry", the workload is, respectively,
completely rejected or switched back to the suspended state (with a retry delay).
Setting the reason to "PreemptionRequired" (while keeping the condition as "Unknown")
means that some other, external quota is reached
and Kueue should try to preempt some workloads before the check can
succeed.
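
How the individual check results combine into one outcome for the workload can be sketched as below. The precedence used here (a single "Reject" wins, then any other "False" suspends the workload for a retry, then any "Unknown" keeps it waiting) is an assumption made for the example, as is treating a "False" status without the "Reject" reason as a retry. `Condition`, `Outcome` and `decide` are hypothetical names.

```go
package main

// Condition is a simplified stand-in for metav1.Condition.
type Condition struct {
	Type   string
	Status string // "True", "False" or "Unknown"
	Reason string
}

type Outcome string

const (
	Admit  Outcome = "Admit"  // all checks passed; Kueue rechecks quota and admits
	Wait   Outcome = "Wait"   // at least one check is still undecided
	Retry  Outcome = "Retry"  // back to suspended, retried after the delay
	Reject Outcome = "Reject" // workload is rejected completely
)

// decide folds the per-check conditions into a single outcome.
func decide(checks []Condition) Outcome {
	result := Admit
	for _, c := range checks {
		switch c.Status {
		case "True":
			// This check passed; it does not change the result.
		case "False":
			if c.Reason == "Reject" {
				return Reject // one rejecting check is enough
			}
			result = Retry
		default:
			// "Unknown": keep waiting unless a retry was already triggered.
			if result == Admit {
				result = Wait
			}
		}
	}
	return result
}
```

With zero checks configured, `decide` returns `Admit`, matching a ClusterQueue that has no AdmissionChecks.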

The controller implementing a particular check should:

* Watch all AdmissionCheck objects to know which ones it should handle.
* Watch all controller-specific parameter objects, potentially referenced from AdmissionCheck.
* Watch all workloads and process those that have an AdmissionCheck for this
  particular controller and are past AdmissionPrecheck.
* After approving the workload, keep monitoring the check; if it starts failing,
  fail the check, which causes the workload to move back to the suspended state.
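
The filtering part of these responsibilities might look like the sketch below. It leaves out the actual watches and uses hypothetical, flattened types: `AdmissionCheck` carries only its name and `controllerName`, and `Workload` exposes a `PastPrecheck` flag plus a map from check name to status. The controller name in the usage note is made up.

```go
package main

import "sort"

// AdmissionCheck is a flattened, hypothetical view of the CRD.
type AdmissionCheck struct {
	Name           string
	ControllerName string
}

// Workload is a hypothetical view of the fields a check controller needs.
type Workload struct {
	Name         string
	PastPrecheck bool              // workload is past AdmissionPrecheck
	Checks       map[string]string // check name -> "True", "False" or "Unknown"
}

// ownedChecks returns the names of the AdmissionChecks handled by controller.
func ownedChecks(all []AdmissionCheck, controller string) map[string]bool {
	owned := make(map[string]bool)
	for _, ac := range all {
		if ac.ControllerName == controller {
			owned[ac.Name] = true
		}
	}
	return owned
}

// pendingChecks lists, in sorted order, the owned checks on w that still
// await a decision. Workloads that are not past the precheck are skipped.
func pendingChecks(w Workload, owned map[string]bool) []string {
	if !w.PastPrecheck {
		return nil
	}
	var pending []string
	for name, status := range w.Checks {
		if owned[name] && status == "Unknown" {
			pending = append(pending, name)
		}
	}
	sort.Strings(pending)
	return pending
}
```

A controller identifying itself as, say, `example.com/budget` would call `ownedChecks` whenever the set of AdmissionCheck objects changes and `pendingChecks` for each workload event it receives.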

### Test Plan

[ x ] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

##### Prerequisite testing updates

#### Unit Tests

#### Integration tests

The tests should cover:

* Transition from `Suspended` to `Prechecked`, with `AdmissionChecks` added.
* Transition from `Prechecked`, with `AdmissionCheck` passed, to `Admitted`.
* Transition from `Prechecked` back to `Suspended` because of a quota change.
* Transition from `Prechecked`, with `AdmissionCheck` passed, back to `Suspended` because different resource flavors are available.
* Transition from `Prechecked` to `Suspended` due to Retry, with 1 and 2 AdmissionChecks.
* Transition from `Prechecked` to `Rejected`, with 1 and 2 AdmissionChecks (only one is required to reject a workload).
* Transition from `Prechecked` to `Admitted` with 0 AdmissionChecks.

### Graduation Criteria

## Implementation History

## Drawbacks

## Alternatives