# KEP-1284: Add a mechanism to stop a ClusterQueue.
<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories](#user-stories)
    - [Story 1](#story-1)
  - [Notes/Constraints/Caveats](#notesconstraintscaveats)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [API/ClusterQueue](#apiclusterqueue)
  - [Controllers](#controllers)
    - [ClusterQueue](#clusterqueue)
    - [Workload](#workload)
  - [Test Plan](#test-plan)
      - [Prerequisite testing updates](#prerequisite-testing-updates)
    - [Unit Tests](#unit-tests)
    - [Integration tests](#integration-tests)
  - [Graduation Criteria](#graduation-criteria)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
<!-- /toc -->

## Summary
Add a setting to the ClusterQueue that lets an administrator pause new admissions, with the option to cancel current quota reservations and evict admitted workloads.

## Motivation

Stopping a ClusterQueue is a common administrator journey for controlling a user's resource usage.

### Goals

Add a setting to the ClusterQueue that lets an administrator pause new admissions, with the option to cancel current quota reservations and evict admitted workloads.

### Non-Goals

Manage the quota reservation and admission of workloads from other ClusterQueues in the same cohort that might borrow resources from the stopped ClusterQueue.

## Proposal

Add a new field, `stopPolicy`, to the ClusterQueue spec. Setting it to a value other than `None` marks the ClusterQueue as Inactive, and its value controls how `Admitted` or `Reserving` workloads are affected.

### User Stories
#### Story 1

As a cluster administrator, I want to be able to stop new admissions in a specific ClusterQueue, with the option of evicting currently admitted Workloads or canceling quota reservations.

### Notes/Constraints/Caveats
Managing reservation canceling and eviction for workloads in other queues of the same cohort that
are potentially borrowing resources from the stopped queue adds considerable complexity
for limited added value. These cases are therefore not covered in this first iteration.

### Risks and Mitigations

## Design Details

### API/ClusterQueue

```go
type ClusterQueueSpec struct {
	// ....

	// stopPolicy - if set to a value other than None, the ClusterQueue is considered
	// Inactive and no new reservations are made.
	//
	// Depending on its value, its associated workloads will:
	//
	// - None - Workloads are admitted
	// - HoldAndDrain - Admitted workloads are evicted and Reserving workloads will cancel the reservation.
	// - Hold - Admitted workloads will run to completion and Reserving workloads will cancel the reservation.
	//
	// +kubebuilder:validation:Enum=None;Hold;HoldAndDrain
	// +kubebuilder:default="None"
	StopPolicy StopPolicy `json:"stopPolicy,omitempty"`
}

type StopPolicy string

const (
	None         StopPolicy = "None"
	Hold         StopPolicy = "Hold"
	HoldAndDrain StopPolicy = "HoldAndDrain"
)
```
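
As a rough illustration of how the new field serializes, the sketch below re-declares the `StopPolicy` type and marshals a trimmed-down spec to JSON. `clusterQueueSpec`, `isValid`, and `marshalSpec` are hypothetical names introduced only for this example and are not part of the proposed API. In a cluster, the enum check would come from the CRD schema generated from the kubebuilder markers, not from hand-written code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// StopPolicy and its values are copied from the API proposed in this KEP.
type StopPolicy string

const (
	None         StopPolicy = "None"
	Hold         StopPolicy = "Hold"
	HoldAndDrain StopPolicy = "HoldAndDrain"
)

// clusterQueueSpec is a hypothetical, trimmed stand-in for ClusterQueueSpec
// that keeps only the field added by this KEP.
type clusterQueueSpec struct {
	StopPolicy StopPolicy `json:"stopPolicy,omitempty"`
}

// isValid reports whether p is one of the values listed in the enum marker.
// Hypothetical helper: it only mirrors the schema validation for illustration.
func isValid(p StopPolicy) bool {
	switch p {
	case None, Hold, HoldAndDrain:
		return true
	}
	return false
}

// marshalSpec is a hypothetical helper that renders a spec with the given
// policy as JSON, returning an empty string on error.
func marshalSpec(p StopPolicy) string {
	b, err := json.Marshal(clusterQueueSpec{StopPolicy: p})
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	fmt.Println(marshalSpec(HoldAndDrain)) // field is present
	fmt.Println(marshalSpec(""))           // field is dropped by omitempty
	fmt.Println(isValid("Pause"))          // not one of the enum values
}
```

Because of `omitempty`, an unset policy is dropped from the serialized object; the `+kubebuilder:default` marker is what would fill in `None` in that case.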
### Controllers
#### ClusterQueue

Once `stopPolicy` is set to `Hold` or `HoldAndDrain`, the ClusterQueue is marked as inactive with a relevant status message.
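
A minimal sketch of how the controller might derive that status is shown below. It assumes that only `Hold` and `HoldAndDrain` deactivate the queue. The `condition` struct, the `activeCondition` function, and the reason and message strings are all hypothetical and chosen for illustration; they are not taken from the Kueue code base.

```go
package main

import "fmt"

// StopPolicy and its values are copied from the API proposed in this KEP.
type StopPolicy string

const (
	None         StopPolicy = "None"
	Hold         StopPolicy = "Hold"
	HoldAndDrain StopPolicy = "HoldAndDrain"
)

// condition is a hypothetical, simplified stand-in for a Kubernetes status condition.
type condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// activeCondition derives an "Active" condition from the stop policy.
// Assumption: only Hold and HoldAndDrain deactivate the queue. The reason and
// message strings are illustrative.
func activeCondition(p StopPolicy) condition {
	if p == Hold || p == HoldAndDrain {
		return condition{
			Type:    "Active",
			Status:  "False",
			Reason:  "Stopped",
			Message: fmt.Sprintf("ClusterQueue is stopped (stopPolicy=%s)", p),
		}
	}
	return condition{
		Type:    "Active",
		Status:  "True",
		Reason:  "Ready",
		Message: "ClusterQueue can admit new workloads",
	}
}

func main() {
	for _, p := range []StopPolicy{None, Hold, HoldAndDrain} {
		c := activeCondition(p)
		fmt.Printf("%-12s -> %s=%s (%s)\n", p, c.Type, c.Status, c.Reason)
	}
}
```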

#### Workload

When the `stopPolicy` of a workload's ClusterQueue changes, the workload controller should evict the
workload or cancel its reservation, depending on the policy value and the state of the workload.
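
The decision described above can be sketched as a small lookup from policy and workload state to an action. The `workloadState` and `action` types and the `actionFor` function are hypothetical names used only for this example; the mapping itself follows the `stopPolicy` field comments in the API section.

```go
package main

import "fmt"

// StopPolicy and its values are copied from the API proposed in this KEP.
type StopPolicy string

const (
	None         StopPolicy = "None"
	Hold         StopPolicy = "Hold"
	HoldAndDrain StopPolicy = "HoldAndDrain"
)

// workloadState is a hypothetical, simplified view of a workload's lifecycle.
type workloadState string

const (
	statePending   workloadState = "Pending"   // no quota reserved yet
	stateReserving workloadState = "Reserving" // quota reserved, not yet admitted
	stateAdmitted  workloadState = "Admitted"  // admitted and possibly running
)

// action is a hypothetical name for what the workload controller should do.
type action string

const (
	actionNone              action = "None"
	actionCancelReservation action = "CancelReservation"
	actionEvict             action = "Evict"
)

// actionFor maps a stop policy and a workload state to an action:
// both Hold and HoldAndDrain cancel reservations, only HoldAndDrain evicts
// admitted workloads, and None (or an unset policy) leaves workloads alone.
func actionFor(p StopPolicy, s workloadState) action {
	switch {
	case s == stateReserving && (p == Hold || p == HoldAndDrain):
		return actionCancelReservation
	case s == stateAdmitted && p == HoldAndDrain:
		return actionEvict
	default:
		return actionNone
	}
}

func main() {
	for _, p := range []StopPolicy{None, Hold, HoldAndDrain} {
		for _, s := range []workloadState{statePending, stateReserving, stateAdmitted} {
			fmt.Printf("%-12s %-9s -> %s\n", p, s, actionFor(p, s))
		}
	}
}
```

Keeping the mapping in one pure function makes it straightforward to cover every policy and state combination with a table-driven unit test.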

### Test Plan

[x] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

##### Prerequisite testing updates

#### Unit Tests

To be added, depending on the complexity of the added code.

#### Integration tests

The `controllers/core` suite should check:

1. ClusterQueue - Once `stopPolicy` is set, the ClusterQueue becomes Inactive.
2. Workload - Once the `stopPolicy` of its ClusterQueue is set, depending on the value:
   - `Hold`: Reserving workloads cancel their reservation.
   - `HoldAndDrain`: Admitted workloads are evicted and Reserving workloads cancel their reservation.
   - A new workload is not admitted while the ClusterQueue is inactive.

### Graduation Criteria

## Implementation History

## Drawbacks

## Alternatives