# KEP-693: MultiKueue

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories (Optional)](#user-stories-optional)
    - [Story 1](#story-1)
    - [Story 2](#story-2)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
    - [Follow-up ideas](#follow-up-ideas)
  - [Test Plan](#test-plan)
    - [Unit Tests](#unit-tests)
    - [Integration tests](#integration-tests)
    - [E2E tests](#e2e-tests)
  - [Graduation Criteria](#graduation-criteria)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
<!-- /toc -->

## Summary
Introduce a new AdmissionCheck (called MultiKueue) with a dedicated API
and controller that will provide multi-cluster capabilities to Kueue.

## Motivation
Many of Kueue's users run multiple clusters and would like an easy way
to distribute batch jobs across them to keep all of them utilized.
Without a global distribution point, some clusters may receive fewer
jobs than they are able to process while others receive more, leading
to underutilization and higher costs.

### Goals
* Allow Kueue to distribute batch jobs across multiple clusters,
while maintaining the specified quota limits.
* Provide users with a single entry point through which jobs
can be submitted and monitored, just as if they were running in
a single cluster.
* Be compatible with all of Kueue's features (priorities, borrowing, preemption, etc.)
and most of its integrations.
* Allow upgrading single-cluster Kueue deployments to multi-cluster without
much hassle.

### Non-Goals
* Solve the storage problem. It is assumed that the distributed jobs are
either location-flexible (for a subset of clusters) or copy the
data as a part of the startup process.
* Automatically detect and configure new clusters.
* Synchronize configuration across the clusters. It is expected that the
user will create the appropriate objects, roles and permissions
in the clusters (manually, using GitOps or some 3rd-party tooling).
* Set up authentication between clusters.
* Support very high job throughput (>1M jobs/day).
* Support Kubernetes Jobs on management clusters that have neither
kubernetes/enhancements#4370 implemented nor the Job controller disabled.
* Support cluster role sharing (worker & manager inside one cluster);
this is out of scope for this KEP. We will get back to the topic once
kubernetes/enhancements#4370 is merged and becomes a wider standard.
* Distribute running Jobs across multiple clusters, and reconcile partial
results in the Job objects on the management cluster (each Job will run on
a single worker cluster).

## Proposal

Introduce the MultiKueue AdmissionCheck, controller and configuration API.

Establish the need for a designated management cluster.

![Architecture](arch.png "Architecture")

For each workload coming to a ClusterQueue (with the MultiKueue AdmissionCheck enabled)
in the management cluster that gets past the preadmission phase of the
two-phase admission process (meaning that the global quota, the total amount of resources
that can be consumed across all clusters, is not exceeded),
the MultiKueue controller will clone the workload in the defined worker clusters and wait
until some Kueue instance running there admits it.
If a remote workload is admitted first, the job will be created
in that remote cluster with a `kueue.x-k8s.io/prebuilt-workload-name` label pointing to the clone.
The controller will then remove the workloads from the remaining worker clusters and allow the
single instance of the job to proceed. The workload will also be admitted in
the management cluster.
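
As an illustration, here is a minimal sketch (with hypothetical helper and variable names, not Kueue's actual implementation) of how the remote copy of a `batch/v1` Job could be prepared so that the worker's Kueue binds it to the already-created Workload clone instead of generating a new one:

```go
import (
    batchv1 "k8s.io/api/batch/v1"
)

// prepareRemoteJob is an illustrative helper: it copies the management-cluster
// Job and points it at the Workload clone on the worker via the
// kueue.x-k8s.io/prebuilt-workload-name label.
func prepareRemoteJob(job *batchv1.Job, workloadName string) *batchv1.Job {
    remote := job.DeepCopy()
    // Clear fields owned by the management cluster's API server.
    remote.ResourceVersion = ""
    remote.UID = ""
    if remote.Labels == nil {
        remote.Labels = map[string]string{}
    }
    remote.Labels["kueue.x-k8s.io/prebuilt-workload-name"] = workloadName
    return remote
}
```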

There will be no job controllers running in the management cluster, or they will be
disabled for the workloads coming to MultiKueue-enabled cluster queues via an annotation
or some other, yet to be decided, mechanism. By disabling we mean that the controller
will take no action on the selected objects (no pods or other objects will be created),
while allowing other controllers to update the objects' status as they see fit.

There will be just CRD/job definitions deployed. The MultiKueue controller will copy the status
of the job from the worker cluster, so that it will appear that the job
is running inside the management cluster. However, as there is no job controller,
no pods will be created in the management cluster. Nor will any controller
overwrite the status copied by the MultiKueue controller.

If the job, for whatever reason, is suspended or deleted in the management cluster,
it will be deleted from the worker cluster. Deletion or suspension of the job
only in the worker cluster will trigger global requeuing of the job.
Once the job finishes in the worker cluster, the job will also
finish in the management cluster.

### User Stories (Optional)

#### Story 1
As a Kueue user, I have clusters on different cloud providers and on-prem.
I would like to run computation-heavy jobs across all of them, wherever
I have free resources.

#### Story 2
As a Kueue user, I have clusters in multiple regions of the same cloud
provider. I would like to run workloads that require the newest GPUs,
whose on-demand availability is very volatile. The GPUs become available
at random times in random regions. I want to use ProvisioningRequest
to try to catch them.

### Risks and Mitigations
* Disabling the Job controller for all (or selected) objects may be problematic
in environments where access to the master configuration is limited (like GKE).
We are working on kubernetes/enhancements#4370
to establish an acceptable way of using a non-default controller (or none at all).

* etcd may not provide enough performance (writes/s) to handle very large
deployments with very high job throughput (above 1M jobs per day).

* The management cluster could be a single point of failure. The mitigations include:
  * Running multiple management clusters with infinite global quotas and
    correct, limiting worker-cluster local quotas.
  * Running multiple management clusters, with one leader and back-up clusters
    learning the state of the world from the worker clusters (not covered
    by this KEP).

## Design Details
MultiKueue will be enabled on a cluster queue using the admission check fields.
Just like ProvisioningRequest, MultiKueue will have its own configuration,
MultiKueueConfig, with the definition below. To allow reusing the same clusters
across many Kueues, an additional object, MultiKueueCluster, is added.

```go
type MultiKueueConfig struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`
    Spec MultiKueueConfigSpec `json:"spec,omitempty"`
}

type MultiKueueConfigSpec struct {
    // List of MultiKueueCluster names where the
    // workloads from the ClusterQueue should be distributed.
    Clusters []string `json:"clusters,omitempty"`
}

type MultiKueueCluster struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`
    Spec MultiKueueClusterSpec `json:"spec,omitempty"`
    Status MultiKueueClusterStatus `json:"status,omitempty"`
}

type LocationType string

const (
    // Location is the path on the disk of kueue-controller-manager.
    PathLocationType LocationType = "Path"

    // Location is the name of the secret inside the namespace in which the kueue controller
    // manager is running. The config should be stored in the "kubeconfig" key.
    SecretLocationType LocationType = "Secret"
)

type MultiKueueClusterSpec struct {
    // Information how to connect to the cluster.
    KubeConfig KubeConfig `json:"kubeConfig"`
}

type KubeConfig struct {
    // Location of the KubeConfig.
    Location string `json:"location"`

    // Type of the KubeConfig location.
    //
    // +kubebuilder:default=Secret
    // +kubebuilder:validation:Enum=Secret;Path
    LocationType LocationType `json:"locationType"`
}

type MultiKueueClusterStatus struct {
    Conditions []metav1.Condition `json:"conditions,omitempty" patchStrategy:"merge" patchMergeKey:"type"`
}
```
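
For illustration, a minimal sketch of how these objects could be instantiated using the types above; all names (and the layout of one kubeconfig Secret per worker) are hypothetical:

```go
// Example wiring (hypothetical names): one config listing two workers, and the
// definition of one of those workers, whose kubeconfig lives in a Secret named
// "worker-us-east-kubeconfig" (under the "kubeconfig" key) in the manager's namespace.
config := MultiKueueConfig{
    ObjectMeta: metav1.ObjectMeta{Name: "multikueue-test"},
    Spec: MultiKueueConfigSpec{
        Clusters: []string{"worker-us-east", "worker-eu-west"},
    },
}

workerUSEast := MultiKueueCluster{
    ObjectMeta: metav1.ObjectMeta{Name: "worker-us-east"},
    Spec: MultiKueueClusterSpec{
        KubeConfig: KubeConfig{
            Location:     "worker-us-east-kubeconfig",
            LocationType: SecretLocationType,
        },
    },
}
```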

The MultiKueue controller will monitor all cluster definitions and maintain
Kube clients for all of them. Any connectivity problems will be reported both in
the MultiKueueCluster status as well as in the AdmissionCheckStatus and Events. The MultiKueue controller
will make sure that whenever the kubeconfig is refreshed, the appropriate
clients will also be recreated.
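
A minimal sketch of how such a client could be (re)built from kubeconfig bytes, assuming client-go; the helper name is illustrative, not Kueue's actual code:

```go
import (
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

// buildWorkerClient is an illustrative helper. The kubeconfig bytes would come
// from the referenced Secret's "kubeconfig" key or from a file on disk,
// depending on the configured LocationType.
func buildWorkerClient(kubeconfig []byte) (*kubernetes.Clientset, error) {
    restConfig, err := clientcmd.RESTConfigFromKubeConfig(kubeconfig)
    if err != nil {
        return nil, err
    }
    return kubernetes.NewForConfig(restConfig)
}
```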

Creation of the kubeconfig files is outside of MultiKueue's scope, and is cloud
provider/environment dependent.

The MultiKueue controller, when pushing workloads to the worker clusters, will use the same
namespace and local queue names as were used in the management cluster. It is the user's
responsibility to set up the appropriate namespaces and local queues.
Worker ClusterQueue definitions may differ from those in the management cluster. For example,
quota settings may be specific to the given location, and/or the cluster queue may have different
admission checks, use ProvisioningRequest, etc.

When distributing the workloads across clusters, the MultiKueue controller will first create
Kueue's internal Workload object. Only after the workload is admitted and the other clusters
are cleaned up will the real job be created, to match the Workload. This guarantees
that the workload will not start in more than one cluster. The workload will
get an annotation stating where it is actually running.
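
The ordering can be summarized with the following sketch; the `workerClient` interface and all helper names are hypothetical, for illustration only:

```go
import (
    "context"

    kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// workerClient abstracts the per-cluster operations the controller needs.
// This method set is illustrative, not Kueue's actual interface.
type workerClient interface {
    CreateWorkload(ctx context.Context, wl *kueue.Workload) error
    DeleteWorkload(ctx context.Context, name string) error
    CreateJob(ctx context.Context, wl *kueue.Workload) error
    AdmittedWorkload(ctx context.Context, name string) (bool, error)
}

// distribute sketches the order of operations: clone the Workload to every
// worker, wait until one clone is admitted, delete the remaining clones, and
// only then create the real job on the winning cluster (carrying the
// kueue.x-k8s.io/prebuilt-workload-name label, as described above).
func distribute(ctx context.Context, wl *kueue.Workload, workers []workerClient) error {
    for _, w := range workers {
        if err := w.CreateWorkload(ctx, wl.DeepCopy()); err != nil {
            return err
        }
    }
    for ctx.Err() == nil { // poll for simplicity; real code would watch instead
        for _, w := range workers {
            admitted, err := w.AdmittedWorkload(ctx, wl.Name)
            if err != nil || !admitted {
                continue
            }
            for _, other := range workers {
                if other != w {
                    _ = other.DeleteWorkload(ctx, wl.Name) // clean up losing clones
                }
            }
            return w.CreateJob(ctx, wl)
        }
    }
    return ctx.Err()
}
```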

When the job is running, the MultiKueue controller will copy its status from the worker cluster
to the management cluster, to maintain the impression that the job is running in the management
cluster. This is needed to allow pipelines and workflow engines to execute against
the management cluster with MultiKueue without any extra changes.

If the connection between the management cluster and a worker cluster is lost, the management
cluster assumes the total loss of all running/admitted workloads and moves them back to
the non-admitted/queued state. Once the cluster is reconnected, the workloads are reconciled.
If there is enough global quota, the workloads found admitted on the worker will be re-admitted in
the management cluster. If not, some workloads will be preempted to meet the global quota.
In case of duplicates, all but one of them will be removed.

#### Follow-up ideas

* Handle a large number of clusters via selectors.
* Provide a plugin mechanism to control how the workloads are distributed across worker clusters.

### Test Plan
[x] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

#### Unit Tests
The code will adhere to regular best practices for unit tests and coverage.

#### Integration tests
Integration tests will be executed against mocked clients for the worker clusters
that will provide predefined responses and allow testing various error scenarios,
including situations like:

* Job is created across multiple clusters and admitted in one.
* Job is admitted at the same time by two clusters.
* Job is rejected by a cluster.
* Worker cluster doesn't have the corresponding namespace.
* Worker cluster doesn't have the corresponding local/cluster queue.
* Worker cluster is unresponsive.
* Worker cluster deletes the job.
* Job is correctly finished.
* Job finishes with an error.
* Job status changes frequently.

#### E2E tests
E2E tests should be created and cover similar use cases as the integration tests. To start,
they should focus on JobSet.

### Graduation Criteria
The feature starts at the alpha level, with a feature gate.

In the Alpha version, in the 0.6 release, MultiKueue will support:

* APIs as described above.
* Basic workload distribution across clusters.
* JobSet integration, with full status relay.

Other integrations may come in 0.6 (if lucky) or in following releases
of Kueue.

Graduation to beta criteria:
* Positive feedback from users.
* Most of the integrations supported.
* No major bugs or deficiencies are outstanding.
* Roadmap for missing features is defined.

## Implementation History
* 2023-11-28 Initial KEP.

## Drawbacks
MultiKueue has some drawbacks.
* Doesn't solve storage problems.
* Requires some manual work to sync configuration and authentication between clusters.
* Requires a management cluster.
* Requires some external work to disable job controller(s) in management clusters.
* Scalability and throughput depend on etcd.

## Alternatives
* Use Armada or Multi-Cluster App Dispatcher.
* Use multicluster-specific Job APIs.