# KEP-693: MultiKueue

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories (Optional)](#user-stories-optional)
    - [Story 1](#story-1)
    - [Story 2](#story-2)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
    - [Follow-up ideas](#follow-up-ideas)
  - [Test Plan](#test-plan)
    - [Unit Tests](#unit-tests)
    - [Integration tests](#integration-tests)
    - [E2E tests](#e2e-tests)
  - [Graduation Criteria](#graduation-criteria)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
<!-- /toc -->

## Summary
Introduce a new AdmissionCheck (called MultiKueue) with a dedicated API
and controller that will provide multi-cluster capabilities to Kueue.

## Motivation
Many of Kueue's users run multiple clusters and would like an easy way
to distribute batch jobs across them to keep all of them utilized.
Without a global distribution point, some clusters may receive fewer
jobs than they are able to process while others receive more, leading
to underutilization and higher costs.

### Goals
* Allow Kueue to distribute batch jobs across multiple clusters,
while maintaining the specified quota limits.
* Provide users with a single entry point through which jobs
can be submitted and monitored, just as if they were running in
a single cluster.
* Be compatible with all of Kueue's features (priorities, borrowing, preemption, etc.)
and most of its integrations.
* Allow upgrading single-cluster Kueue deployments to multi-cluster without
much hassle.

### Non-Goals
* Solve the storage problem. It is assumed that the distributed jobs are
either location-flexible (for a subset of clusters) or copy the
data as a part of the startup process.
* Automatically detect and configure new clusters.
* Synchronize configuration across the clusters. It is expected that the
user will create the appropriate objects, roles and permissions
in the clusters (manually, using GitOps or some 3rd-party tooling).
* Set up authentication between clusters.
* Support very high job throughput (>1M jobs/day).
* Support Kubernetes Jobs on management clusters that have neither
kubernetes/enhancements#4370 implemented nor the Job controller disabled.
* Support cluster role sharing (worker & manager inside one cluster);
this is out of scope for this KEP. We will get back to the topic once
kubernetes/enhancements#4370 is merged and becomes a wider standard.
* Distribute running Jobs across multiple clusters, and reconcile partial
results in the Job objects on the management cluster (each Job will run on
a single worker cluster).

## Proposal

Introduce the MultiKueue AdmissionCheck, controller and configuration API.

Establish the need for a designated management cluster.

![Architecture](arch.png "Architecture")

For each workload coming to a ClusterQueue (with the MultiKueue AdmissionCheck enabled)
in the management cluster that gets past the preadmission phase of the
two-phase admission process (meaning that the global quota, the total amount of resources
that can be consumed across all clusters, is not exceeded),
the MultiKueue controller will clone the workload in the defined worker clusters and wait
until some Kueue instance running there admits it.
If a remote workload is admitted first, the job will be created
in that remote cluster with a `kueue.x-k8s.io/prebuilt-workload-name` label pointing to the clone.
The controller will then remove the workloads from the remaining worker clusters and allow the
single instance of the job to proceed. The workload will also be admitted in
the management cluster.
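
As an illustration, here is a minimal sketch (with hypothetical helper and variable names, not Kueue's actual implementation) of how the remote copy of a `batch/v1` Job could be prepared so that the worker's Kueue binds it to the already-created Workload clone instead of generating a new one:

```go
import (
    batchv1 "k8s.io/api/batch/v1"
)

// prepareRemoteJob is an illustrative helper: it copies the management-cluster
// Job and points it at the Workload clone on the worker via the
// kueue.x-k8s.io/prebuilt-workload-name label.
func prepareRemoteJob(job *batchv1.Job, workloadName string) *batchv1.Job {
    remote := job.DeepCopy()
    // Clear fields owned by the management cluster's API server.
    remote.ResourceVersion = ""
    remote.UID = ""
    if remote.Labels == nil {
        remote.Labels = map[string]string{}
    }
    remote.Labels["kueue.x-k8s.io/prebuilt-workload-name"] = workloadName
    return remote
}
```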

There will be no job controllers running in the management cluster, or they will be
disabled for the workloads coming to MultiKueue-enabled cluster queues via an annotation
or some other, yet to be decided, mechanism. By disabling we mean that the controller
will take no action on the selected objects (no pods or other objects will be created),
while allowing other controllers to update the objects' status as they see fit.

There will be just CRD/job definitions deployed. The MultiKueue controller will copy the status
of the job from the worker cluster, so that it will appear that the job
is running inside the management cluster. However, as there is no job controller,
no pods will be created in the management cluster. Nor will any controller
overwrite the status copied by the MultiKueue controller.

If the job, for whatever reason, is suspended or deleted in the management cluster,
it will be deleted from the worker cluster. Deletion or suspension of the job
only in the worker cluster will trigger global requeuing of the job.
Once the job finishes in the worker cluster, the job will also
finish in the management cluster.

### User Stories (Optional)

#### Story 1
As a Kueue user, I have clusters on different cloud providers and on-prem.
I would like to run computation-heavy jobs across all of them, wherever
I have free resources.

#### Story 2
As a Kueue user, I have clusters in multiple regions of the same cloud
provider. I would like to run workloads that require the newest GPUs,
whose on-demand availability is very volatile. The GPUs become available
at random times in random regions. I want to use ProvisioningRequest
to try to catch them.

### Risks and Mitigations
* Disabling the Job controller for all (or selected) objects may be problematic
in environments where access to the master configuration is limited (like GKE).
We are working on kubernetes/enhancements#4370
to establish an acceptable way of using a non-default controller (or none at all).

* etcd may not provide enough performance (writes/s) to handle very large
deployments with very high job throughput (above 1M jobs per day).

* The management cluster could be a single point of failure. The mitigations include:
  * Running multiple management clusters with infinite global quotas and
    correct, limiting worker-cluster local quotas.
  * Running multiple management clusters, with one leader and back-up clusters
    learning the state of the world from the worker clusters (not covered
    by this KEP).

## Design Details
MultiKueue will be enabled on a cluster queue using the admission check fields.
Just like ProvisioningRequest, MultiKueue will have its own configuration,
MultiKueueConfig, with the definition below. To allow reusing the same clusters
across many Kueues, an additional object, MultiKueueCluster, is added.

```go
type MultiKueueConfig struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`
    Spec MultiKueueConfigSpec `json:"spec,omitempty"`
}

type MultiKueueConfigSpec struct {
    // List of MultiKueueCluster names where the
    // workloads from the ClusterQueue should be distributed.
    Clusters []string `json:"clusters,omitempty"`
}

type MultiKueueCluster struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`
    Spec MultiKueueClusterSpec `json:"spec,omitempty"`
    Status MultiKueueClusterStatus `json:"status,omitempty"`
}

type LocationType string

const (
    // Location is the path on the disk of kueue-controller-manager.
    PathLocationType LocationType = "Path"

    // Location is the name of the secret inside the namespace in which the kueue controller
    // manager is running. The config should be stored in the "kubeconfig" key.
    SecretLocationType LocationType = "Secret"
)

type MultiKueueClusterSpec struct {
    // Information how to connect to the cluster.
    KubeConfig KubeConfig `json:"kubeConfig"`
}

type KubeConfig struct {
    // Location of the KubeConfig.
    Location string `json:"location"`

    // Type of the KubeConfig location.
    //
    // +kubebuilder:default=Secret
    // +kubebuilder:validation:Enum=Secret;Path
    LocationType LocationType `json:"locationType"`
}

type MultiKueueClusterStatus struct {
    Conditions []metav1.Condition `json:"conditions,omitempty" patchStrategy:"merge" patchMergeKey:"type"`
}
```
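
For illustration, a minimal sketch of how these objects could be instantiated using the types above; all names (and the layout of one kubeconfig Secret per worker) are hypothetical:

```go
// Example wiring (hypothetical names): one config listing two workers, and the
// definition of one of those workers, whose kubeconfig lives in a Secret named
// "worker-us-east-kubeconfig" (under the "kubeconfig" key) in the manager's namespace.
config := MultiKueueConfig{
    ObjectMeta: metav1.ObjectMeta{Name: "multikueue-test"},
    Spec: MultiKueueConfigSpec{
        Clusters: []string{"worker-us-east", "worker-eu-west"},
    },
}

workerUSEast := MultiKueueCluster{
    ObjectMeta: metav1.ObjectMeta{Name: "worker-us-east"},
    Spec: MultiKueueClusterSpec{
        KubeConfig: KubeConfig{
            Location:     "worker-us-east-kubeconfig",
            LocationType: SecretLocationType,
        },
    },
}
```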

The MultiKueue controller will monitor all cluster definitions and maintain
Kube clients for all of them. Any connectivity problems will be reported both in
the MultiKueueCluster status as well as in the AdmissionCheckStatus and Events. The MultiKueue controller
will make sure that whenever the kubeconfig is refreshed, the appropriate
clients will also be recreated.
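
A minimal sketch of how such a client could be (re)built from kubeconfig bytes, assuming client-go; the helper name is illustrative, not Kueue's actual code:

```go
import (
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

// buildWorkerClient is an illustrative helper. The kubeconfig bytes would come
// from the referenced Secret's "kubeconfig" key or from a file on disk,
// depending on the configured LocationType.
func buildWorkerClient(kubeconfig []byte) (*kubernetes.Clientset, error) {
    restConfig, err := clientcmd.RESTConfigFromKubeConfig(kubeconfig)
    if err != nil {
        return nil, err
    }
    return kubernetes.NewForConfig(restConfig)
}
```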

Creation of the kubeconfig files is outside of MultiKueue's scope, and is cloud
provider/environment dependent.

The MultiKueue controller, when pushing workloads to the worker clusters, will use the same
namespace and local queue names as were used in the management cluster. It is the user's
responsibility to set up the appropriate namespaces and local queues.
Worker ClusterQueue definitions may differ from those in the management cluster. For example,
quota settings may be specific to the given location, and/or the cluster queue may have different
admission checks, use ProvisioningRequest, etc.

When distributing the workloads across clusters, the MultiKueue controller will first create
Kueue's internal Workload object. Only after the workload is admitted and the other clusters
are cleaned up will the real job be created, to match the Workload. This guarantees
that the workload will not start in more than one cluster. The workload will
get an annotation stating where it is actually running.
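
The ordering can be summarized with the following sketch; the `workerClient` interface and all helper names are hypothetical, for illustration only:

```go
import (
    "context"

    kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// workerClient abstracts the per-cluster operations the controller needs.
// This method set is illustrative, not Kueue's actual interface.
type workerClient interface {
    CreateWorkload(ctx context.Context, wl *kueue.Workload) error
    DeleteWorkload(ctx context.Context, name string) error
    CreateJob(ctx context.Context, wl *kueue.Workload) error
    AdmittedWorkload(ctx context.Context, name string) (bool, error)
}

// distribute sketches the order of operations: clone the Workload to every
// worker, wait until one clone is admitted, delete the remaining clones, and
// only then create the real job on the winning cluster (carrying the
// kueue.x-k8s.io/prebuilt-workload-name label, as described above).
func distribute(ctx context.Context, wl *kueue.Workload, workers []workerClient) error {
    for _, w := range workers {
        if err := w.CreateWorkload(ctx, wl.DeepCopy()); err != nil {
            return err
        }
    }
    for ctx.Err() == nil { // poll for simplicity; real code would watch instead
        for _, w := range workers {
            admitted, err := w.AdmittedWorkload(ctx, wl.Name)
            if err != nil || !admitted {
                continue
            }
            for _, other := range workers {
                if other != w {
                    _ = other.DeleteWorkload(ctx, wl.Name) // clean up losing clones
                }
            }
            return w.CreateJob(ctx, wl)
        }
    }
    return ctx.Err()
}
```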

When the job is running, the MultiKueue controller will copy its status from the worker cluster
to the management cluster, to maintain the impression that the job is running in the management
cluster. This is needed to allow pipelines and workflow engines to execute against
the management cluster with MultiKueue without any extra changes.

If the connection between the management cluster and a worker cluster is lost, the management
cluster assumes the total loss of all running/admitted workloads and moves them back to
the non-admitted/queued state. Once the cluster is reconnected, the workloads are reconciled.
If there is enough global quota, the workloads found admitted on the worker will be re-admitted in
the management cluster. If not, some workloads will be preempted to meet the global quota.
In case of duplicates, all but one of them will be removed.

#### Follow-up ideas

* Handle a large number of clusters via selectors.
* Provide a plugin mechanism to control how the workloads are distributed across worker clusters.

### Test Plan
[x] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

#### Unit Tests
The code will adhere to regular best practices for unit tests and coverage.

#### Integration tests
Integration tests will be executed against mocked clients for the worker clusters
that will provide predefined responses and allow testing various error scenarios,
including situations like:

* Job is created across multiple clusters and admitted in one.
* Job is admitted at the same time by two clusters.
* Job is rejected by a cluster.
* Worker cluster doesn't have the corresponding namespace.
* Worker cluster doesn't have the corresponding local/cluster queue.
* Worker cluster is unresponsive.
* Worker cluster deletes the job.
* Job is correctly finished.
* Job finishes with an error.
* Job status changes frequently.

#### E2E tests
E2E tests should be created and cover similar use cases as the integration tests. To start,
they should focus on JobSet.

### Graduation Criteria
The feature starts at the alpha level, with a feature gate.

In the Alpha version, in the 0.6 release, MultiKueue will support:

* APIs as described above.
* Basic workload distribution across clusters.
* JobSet integration, with full status relay.

Other integrations may come in 0.6 (if lucky) or in following releases
of Kueue.

Graduation to beta criteria:
* Positive feedback from users.
* Most of the integrations supported.
* No major bugs or deficiencies are outstanding.
* Roadmap for missing features is defined.

## Implementation History
* 2023-11-28 Initial KEP.

## Drawbacks
MultiKueue has some drawbacks.
* Doesn't solve storage problems.
* Requires some manual work to sync configuration and authentication between clusters.
* Requires a management cluster.
* Requires some external work to disable job controller(s) in management clusters.
* Scalability and throughput depend on etcd.

## Alternatives
* Use Armada or Multi-Cluster App Dispatcher.
* Use multicluster-specific Job APIs.