sigs.k8s.io/kueue@v0.6.2/keps/168-pending-workloads-visibility/README.md

     1  # KEP-168: Pending workloads visibility
     2  
     3  <!--
     4  This is the title of your KEP. Keep it short, simple, and descriptive. A good
     5  title can help communicate what the KEP is and should be considered as part of
     6  any review.
     7  -->
     8  
     9  <!--
    10  A table of contents is helpful for quickly jumping to sections of a KEP and for
    11  highlighting any additional information provided beyond the standard KEP
    12  template.
    13  
Ensure the TOC is wrapped with
  <code>&lt;!-- toc --&gt;&lt;!-- /toc --&gt;</code>
tags, and then generate with `hack/update-toc.sh`.
    17  -->
    18  
    19  <!-- toc -->
    20  - [Summary](#summary)
    21  - [Motivation](#motivation)
    22    - [Goals](#goals)
    23    - [Non-Goals](#non-goals)
    24  - [Proposal](#proposal)
    25    - [User Stories (Optional)](#user-stories-optional)
    26      - [Story 1](#story-1)
    27      - [Story 2](#story-2)
    28    - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
    29    - [Risks and Mitigations](#risks-and-mitigations)
    30      - [Too large objects](#too-large-objects)
    31      - [Status updates for pending workloads slowing down other operations](#status-updates-for-pending-workloads-slowing-down-other-operations)
    32      - [Large number of API requests triggered after workload admissions](#large-number-of-api-requests-triggered-after-workload-admissions)
    33  - [Design Details](#design-details)
    34    - [Local Queue API](#local-queue-api)
    35    - [Cluster Queue API](#cluster-queue-api)
    36    - [Configuration API](#configuration-api)
    37    - [In-memory snapshot of the ClusterQueue](#in-memory-snapshot-of-the-clusterqueue)
    38    - [Throttling of status updates](#throttling-of-status-updates)
    39    - [Choosing the limits and defaults for MaxCount](#choosing-the-limits-and-defaults-for-maxcount)
    40    - [Limitation of the approach](#limitation-of-the-approach)
    41    - [Test Plan](#test-plan)
    42        - [Prerequisite testing updates](#prerequisite-testing-updates)
    43      - [Unit Tests](#unit-tests)
    44      - [Integration tests](#integration-tests)
    45    - [Graduation Criteria](#graduation-criteria)
    46      - [Beta](#beta)
    47      - [Stable](#stable)
    48  - [Implementation History](#implementation-history)
    49  - [Drawbacks](#drawbacks)
    50  - [Alternatives](#alternatives)
    51    - [Alternative approaches](#alternative-approaches)
    52      - [Coarse-grained ordering information per workload in workload status](#coarse-grained-ordering-information-per-workload-in-workload-status)
    53      - [Ordering information per workload in events or metrics](#ordering-information-per-workload-in-events-or-metrics)
    54      - [On-demand http endpoint](#on-demand-http-endpoint)
    55    - [Alternatives within the proposal](#alternatives-within-the-proposal)
    56      - [Unlimited MaxCount parameter](#unlimited-maxcount-parameter)
    57      - [Expose the pending workloads only for LocalQueues](#expose-the-pending-workloads-only-for-localqueues)
    58      - [Do not expose ClusterQueue positions in LocalQueues](#do-not-expose-clusterqueue-positions-in-localqueues)
    59      - [Use self-balancing search trees for ClusterQueue representation](#use-self-balancing-search-trees-for-clusterqueue-representation)
    60  <!-- /toc -->
    61  
    62  ## Summary
    63  
    64  The enhancement extends the API of LocalQueue and ClusterQueue to expose the
    65  information about the order of their pending workloads.
    66  
    67  ## Motivation
    68  
    69  Currently, there is no visibility of the contents of the queues. This is
    70  problematic for Kueue users, who have no means to estimate when their jobs will
    71  start. Also, it is problematic for administrators, who would like to monitor
    72  the pipeline of pending jobs, and help users to debug issues.
    73  
    74  <!--
    75  This section is for explicitly listing the motivation, goals, and non-goals of
    76  this KEP.  Describe why the change is important and the benefits to users. The
    77  motivation section can optionally provide links to [experience reports] to
    78  demonstrate the interest in a KEP within the wider Kubernetes community.
    79  
    80  [experience reports]: https://github.com/golang/go/wiki/ExperienceReports
    81  -->
    82  
    83  ### Goals
    84  
    85  - expose the order of workloads in the LocalQueue and ClusterQueue
    86  
    87  <!--
    88  List the specific goals of the KEP. What is it trying to achieve? How will we
    89  know that this has succeeded?
    90  -->
    91  
    92  ### Non-Goals
    93  
- expose the information about the workload position for each pending workload
  in the case of very long queues
    96  
    97  <!--
    98  What is out of scope for this KEP? Listing non-goals helps to focus discussion
    99  and make progress.
   100  -->
   101  
   102  ## Proposal
   103  
The proposal is to extend the status APIs of LocalQueue and ClusterQueue
to expose the order of pending workloads. The order is exposed only up to a
configurable depth, in order to keep the size of the information constrained.
   108  
   109  <!--
   110  This is where we get down to the specifics of what the proposal actually is.
   111  This should have enough detail that reviewers can understand exactly what
   112  you're proposing, but should not include things like API designs or
   113  implementation. What is the desired outcome and how do we measure success?.
   114  The "Design Details" section below is for the real
   115  nitty-gritty.
   116  -->
   117  
   118  ### User Stories (Optional)
   119  
   120  <!--
   121  Detail the things that people will be able to do if this KEP is implemented.
   122  Include as much detail as possible so that people can understand the "how" of
   123  the system. The goal here is to make this feel real for users without getting
   124  bogged down.
   125  -->
   126  
   127  #### Story 1
   128  
As a user of Kueue with LocalQueue visibility only, I would like to know the
position of my workload in the ClusterQueue, which I have no direct visibility
into. Knowing the position, and assuming a stable velocity of the ClusterQueue,
would allow me to estimate the arrival time of my workload.
   133  
   134  #### Story 2
   135  
As an administrator of Kueue with ClusterQueue visibility, I would like to be
able to check directly and compare the positions of pending workloads in the
queue. This will help me answer users' questions about their workloads.

Note that merging the information exposed by individual LocalQueues is not
enough, because they may show inconsistent data due to delays in updates. For
example, two workloads in different LocalQueues may report the same position
in the ClusterQueue.
   144  
   145  ### Notes/Constraints/Caveats (Optional)
   146  
   147  <!--
   148  What are the caveats to the proposal?
   149  What are some important details that didn't come across above?
   150  Go in to as much detail as necessary here.
   151  This might be a good place to talk about core concepts and how they relate.
   152  -->
   153  
   154  ### Risks and Mitigations
   155  
   156  #### Too large objects
   157  
As the number of pending workloads can be arbitrarily large, there is a risk
that the status information about the workloads may exceed the etcd limit of
1.5Mi on object size.

Exceeding the etcd limit creates a risk that LocalQueue controller updates
fail.

In order to mitigate this risk we introduce the `MaxCount` configuration
parameter to limit the maximal number of pending workloads in the status.
Additionally, we limit the maximal value of the parameter to 4000; see
also [Choosing the limits and defaults for MaxCount](#choosing-the-limits-and-defaults-for-maxcount).
   169  
We should also note that large queue objects might be problematic for the
Kubernetes API server even if the etcd limit is not exceeded, for example,
when there are many LocalQueue instances with watches, because in that case
the entire LocalQueue objects need to be sent through the watch channels.
   174  
   175  To mitigate this risk we also extend the Kueue's user-facing documentation to
   176  warn about setting this number high on clusters with many LocalQueue instances,
   177  especially, when watches on the objects are used.
   178  
   179  #### Status updates for pending workloads slowing down other operations
   180  
   181  The operation of computing and updating the list of top pending workloads can
   182  have a degrading impact on the overall performance of other Kueue operations.
   183  
This risk exists because the operation requires iterating over the contents of
the ClusterQueue, which requires a read lock on the queue. Also, positional
changes to the list of pending workloads may require frequent updates if we
attempt to keep the information up-to-date.
   188  
In order to mitigate the risk we maintain the statuses on a best-effort basis,
and issue at most one update request per configured interval;
see [throttling of status updates](#throttling-of-status-updates).

Additionally, we periodically take an in-memory snapshot of the ClusterQueue to
allow generating the status with `MaxCount` elements for LocalQueues and
ClusterQueues without holding the read lock for a prolonged time; see
[In-memory snapshot of the ClusterQueue](#in-memory-snapshot-of-the-clusterqueue).
   197  
   198  #### Large number of API requests triggered after workload admissions
   199  
In a scenario where multiple LocalQueues point to the same ClusterQueue, a
workload that is admitted in one LocalQueue shifts the positions of pending
workloads in the other LocalQueues. In the worst case, updating the LocalQueue
statuses with the new positions requires as many API requests as there are
LocalQueues. In particular, sending over 100 requests after a workload
admission would degrade Kueue performance.
   206  
   207  First, we propose to batch the LocalQueue updates by time intervals. This helps
   208  to avoid sending API requests per LocalQueue if the positions are shifted
   209  multiple times in a short period of time.
   210  
Second, we introduce the `MaxPosition` configuration parameter. With this
parameter, the number of LocalQueues requiring an update can be controlled,
because only LocalQueues with workloads at the top positions require an update.

Finally, setting the `MaxCount` parameter for LocalQueues to 0 stops
visibility updates to LocalQueues entirely.
   217  
   218  <!--
   219  What are the risks of this proposal, and how do we mitigate? Think broadly.
   220  For example, consider both security and how this will impact the larger
   221  Kubernetes ecosystem.
   222  
   223  How will security be reviewed, and by whom?
   224  
   225  How will UX be reviewed, and by whom?
   226  
   227  Consider including folks who also work outside the SIG or subproject.
   228  -->
   229  
   230  ## Design Details
   231  
The status APIs of LocalQueue and ClusterQueue are extended with structures
which contain the list of pending workloads. In the case of the LocalQueue,
the workload's position in the ClusterQueue is also exposed.

Updates to the structures are throttled, allowing for at most one update within
a configured interval. Additionally, we periodically take an in-memory snapshot
of the ClusterQueue.
   239  
   240  ### Local Queue API
   241  
   242  ```golang
   243  // LocalQueuePendingWorkload contains the information identifying a pending
   244  // workload in the local queue.
   245  type LocalQueuePendingWorkload struct {
   246  	// Name indicates the name of the pending workload.
   247  	Name string
   248  
   249  	// Position indicates the position of the workload in the cluster queue.
   250  	Position *int32
   251  }
   252  
   253  type LocalQueuePendingWorkloadsStatus struct {
   254  	// Head contains the list of top pending workloads.
   255  	// +listType=map
   256  	// +listMapKey=name
   257  	// +optional
   258  	Head []LocalQueuePendingWorkload
   259  
   260  	// LastChangeTime indicates the time of the last change of the structure.
   261  	LastChangeTime metav1.Time
   262  }
   263  
   264  // LocalQueueStatus defines the observed state of LocalQueue
   265  type LocalQueueStatus struct {
   266  ...
   267  	// PendingWorkloadsStatus contains the information exposed about the current
   268  	// status of pending workloads in the local queue.
   269  	// +optional
   270  	PendingWorkloadsStatus *LocalQueuePendingWorkloadsStatus
   271  ...
   272  }
   273  ```
   274  
   275  ### Cluster Queue API
   276  
   277  ```golang
   278  // ClusterQueuePendingWorkload contains the information identifying a pending workload
   279  // in the cluster queue.
   280  type ClusterQueuePendingWorkload struct {
   281  	// Name indicates the name of the pending workload.
   282  	Name string
   283  
	// Namespace indicates the namespace of the pending workload.
   285  	Namespace string
   286  }
   287  
   288  type ClusterQueuePendingWorkloadsStatus struct {
   289  	// Head contains the list of top pending workloads.
   290  	// +listType=map
   291  	// +listMapKey=name
   292  	// +listMapKey=namespace
   293  	// +optional
   294  	Head []ClusterQueuePendingWorkload
   295  
   296  	// LastChangeTime indicates the time of the last change of the structure.
   297  	LastChangeTime metav1.Time
   298  }
   299  
// ClusterQueueStatus defines the observed state of ClusterQueue
   301  type ClusterQueueStatus struct {
   302  ...
   303  	// PendingWorkloadsStatus contains the information exposed about the current
   304  	// status of the pending workloads in the cluster queue.
   305  	// +optional
   306  	PendingWorkloadsStatus *ClusterQueuePendingWorkloadsStatus
   307  ...
   308  }
   309  ```
   310  
   311  ### Configuration API
   312  
   313  ```golang
   314  // Configuration is the Schema for the kueueconfigurations API
   315  type Configuration struct {
   316  ...
   317  	// QueueVisibility is configuration to expose the information about the top
   318  	// pending workloads.
   319  	QueueVisibility *QueueVisibility
   320  }
   321  
   322  type QueueVisibility struct {
   323  	// LocalQueues is configuration to expose the information
   324  	// about the top pending workloads in the local queue.
   325  	LocalQueues *LocalQueueVisibility
   326  
   327  	// ClusterQueues is configuration to expose the information
   328  	// about the top pending workloads in the cluster queue.
   329  	ClusterQueues *ClusterQueueVisibility
   330  
   331  	// UpdateInterval specifies the time interval for updates to the structure
   332  	// of the top pending workloads in the queues.
   333  	// Defaults to 5s.
   334  	UpdateInterval time.Duration
   335  }
   336  
   337  type LocalQueueVisibility struct {
   338  	// MaxCount indicates the maximal number of pending workloads exposed in the
   339  	// local queue status. When the value is set to 0, then LocalQueue visibility
   340  	// updates are disabled.
   341  	// The maximal value is 4000.
   342  	// Defaults to 10.
   343  	MaxCount int32
   344  
   345  	// MaxPosition indicates the maximal position of the workload in the cluster
   346  	// queue returned in the head.
   347  	MaxPosition *int32
   348  }
   349  
   350  type ClusterQueueVisibility struct {
	// MaxCount indicates the maximal number of pending workloads exposed in the
	// cluster queue status. When the value is set to 0, then ClusterQueue
	// visibility updates are disabled.
   354  	// The maximal value is 4000.
   355  	// Defaults to 10.
   356  	MaxCount int32
   357  }
   358  ```
   359  
   360  ### In-memory snapshot of the ClusterQueue
   361  
In order to quickly compute the top pending workloads per LocalQueue, without
the need for a prolonged read lock on the ClusterQueue, we periodically create
an in-memory snapshot of the ClusterQueue, organized as a map from the
LocalQueue to the list of workloads belonging to the ClusterQueue, along with
their positions. The LocalQueue and ClusterQueue controllers then do lookups
into the cached structure.
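
A minimal sketch of such a snapshot structure (the type and field names here
are illustrative assumptions, not Kueue's actual implementation):

```golang
// Snapshot is an illustrative, read-only view of a ClusterQueue's pending
// workloads, keyed by LocalQueue name. It is rebuilt periodically so that
// controllers can look up positions without holding the ClusterQueue lock.
type Snapshot struct {
	// PendingByLocalQueue maps a LocalQueue name to its pending workloads,
	// ordered by, and annotated with, their position in the ClusterQueue.
	PendingByLocalQueue map[string][]PendingEntry
}

// PendingEntry identifies one pending workload and its queue position.
type PendingEntry struct {
	Name     string
	Position int32 // 0-based position in the ClusterQueue
}

// TopN returns up to max entries for the given LocalQueue; the slice is
// assumed to be ordered by Position already.
func (s *Snapshot) TopN(localQueue string, max int) []PendingEntry {
	entries := s.PendingByLocalQueue[localQueue]
	if len(entries) > max {
		entries = entries[:max]
	}
	return entries
}
```

The LocalQueue controller would then serve its `Head` by calling something
like `TopN(name, maxCount)` against the latest snapshot.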
   368  
The snapshots are taken periodically, per ClusterQueue, by multiple workers
processing a queue of snapshot-taking tasks. The tasks are re-enqueued to the
queue with a `QueueVisibility.UpdateInterval` delay just after the previous
snapshot is taken, for as long as the given ClusterQueue exists.
   373  
The model of using snapshot workers allows controlling the number of snapshot
updates after Kueue startup, and thus the cascade of ClusterQueue updates. The
number of workers is 5.

Note that taking the snapshot requires holding the ClusterQueue read lock
only for the duration of copying the underlying heap data.
   380  
   381  When `MaxCount` for both LocalQueues and ClusterQueues is 0, then the feature
   382  is disabled, and the snapshot is not computed.
   383  
   384  ### Throttling of status updates
   385  
   386  The updates to the structure of top pending workloads for LocalQueue (or
   387  ClusterQueue) are managed by the LocalQueue controller (or ClusterQueue controller)
   388  and are part of regular status updates of the queue.
   389  
   390  The updates to the structure of the pending workloads are generated based on the
   391  periodically taken snapshot.
   392  
In particular, when a LocalQueue reconciles and `LastChangeTime` indicates
that `QueueVisibility.UpdateInterval` has elapsed, we generate the new structure
based on the snapshot. If there is a change to the structure, `LastChangeTime`
is bumped and the update request is sent. If there is no change to the
structure, the controller enqueues another reconciliation for when the snapshot
is regenerated.
   399  
   400  ### Choosing the limits and defaults for MaxCount
   401  
   402  One constraining factor for the default for `MaxCount` is the maximal object
   403  size for etcd, see [Too large objects](#too-large-objects).
   404  
A similar consideration was done for the [Backoff Limit Per Index](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs#the-job-object-too-big)
feature, where we set the parameter limits to constrain the worst-case size of
the object to around 500Ki. Such an approach stays relatively far from the
1.5Mi limit and allows for future extensions of the structures.

Following this approach, in the case of Kueue we limit the `MaxCount`
parameter to `4000` for ClusterQueues and LocalQueues. This translates to
around `4000*63*2=0.48Mi` for ClusterQueues, and `4000*(63+4)=0.26Mi` for
LocalQueues.
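
The worst-case numbers above can be reproduced with a short calculation,
where 63 is the maximal length of a name or namespace, and 4 bytes approximate
the int32 position stored per LocalQueue entry:

```golang
const (
	maxCount    = 4000 // proposed limit for MaxCount
	maxNameLen  = 63   // maximal length of a workload name or namespace
	positionLen = 4    // int32 position stored per LocalQueue entry
)

// ClusterQueue entries carry a name and a namespace each.
const clusterQueueBytes = maxCount * maxNameLen * 2 // 504000B, about 0.48Mi

// LocalQueue entries carry a name and a position each.
const localQueueBytes = maxCount * (maxNameLen + positionLen) // 268000B, about 0.26Mi
```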
   414  
The defaults are tuned for lower-scale usage in order to minimize the risk of
issues when upgrading Kueue, as the feature is going to be enabled by default.
For comparison, Backoff Limit Per Index is opted into per Job, so the
consequences of issues are smaller than when a feature is enabled for all
workloads.

Similarly, we default the `MaxPosition` configuration parameter for LocalQueues
to `10`. This parameter allows controlling the number of LocalQueues which are
updated after a workload admission (see also:
[Large number of API requests triggered after workload admissions](#large-number-of-api-requests-triggered-after-workload-admissions)).
   425  
   426  Enabling the feature by default will allow more users to discover the feature.
   427  Then, based on their needs and setup they can increase the `MaxCount` and
   428  `MaxPosition` parameters.
   429  
   430  ### Limitation of the approach
   431  
We acknowledge the limitation of the proposed approach that only the top N
workloads are exposed. This might be problematic for some large-scale setups.
   434  
   435  This means that the feature may be superseded by one of the
   436  [Alternative approaches](#alternative-approaches) in the future, and potentially
   437  be deprecated.
   438  
   439  Still, we believe it makes sense to proceed with the proposed approach as it is
   440  relatively simple to implement, and will already start providing value to
   441  the Kueue users with relatively small setups.
   442  
Finally, the proposed solution is likely to co-exist with another alternative,
because it remains advantageous at smaller scale. Moreover, the internal code
extensions, such as the in-memory snapshot of the ClusterQueue, are likely to
be reused as building blocks for other approaches.
   447  
   448  <!--
   449  This section should contain enough information that the specifics of your
   450  change are understandable. This may include API specs (though not always
   451  required) or even code snippets. If there's any ambiguity about HOW your
   452  proposal will be implemented, this is the place to discuss them.
   453  -->
   454  
   455  ### Test Plan
   456  
   457  <!--
   458  **Note:** *Not required until targeted at a release.*
   459  The goal is to ensure that we don't accept enhancements with inadequate testing.
   460  
   461  All code is expected to have adequate tests (eventually with coverage
   462  expectations). Please adhere to the [Kubernetes testing guidelines][testing-guidelines]
   463  when drafting this test plan.
   464  
   465  [testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md
   466  -->
   467  
   468  [x] I/we understand the owners of the involved components may require updates to
   469  existing tests to make this code solid enough prior to committing the changes necessary
   470  to implement this enhancement.
   471  
   472  ##### Prerequisite testing updates
   473  
   474  <!--
   475  Based on reviewers feedback describe what additional tests need to be added prior
   476  implementing this enhancement to ensure the enhancements have also solid foundations.
   477  -->
   478  
   479  #### Unit Tests
   480  
   481  <!--
   482  In principle every added code should have complete unit test coverage, so providing
   483  the exact set of tests will not bring additional value.
   484  However, if complete unit test coverage is not possible, explain the reason of it
   485  together with explanation why this is acceptable.
   486  -->
   487  
   488  <!--
   489  Additionally, try to enumerate the core package you will be touching
   490  to implement this enhancement and provide the current unit coverage for those
   491  in the form of:
   492  - <package>: <date> - <current test coverage>
   493  
   494  This can inform certain test coverage improvements that we want to do before
   495  extending the production code to implement this enhancement.
   496  -->
   497  
   498  - `<package>`: `<date>` - `<test coverage>`
   499  
   500  #### Integration tests
   501  
The integration tests will cover the following scenarios:
- the LocalQueue status is updated when a workload in this LocalQueue is added,
  preempted or admitted,
- the addition of a workload to one LocalQueue triggers an update of the
  structure in another LocalQueue connected to the same ClusterQueue,
- changes of workload positions beyond the configured threshold for top
  pending workloads don't trigger an update of the pending workloads status.
   509  
   510  <!--
   511  Describe what tests will be added to ensure proper quality of the enhancement.
   512  
   513  After the implementation PR is merged, add the names of the tests here.
   514  -->
   515  
   516  ### Graduation Criteria
   517  
   518  #### Beta
   519  
   520  First iteration (0.5):
   521  
   522  - support visibility for ClusterQueues
   523  
   524  Second iteration (0.6):
   525  
- support visibility for LocalQueues, but without positions, to avoid the
  complication of mitigating the risk [Large number of API requests triggered after workload admissions](#large-number-of-api-requests-triggered-after-workload-admissions)
   528  
   529  Third iteration (0.7):
   530  
   531  - reevaluate the need for exposing positions and support if needed
   532  
   533  #### Stable
   534  
   535  - drop the feature gate
   536  
   537  <!--
   538  
   539  Clearly define what it means for the feature to be implemented and
   540  considered stable.
   541  
   542  If the feature you are introducing has high complexity, consider adding graduation
   543  milestones with these graduation criteria:
   544  - [Maturity levels (`alpha`, `beta`, `stable`)][maturity-levels]
   545  - [Feature gate][feature gate] lifecycle
   546  - [Deprecation policy][deprecation-policy]
   547  
   548  [feature gate]: https://git.k8s.io/community/contributors/devel/sig-architecture/feature-gates.md
   549  [maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions
   550  [deprecation-policy]: https://kubernetes.io/docs/reference/using-api/deprecation-policy/
   551  -->
   552  
   553  ## Implementation History
   554  
   555  <!--
   556  Major milestones in the lifecycle of a KEP should be tracked in this section.
   557  Major milestones might include:
   558  - the `Summary` and `Motivation` sections being merged, signaling SIG acceptance
   559  - the `Proposal` section being merged, signaling agreement on a proposed design
   560  - the date implementation started
   561  - the first Kubernetes release where an initial version of the KEP was available
   562  - the version of Kubernetes where the KEP graduated to general availability
   563  - when the KEP was retired or superseded
   564  -->
   565  
   566  ## Drawbacks
   567  
   568  <!--
   569  Why should this KEP _not_ be implemented?
   570  -->
   571  
   572  ## Alternatives
   573  
   574  ### Alternative approaches
   575  
The alternatives are designed to address the limitation on the maximal number
of pending workloads which is returned in the status.
   578  
   579  #### Coarse-grained ordering information per workload in workload status
   580  
   581  The idea is to distribute the ordering information among workloads to avoid
   582  keeping the ordering information centralized, thus avoiding creating objects
   583  constrained by the etcd limit.
   584  
The main complication with distributing the ordering information is that a
workload admission, or a new workload with a high priority, can shift the
entire ordering, warranting update requests to all workloads in the queue. This
could mean cascades of thousands of requests after such an event.

The proposal to control the number of update requests to workloads, when a
workload is admitted or added, is to bucket workload positions. The bucket
intervals could grow exponentially, allowing for a logarithmic number of
requests. With this approach, the number of requests needed to update workloads
is limited by the number of buckets, as only the workloads on bucket boundaries
are updated.
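
Such exponential bucketing could look as follows (an illustrative sketch, not
part of the proposal):

```golang
// bucketOf returns the base-2 exponential bucket for a 1-based queue
// position: position 1 maps to bucket 0, positions 2-3 to bucket 1,
// 4-7 to bucket 2, and so on. A workload's status would only need an
// update when its position crosses a bucket boundary.
func bucketOf(position int) int {
	bucket := 0
	for position > 1 {
		position /= 2
		bucket++
	}
	return bucket
}
```

With 1000 pending workloads this yields buckets 0 through 9, matching the
roughly 10 update requests mentioned below.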
   595  
   596  The update requests could be sent by a periodic routine which iterates over the
   597  cluster queue and triggers workload reconciliation for workloads for which the
   598  ordering is changed.
   599  
Pros:
- allows exposing the ordering information for all workloads, guaranteeing that
  users know their workload's position even if it is beyond the top N threshold
  of the proposed approach.

Cons:
- it requires a substantial number of requests when a workload is admitted, or
  a high-priority workload is inserted. For example, assuming 1000 workloads
  and exponential bucketing with base 2, this is 10 requests.
- it is not clear if the coarse-grained information would satisfy user
  expectations. For example, a user may need to wait long to observe the
  reduction of a bucket.
- an external system which wants to display a pipeline of workloads needs to
  fetch all workloads. Similarly, a system which wants to list the top 10
  workloads may need to query all workloads.
- a natural extension of the mechanism to return an ETA in the workload status
  may also increase the number of requests in a less controlled way.
   617  
   618  #### Ordering information per workload in events or metrics
   619  
The motivation for this approach is similar to that of distributing the
information in workload statuses. However, it builds on the assumption that
update requests are more costly than events or metric updates. For example,
sending events or updating metrics does not trigger a workload reconciliation.
   624  
   625  Pros:
- more lightweight than updating the workload status.

Cons:
- an API based on events or metrics would be less convenient for end users than
  an object-based one.
- it probably still requires bucketing, thus inheriting the usability cons
  related to bucketing from the workload status approach.
   633  
   634  #### On-demand http endpoint
   635  
The idea is that Kueue exposes an endpoint which allows fetching the ordering
information for all pending workloads, or for selected workloads.
   638  
   639  Pros:
- eliminates wasting QPS on updating Kubernetes objects
   641  
   642  Cons:
- the API would lack API server features, such as watches, Priority and
  Fairness throttling, and load-balancing. Also, ensuring the security of the
  new endpoint might be more involved, making it technically challenging.
   646  
One possible way to deal with the security concern of the
[On-demand http endpoint](#on-demand-http-endpoint) is to use an
   649  [Extension API Server](https://kubernetes.io/docs/tasks/extend-kubernetes/setup-extension-api-server/),
   650  exposed via
   651  [API Aggregation Layer](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/apiserver-aggregation/).
   652  Then, the aggregation layer could take the responsibility of authenticating and
   653  authorizing the requests.
   654  
   655  ### Alternatives within the proposal
   656  
Here are some alternatives considered for smaller problems within the scope of
the proposal.
   659  
   660  #### Unlimited MaxCount parameter
   661  
The `MaxCount` parameter constrains the maximal size of the ClusterQueue and
LocalQueue statuses to ensure that the etcd object size limit is not exceeded;
see [Too large objects](#too-large-objects).

The actual maximal number might depend on the lengths of the namespace and
workload names. Such names will typically be far from the maximum. In
particular, the namespaces might be created based on team names, which may have
an internal policy of not exceeding, say, 100 characters. In that case, the
estimation would be too constraining. We propose to add a soft warning when
2000 is exceeded, and to warn in the documentation.
   672  
   673  **Reasons for discarding/deferring**
   674  
Setting hard limits for the parameters prevents users from crashing their
systems. We will re-evaluate the decision based on user feedback. One
alternative is to make the limit soft rather than hard. Another is to implement
and support an alternative solution for large-scale usage.
   679  
   680  #### Expose the pending workloads only for LocalQueues
   681  
It was proposed that, for administrators with full access to the cluster, we
could have alternative approaches which don't involve the status of the
ClusterQueue.
   685  
   686  **Reasons for discarding/deferring**
   687  
The solution proposed for LocalQueues is easy to transfer to ClusterQueues.
Developing another approach focused just on admins might be problematic.
   690  
   691  #### Do not expose ClusterQueue positions in LocalQueues
   692  
It was proposed that, without exposing the positions in the ClusterQueue, we
don't need to update LocalQueues when workloads from another LocalQueue are
admitted or added. Additionally, positional information does not reveal much
about the actual time until the workloads are admitted, as the other workloads
might be small or big.
   698  
   699  **Reasons for discarding/deferring**
   700  
First, knowing the positional information gives some hints about the expected
arrival time, especially as users of the system gain experience about the
velocity of the ClusterQueue. In particular, it could be estimated, based on
historical data, that 10 workloads are admitted every 1h. It already makes a
difference whether a user knows that their workload is positioned 1 or 100.

With the throttling of updates to the list of pending workloads, changes in
positional information will not trigger too many status updates.

Also, even without positional information it is possible that an update is
needed, because while one workload is admitted another one is added. Such
situations would require additional updates anyway, so we should introduce
some throttling mechanism for updates regardless.
   715  
   716  #### Use self-balancing search trees for ClusterQueue representation
   717  
Self-balancing search trees could be used to quickly provide the list of top
workloads in the ClusterQueue.
   720  
   721  **Reasons for discarding/deferring**
   722  
   723  It does not solve the issue of exposing the information for LocalQueues. If
   724  we have many (or just multiple) LocalQueues pointing to the same ClusterQueue,
   725  each of them would need to take a read lock for the iteration, and potentially
   726  iterate over the entire ClusterQueue.
   727  
   728  <!--
   729  What other approaches did you consider, and why did you rule them out? These do
   730  not need to be as detailed as the proposal, but should include enough
   731  information to express the idea and why it was not acceptable.
   732  -->