sigs.k8s.io/kueue@v0.6.2/keps/168-pending-workloads-visibility/README.md

     1  # KEP-168: Pending workloads visibility
     2  
     3  <!--
     4  This is the title of your KEP. Keep it short, simple, and descriptive. A good
     5  title can help communicate what the KEP is and should be considered as part of
     6  any review.
     7  -->
     8  
     9  <!--
    10  A table of contents is helpful for quickly jumping to sections of a KEP and for
    11  highlighting any additional information provided beyond the standard KEP
    12  template.
    13  
Ensure the TOC is wrapped with
  <code>&lt;!-- toc --&gt;&lt;!-- /toc --&gt;</code>
tags, and then generate with `hack/update-toc.sh`.
    17  -->
    18  
    19  <!-- toc -->
    20  - [Summary](#summary)
    21  - [Motivation](#motivation)
    22    - [Goals](#goals)
    23    - [Non-Goals](#non-goals)
    24  - [Proposal](#proposal)
    25    - [User Stories (Optional)](#user-stories-optional)
    26      - [Story 1](#story-1)
    27      - [Story 2](#story-2)
    28    - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
    29    - [Risks and Mitigations](#risks-and-mitigations)
    30      - [Too large objects](#too-large-objects)
    31      - [Status updates for pending workloads slowing down other operations](#status-updates-for-pending-workloads-slowing-down-other-operations)
    32      - [Large number of API requests triggered after workload admissions](#large-number-of-api-requests-triggered-after-workload-admissions)
    33  - [Design Details](#design-details)
    34    - [Local Queue API](#local-queue-api)
    35    - [Cluster Queue API](#cluster-queue-api)
    36    - [Configuration API](#configuration-api)
    37    - [In-memory snapshot of the ClusterQueue](#in-memory-snapshot-of-the-clusterqueue)
    38    - [Throttling of status updates](#throttling-of-status-updates)
    39    - [Choosing the limits and defaults for MaxCount](#choosing-the-limits-and-defaults-for-maxcount)
    40    - [Limitation of the approach](#limitation-of-the-approach)
    41    - [Test Plan](#test-plan)
    42        - [Prerequisite testing updates](#prerequisite-testing-updates)
    43      - [Unit Tests](#unit-tests)
    44      - [Integration tests](#integration-tests)
    45    - [Graduation Criteria](#graduation-criteria)
    46      - [Beta](#beta)
    47      - [Stable](#stable)
    48  - [Implementation History](#implementation-history)
    49  - [Drawbacks](#drawbacks)
    50  - [Alternatives](#alternatives)
    51    - [Alternative approaches](#alternative-approaches)
    52      - [Coarse-grained ordering information per workload in workload status](#coarse-grained-ordering-information-per-workload-in-workload-status)
    53      - [Ordering information per workload in events or metrics](#ordering-information-per-workload-in-events-or-metrics)
    54      - [On-demand http endpoint](#on-demand-http-endpoint)
    55    - [Alternatives within the proposal](#alternatives-within-the-proposal)
    56      - [Unlimited MaxCount parameter](#unlimited-maxcount-parameter)
    57      - [Expose the pending workloads only for LocalQueues](#expose-the-pending-workloads-only-for-localqueues)
    58      - [Do not expose ClusterQueue positions in LocalQueues](#do-not-expose-clusterqueue-positions-in-localqueues)
    59      - [Use self-balancing search trees for ClusterQueue representation](#use-self-balancing-search-trees-for-clusterqueue-representation)
    60  <!-- /toc -->
    61  
    62  ## Summary
    63  
    64  The enhancement extends the API of LocalQueue and ClusterQueue to expose the
    65  information about the order of their pending workloads.
    66  
    67  ## Motivation
    68  
    69  Currently, there is no visibility of the contents of the queues. This is
    70  problematic for Kueue users, who have no means to estimate when their jobs will
    71  start. Also, it is problematic for administrators, who would like to monitor
    72  the pipeline of pending jobs, and help users to debug issues.
    73  
    74  <!--
    75  This section is for explicitly listing the motivation, goals, and non-goals of
    76  this KEP.  Describe why the change is important and the benefits to users. The
    77  motivation section can optionally provide links to [experience reports] to
    78  demonstrate the interest in a KEP within the wider Kubernetes community.
    79  
    80  [experience reports]: https://github.com/golang/go/wiki/ExperienceReports
    81  -->
    82  
    83  ### Goals
    84  
    85  - expose the order of workloads in the LocalQueue and ClusterQueue
    86  
    87  <!--
    88  List the specific goals of the KEP. What is it trying to achieve? How will we
    89  know that this has succeeded?
    90  -->
    91  
    92  ### Non-Goals
    93  
- expose the information about the workload position for each pending workload
  in the case of very long queues
    96  
    97  <!--
    98  What is out of scope for this KEP? Listing non-goals helps to focus discussion
    99  and make progress.
   100  -->
   101  
   102  ## Proposal
   103  
The proposal is to extend the status APIs of LocalQueue and ClusterQueue
to expose the order of pending workloads. The order is exposed only up to a
configurable depth, in order to keep the size of the information constrained.
   108  
   109  <!--
   110  This is where we get down to the specifics of what the proposal actually is.
   111  This should have enough detail that reviewers can understand exactly what
   112  you're proposing, but should not include things like API designs or
   113  implementation. What is the desired outcome and how do we measure success?.
   114  The "Design Details" section below is for the real
   115  nitty-gritty.
   116  -->
   117  
   118  ### User Stories (Optional)
   119  
   120  <!--
   121  Detail the things that people will be able to do if this KEP is implemented.
   122  Include as much detail as possible so that people can understand the "how" of
   123  the system. The goal here is to make this feel real for users without getting
   124  bogged down.
   125  -->
   126  
   127  #### Story 1
   128  
As a user of Kueue with LocalQueue visibility only, I would like to know the
position of my workload in the ClusterQueue, which I have no direct visibility
into. Knowing the position, and assuming a stable velocity of the ClusterQueue,
would allow me to estimate the arrival time of my workload.
   133  
   134  #### Story 2
   135  
As an administrator of Kueue with ClusterQueue visibility, I would like to be
able to check directly and compare the positions of pending workloads in the
queue. This will help me answer users' questions about their workloads.

Note that merging the information exposed by individual LocalQueues is not
enough, because they may show inconsistent data due to delays in updates. For
example, two workloads in different LocalQueues may report the same position
in the ClusterQueue.
   144  
   145  ### Notes/Constraints/Caveats (Optional)
   146  
   147  <!--
   148  What are the caveats to the proposal?
   149  What are some important details that didn't come across above?
   150  Go in to as much detail as necessary here.
   151  This might be a good place to talk about core concepts and how they relate.
   152  -->
   153  
   154  ### Risks and Mitigations
   155  
   156  #### Too large objects
   157  
As the number of pending workloads can be arbitrarily large, there is a risk
that the status information about the workloads may exceed the etcd limit of
1.5Mi on object size.

Exceeding the etcd limit creates a risk that LocalQueue controller updates
fail.

In order to mitigate this risk we introduce the `MaxCount` configuration
parameter to limit the maximal number of pending workloads in the status.
Additionally, we limit the maximal value of the parameter to 4000; see
also [Choosing the limits and defaults for MaxCount](#choosing-the-limits-and-defaults-for-maxcount).
   169  
We should also note that large queue objects might be problematic for the
Kubernetes API server even if the etcd limit is not exceeded, for example,
when there are many LocalQueue instances with watches, because in that case
the entire LocalQueue objects need to be sent through the watch channels.
   174  
   175  To mitigate this risk we also extend the Kueue's user-facing documentation to
   176  warn about setting this number high on clusters with many LocalQueue instances,
   177  especially, when watches on the objects are used.
   178  
   179  #### Status updates for pending workloads slowing down other operations
   180  
   181  The operation of computing and updating the list of top pending workloads can
   182  have a degrading impact on the overall performance of other Kueue operations.
   183  
This risk exists because the operation requires iterating over the contents of
the ClusterQueue, which requires a read lock on the queue. Also, positional
changes to the list of pending workloads may require frequent updates if we
attempt to keep the information up-to-date.
   188  
In order to mitigate the risk we maintain the statuses on a best-effort basis,
and issue at most one update request per configured interval;
see [throttling of status updates](#throttling-of-status-updates).

Additionally, we periodically take an in-memory snapshot of the ClusterQueue to
allow generating the status with `MaxCount` elements for LocalQueues and
ClusterQueues without holding the read lock for a prolonged time; see
[In-memory snapshot of the ClusterQueue](#in-memory-snapshot-of-the-clusterqueue).
   197  
   198  #### Large number of API requests triggered after workload admissions
   199  
In a scenario where multiple LocalQueues point to the same ClusterQueue, a
workload that is admitted in one LocalQueue shifts the positions of pending
workloads in the other LocalQueues. In the worst case, updating the LocalQueue
statuses with the new positions requires as many API requests as there are
LocalQueues. In particular, sending over 100 requests after a workload
admission would degrade Kueue performance.
   206  
   207  First, we propose to batch the LocalQueue updates by time intervals. This helps
   208  to avoid sending API requests per LocalQueue if the positions are shifted
   209  multiple times in a short period of time.
   210  
Second, we introduce the `MaxPosition` configuration parameter. With this
parameter, the number of LocalQueues requiring an update can be controlled,
because only LocalQueues with workloads at the top positions require an update.

Finally, setting the `MaxCount` parameter for LocalQueues to 0 stops
visibility updates to LocalQueues entirely.
   217  
   218  <!--
   219  What are the risks of this proposal, and how do we mitigate? Think broadly.
   220  For example, consider both security and how this will impact the larger
   221  Kubernetes ecosystem.
   222  
   223  How will security be reviewed, and by whom?
   224  
   225  How will UX be reviewed, and by whom?
   226  
   227  Consider including folks who also work outside the SIG or subproject.
   228  -->
   229  
   230  ## Design Details
   231  
The status APIs of LocalQueue and ClusterQueue are extended with structures
which contain the list of pending workloads. In the case of the LocalQueue,
the workload's position in the ClusterQueue is also exposed.

Updates to the structures are throttled, allowing for at most one update within
a configured interval. Additionally, we periodically take an in-memory snapshot
of the ClusterQueue.
   239  
   240  ### Local Queue API
   241  
   242  ```golang
   243  // LocalQueuePendingWorkload contains the information identifying a pending
   244  // workload in the local queue.
   245  type LocalQueuePendingWorkload struct {
   246  	// Name indicates the name of the pending workload.
   247  	Name string
   248  
   249  	// Position indicates the position of the workload in the cluster queue.
   250  	Position *int32
   251  }
   252  
   253  type LocalQueuePendingWorkloadsStatus struct {
   254  	// Head contains the list of top pending workloads.
   255  	// +listType=map
   256  	// +listMapKey=name
   257  	// +optional
   258  	Head []LocalQueuePendingWorkload
   259  
   260  	// LastChangeTime indicates the time of the last change of the structure.
   261  	LastChangeTime metav1.Time
   262  }
   263  
   264  // LocalQueueStatus defines the observed state of LocalQueue
   265  type LocalQueueStatus struct {
   266  ...
   267  	// PendingWorkloadsStatus contains the information exposed about the current
   268  	// status of pending workloads in the local queue.
   269  	// +optional
   270  	PendingWorkloadsStatus *LocalQueuePendingWorkloadsStatus
   271  ...
   272  }
   273  ```
   274  
   275  ### Cluster Queue API
   276  
   277  ```golang
   278  // ClusterQueuePendingWorkload contains the information identifying a pending workload
   279  // in the cluster queue.
   280  type ClusterQueuePendingWorkload struct {
   281  	// Name indicates the name of the pending workload.
   282  	Name string
   283  
	// Namespace indicates the namespace of the pending workload.
   285  	Namespace string
   286  }
   287  
   288  type ClusterQueuePendingWorkloadsStatus struct {
   289  	// Head contains the list of top pending workloads.
   290  	// +listType=map
   291  	// +listMapKey=name
   292  	// +listMapKey=namespace
   293  	// +optional
   294  	Head []ClusterQueuePendingWorkload
   295  
   296  	// LastChangeTime indicates the time of the last change of the structure.
   297  	LastChangeTime metav1.Time
   298  }
   299  
// ClusterQueueStatus defines the observed state of ClusterQueue
   301  type ClusterQueueStatus struct {
   302  ...
   303  	// PendingWorkloadsStatus contains the information exposed about the current
   304  	// status of the pending workloads in the cluster queue.
   305  	// +optional
   306  	PendingWorkloadsStatus *ClusterQueuePendingWorkloadsStatus
   307  ...
   308  }
   309  ```
   310  
   311  ### Configuration API
   312  
   313  ```golang
   314  // Configuration is the Schema for the kueueconfigurations API
   315  type Configuration struct {
   316  ...
   317  	// QueueVisibility is configuration to expose the information about the top
   318  	// pending workloads.
   319  	QueueVisibility *QueueVisibility
   320  }
   321  
   322  type QueueVisibility struct {
   323  	// LocalQueues is configuration to expose the information
   324  	// about the top pending workloads in the local queue.
   325  	LocalQueues *LocalQueueVisibility
   326  
   327  	// ClusterQueues is configuration to expose the information
   328  	// about the top pending workloads in the cluster queue.
   329  	ClusterQueues *ClusterQueueVisibility
   330  
   331  	// UpdateInterval specifies the time interval for updates to the structure
   332  	// of the top pending workloads in the queues.
   333  	// Defaults to 5s.
   334  	UpdateInterval time.Duration
   335  }
   336  
   337  type LocalQueueVisibility struct {
   338  	// MaxCount indicates the maximal number of pending workloads exposed in the
   339  	// local queue status. When the value is set to 0, then LocalQueue visibility
   340  	// updates are disabled.
   341  	// The maximal value is 4000.
   342  	// Defaults to 10.
   343  	MaxCount int32
   344  
   345  	// MaxPosition indicates the maximal position of the workload in the cluster
   346  	// queue returned in the head.
   347  	MaxPosition *int32
   348  }
   349  
   350  type ClusterQueueVisibility struct {
	// MaxCount indicates the maximal number of pending workloads exposed in the
	// cluster queue status. When the value is set to 0, then ClusterQueue
	// visibility updates are disabled.
   354  	// The maximal value is 4000.
   355  	// Defaults to 10.
   356  	MaxCount int32
   357  }
   358  ```
   359  
   360  ### In-memory snapshot of the ClusterQueue
   361  
In order to quickly compute the top pending workloads per LocalQueue, without
the need for a prolonged read lock on the ClusterQueue, we periodically create
an in-memory snapshot of the ClusterQueue, organized as a map from the
LocalQueue to the list of workloads belonging to the ClusterQueue, along with
their positions. The LocalQueue and ClusterQueue controllers then do lookups
into the cached structure.
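
A minimal sketch of such a snapshot structure (the type and field names here
are illustrative assumptions, not Kueue's actual implementation):

```golang
// Snapshot is an illustrative, read-only view of a ClusterQueue's pending
// workloads, keyed by LocalQueue name. It is rebuilt periodically so that
// controllers can look up positions without holding the ClusterQueue lock.
type Snapshot struct {
	// PendingByLocalQueue maps a LocalQueue name to its pending workloads,
	// ordered by, and annotated with, their position in the ClusterQueue.
	PendingByLocalQueue map[string][]PendingEntry
}

// PendingEntry identifies one pending workload and its queue position.
type PendingEntry struct {
	Name     string
	Position int32 // 0-based position in the ClusterQueue
}

// TopN returns up to max entries for the given LocalQueue; the slice is
// assumed to be ordered by Position already.
func (s *Snapshot) TopN(localQueue string, max int) []PendingEntry {
	entries := s.PendingByLocalQueue[localQueue]
	if len(entries) > max {
		entries = entries[:max]
	}
	return entries
}
```

The LocalQueue controller would then serve its `Head` by calling something
like `TopN(name, maxCount)` against the latest snapshot.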
   368  
The snapshots are taken periodically, per ClusterQueue, by multiple workers
processing a queue of snapshot-taking tasks. The tasks are re-enqueued to the
queue with a `QueueVisibility.UpdateInterval` delay just after the previous
snapshot is taken, for as long as the given ClusterQueue exists.
   373  
The model of using snapshot workers allows controlling the number of snapshot
updates after Kueue startup, and thus the cascade of ClusterQueue updates. The
number of workers is 5.

Note that taking the snapshot requires holding the ClusterQueue read lock
only for the duration of copying the underlying heap data.
   380  
   381  When `MaxCount` for both LocalQueues and ClusterQueues is 0, then the feature
   382  is disabled, and the snapshot is not computed.
   383  
   384  ### Throttling of status updates
   385  
   386  The updates to the structure of top pending workloads for LocalQueue (or
   387  ClusterQueue) are managed by the LocalQueue controller (or ClusterQueue controller)
   388  and are part of regular status updates of the queue.
   389  
   390  The updates to the structure of the pending workloads are generated based on the
   391  periodically taken snapshot.
   392  
In particular, when a LocalQueue reconciles and `LastChangeTime` indicates
that `QueueVisibility.UpdateInterval` has elapsed, we generate the new structure
based on the snapshot. If there is a change to the structure, `LastChangeTime`
is bumped and the update request is sent. If there is no change to the
structure, the controller enqueues another reconciliation for when the snapshot
is regenerated.
   399  
   400  ### Choosing the limits and defaults for MaxCount
   401  
   402  One constraining factor for the default for `MaxCount` is the maximal object
   403  size for etcd, see [Too large objects](#too-large-objects).
   404  
A similar consideration was done for the [Backoff Limit Per Index](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs#the-job-object-too-big)
feature, where we set the parameter limits to constrain the worst-case size of
the object to around 500Ki. Such an approach stays relatively far from the
1.5Mi limit and allows for future extensions of the structures.

Following this approach, in the case of Kueue we limit the `MaxCount`
parameter to `4000` for ClusterQueues and LocalQueues. This translates to
around `4000*63*2=0.48Mi` for ClusterQueues, and `4000*(63+4)=0.26Mi` for
LocalQueues.
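
The worst-case numbers above can be reproduced with a short calculation,
where 63 is the maximal length of a name or namespace, and 4 bytes approximate
the int32 position stored per LocalQueue entry:

```golang
const (
	maxCount    = 4000 // proposed limit for MaxCount
	maxNameLen  = 63   // maximal length of a workload name or namespace
	positionLen = 4    // int32 position stored per LocalQueue entry
)

// ClusterQueue entries carry a name and a namespace each.
const clusterQueueBytes = maxCount * maxNameLen * 2 // 504000B, about 0.48Mi

// LocalQueue entries carry a name and a position each.
const localQueueBytes = maxCount * (maxNameLen + positionLen) // 268000B, about 0.26Mi
```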
   414  
The defaults are tuned for lower-scale usage in order to minimize the risk of
issues when upgrading Kueue, as the feature is going to be enabled by default.
For comparison, Backoff Limit Per Index is opted into per Job, so the
consequences of issues are smaller than when a feature is enabled for all
workloads.

Similarly, we default the `MaxPosition` configuration parameter for LocalQueues
to `10`. This parameter allows controlling the number of LocalQueues which are
updated after a workload admission (see also:
[Large number of API requests triggered after workload admissions](#large-number-of-api-requests-triggered-after-workload-admissions)).
   425  
   426  Enabling the feature by default will allow more users to discover the feature.
   427  Then, based on their needs and setup they can increase the `MaxCount` and
   428  `MaxPosition` parameters.
   429  
   430  ### Limitation of the approach
   431  
We acknowledge the limitation of the proposed approach that only the top N
workloads are exposed. This might be problematic for some large-scale setups.
   434  
   435  This means that the feature may be superseded by one of the
   436  [Alternative approaches](#alternative-approaches) in the future, and potentially
   437  be deprecated.
   438  
   439  Still, we believe it makes sense to proceed with the proposed approach as it is
   440  relatively simple to implement, and will already start providing value to
   441  the Kueue users with relatively small setups.
   442  
Finally, the proposed solution is likely to co-exist with another alternative,
because it remains advantageous at smaller scale. Moreover, the internal code
extensions, such as the in-memory snapshot of the ClusterQueue, are likely to
be reused as building blocks for other approaches.
   447  
   448  <!--
   449  This section should contain enough information that the specifics of your
   450  change are understandable. This may include API specs (though not always
   451  required) or even code snippets. If there's any ambiguity about HOW your
   452  proposal will be implemented, this is the place to discuss them.
   453  -->
   454  
   455  ### Test Plan
   456  
   457  <!--
   458  **Note:** *Not required until targeted at a release.*
   459  The goal is to ensure that we don't accept enhancements with inadequate testing.
   460  
   461  All code is expected to have adequate tests (eventually with coverage
   462  expectations). Please adhere to the [Kubernetes testing guidelines][testing-guidelines]
   463  when drafting this test plan.
   464  
   465  [testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md
   466  -->
   467  
   468  [x] I/we understand the owners of the involved components may require updates to
   469  existing tests to make this code solid enough prior to committing the changes necessary
   470  to implement this enhancement.
   471  
   472  ##### Prerequisite testing updates
   473  
   474  <!--
   475  Based on reviewers feedback describe what additional tests need to be added prior
   476  implementing this enhancement to ensure the enhancements have also solid foundations.
   477  -->
   478  
   479  #### Unit Tests
   480  
   481  <!--
   482  In principle every added code should have complete unit test coverage, so providing
   483  the exact set of tests will not bring additional value.
   484  However, if complete unit test coverage is not possible, explain the reason of it
   485  together with explanation why this is acceptable.
   486  -->
   487  
   488  <!--
   489  Additionally, try to enumerate the core package you will be touching
   490  to implement this enhancement and provide the current unit coverage for those
   491  in the form of:
   492  - <package>: <date> - <current test coverage>
   493  
   494  This can inform certain test coverage improvements that we want to do before
   495  extending the production code to implement this enhancement.
   496  -->
   497  
   498  - `<package>`: `<date>` - `<test coverage>`
   499  
   500  #### Integration tests
   501  
The integration tests will cover the following scenarios:
- the LocalQueue status is updated when a workload in this LocalQueue is added,
  preempted or admitted,
- the addition of a workload to one LocalQueue triggers an update of the
  structure in another LocalQueue connected to the same ClusterQueue,
- changes of workload positions beyond the configured threshold for top
  pending workloads don't trigger an update of the pending workloads status.
   509  
   510  <!--
   511  Describe what tests will be added to ensure proper quality of the enhancement.
   512  
   513  After the implementation PR is merged, add the names of the tests here.
   514  -->
   515  
   516  ### Graduation Criteria
   517  
   518  #### Beta
   519  
   520  First iteration (0.5):
   521  
   522  - support visibility for ClusterQueues
   523  
   524  Second iteration (0.6):
   525  
- support visibility for LocalQueues, but without positions, to avoid the
  complication of mitigating the risk [Large number of API requests triggered after workload admissions](#large-number-of-api-requests-triggered-after-workload-admissions)
   528  
   529  Third iteration (0.7):
   530  
   531  - reevaluate the need for exposing positions and support if needed
   532  
   533  #### Stable
   534  
   535  - drop the feature gate
   536  
   537  <!--
   538  
   539  Clearly define what it means for the feature to be implemented and
   540  considered stable.
   541  
   542  If the feature you are introducing has high complexity, consider adding graduation
   543  milestones with these graduation criteria:
   544  - [Maturity levels (`alpha`, `beta`, `stable`)][maturity-levels]
   545  - [Feature gate][feature gate] lifecycle
   546  - [Deprecation policy][deprecation-policy]
   547  
   548  [feature gate]: https://git.k8s.io/community/contributors/devel/sig-architecture/feature-gates.md
   549  [maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions
   550  [deprecation-policy]: https://kubernetes.io/docs/reference/using-api/deprecation-policy/
   551  -->
   552  
   553  ## Implementation History
   554  
   555  <!--
   556  Major milestones in the lifecycle of a KEP should be tracked in this section.
   557  Major milestones might include:
   558  - the `Summary` and `Motivation` sections being merged, signaling SIG acceptance
   559  - the `Proposal` section being merged, signaling agreement on a proposed design
   560  - the date implementation started
   561  - the first Kubernetes release where an initial version of the KEP was available
   562  - the version of Kubernetes where the KEP graduated to general availability
   563  - when the KEP was retired or superseded
   564  -->
   565  
   566  ## Drawbacks
   567  
   568  <!--
   569  Why should this KEP _not_ be implemented?
   570  -->
   571  
   572  ## Alternatives
   573  
   574  ### Alternative approaches
   575  
The alternatives are designed to address the limitation on the maximal number
of pending workloads which is returned in the status.
   578  
   579  #### Coarse-grained ordering information per workload in workload status
   580  
   581  The idea is to distribute the ordering information among workloads to avoid
   582  keeping the ordering information centralized, thus avoiding creating objects
   583  constrained by the etcd limit.
   584  
The main complication with distributing the ordering information is that a
workload admission, or a new workload with a high priority, can shift the
entire ordering, warranting update requests to all workloads in the queue. This
could mean cascades of thousands of requests after such an event.

The proposal to control the number of update requests to workloads, when a
workload is admitted or added, is to bucket workload positions. The bucket
intervals could grow exponentially, allowing for a logarithmic number of
requests. With this approach, the number of requests needed to update workloads
is limited by the number of buckets, as only the workloads on bucket boundaries
are updated.
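
Such exponential bucketing could look as follows (an illustrative sketch, not
part of the proposal):

```golang
// bucketOf returns the base-2 exponential bucket for a 1-based queue
// position: position 1 maps to bucket 0, positions 2-3 to bucket 1,
// 4-7 to bucket 2, and so on. A workload's status would only need an
// update when its position crosses a bucket boundary.
func bucketOf(position int) int {
	bucket := 0
	for position > 1 {
		position /= 2
		bucket++
	}
	return bucket
}
```

With 1000 pending workloads this yields buckets 0 through 9, matching the
roughly 10 update requests mentioned below.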
   595  
   596  The update requests could be sent by a periodic routine which iterates over the
   597  cluster queue and triggers workload reconciliation for workloads for which the
   598  ordering is changed.
   599  
Pros:
- allows exposing the ordering information for all workloads, guaranteeing that
  users know their workload's position even if it is beyond the top N threshold
  of the proposed approach.

Cons:
- it requires a substantial number of requests when a workload is admitted, or
  a high-priority workload is inserted. For example, assuming 1000 workloads
  and exponential bucketing with base 2, this is 10 requests.
- it is not clear if the coarse-grained information would satisfy user
  expectations. For example, a user may need to wait long to observe the
  reduction of a bucket.
- an external system which wants to display a pipeline of workloads needs to
  fetch all workloads. Similarly, a system which wants to list the top 10
  workloads may need to query all workloads.
- a natural extension of the mechanism to return an ETA in the workload status
  may also increase the number of requests in a less controlled way.
   617  
   618  #### Ordering information per workload in events or metrics
   619  
The motivation for this approach is similar to that of distributing the
information in workload statuses. However, it builds on the assumption that
update requests are more costly than events or metric updates. For example,
sending events or updating metrics does not trigger a workload reconciliation.
   624  
   625  Pros:
- more lightweight than updating the workload status.

Cons:
- an API based on events or metrics would be less convenient for end users than
  an object-based one.
- it probably still requires bucketing, thus inheriting the usability cons
  related to bucketing from the workload status approach.
   633  
   634  #### On-demand http endpoint
   635  
The idea is that Kueue exposes an endpoint which allows fetching the ordering
information for all pending workloads, or for selected workloads.
   638  
   639  Pros:
- eliminates wasting QPS on updating Kubernetes objects
   641  
   642  Cons:
- the API would lack API server features, such as watches, Priority and
  Fairness throttling, and load-balancing. Also, ensuring the security of the
  new endpoint might be more involved, making it technically challenging.
   646  
One possible way to deal with the security concern of the
[On-demand http endpoint](#on-demand-http-endpoint) is to use an
   649  [Extension API Server](https://kubernetes.io/docs/tasks/extend-kubernetes/setup-extension-api-server/),
   650  exposed via
   651  [API Aggregation Layer](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/apiserver-aggregation/).
   652  Then, the aggregation layer could take the responsibility of authenticating and
   653  authorizing the requests.
   654  
   655  ### Alternatives within the proposal
   656  
Here are some alternatives considered for smaller problems within the scope of
the proposal.
   659  
   660  #### Unlimited MaxCount parameter
   661  
The `MaxCount` parameter constrains the maximal size of the ClusterQueue and
LocalQueue statuses to ensure that the etcd object size limit is not exceeded;
see [Too large objects](#too-large-objects).

The actual maximal number might depend on the lengths of the namespace and
workload names. Such names will typically be far from the maximum. In
particular, the namespaces might be created based on team names, which may have
an internal policy of not exceeding, say, 100 characters. In that case, the
estimation would be too constraining. We propose to add a soft warning when
2000 is exceeded, and to warn in the documentation.
   672  
   673  **Reasons for discarding/deferring**
   674  
Setting hard limits for the parameters prevents users from crashing their
systems. We will re-evaluate the decision based on user feedback. One
alternative is to make the limit soft rather than hard. Another is to implement
and support an alternative solution for large-scale usage.
   679  
   680  #### Expose the pending workloads only for LocalQueues
   681  
It was proposed that, for administrators with full access to the cluster, we
could have alternative approaches which don't involve the status of the
ClusterQueue.
   685  
   686  **Reasons for discarding/deferring**
   687  
The solution proposed for LocalQueues is easy to transfer to ClusterQueues.
Developing another approach focused just on admins might be problematic.
   690  
   691  #### Do not expose ClusterQueue positions in LocalQueues
   692  
It was proposed that, without exposing the positions in the ClusterQueue, we
don't need to update LocalQueues when workloads from another LocalQueue are
admitted or added. Additionally, positional information does not reveal much
about the actual time until the workloads are admitted, as the other workloads
might be small or big.
   698  
   699  **Reasons for discarding/deferring**
   700  
First, knowing the positional information gives some hints about the expected
arrival time, especially as users of the system gain experience about the
velocity of the ClusterQueue. In particular, it could be estimated, based on
historical data, that 10 workloads are admitted every 1h. It already makes a
difference whether a user knows that their workload is positioned 1 or 100.

With the throttling of updates to the list of pending workloads, changes in
positional information will not trigger too many status updates.

Also, even without positional information it is possible that an update is
needed, because while one workload is admitted another one is added. Such
situations would require additional updates anyway, so we should introduce
some throttling mechanism for updates regardless.
   715  
   716  #### Use self-balancing search trees for ClusterQueue representation
   717  
Self-balancing search trees could be used to quickly provide the list of top
workloads in the ClusterQueue.
   720  
   721  **Reasons for discarding/deferring**
   722  
   723  It does not solve the issue of exposing the information for LocalQueues. If
   724  we have many (or just multiple) LocalQueues pointing to the same ClusterQueue,
   725  each of them would need to take a read lock for the iteration, and potentially
   726  iterate over the entire ClusterQueue.
   727  
   728  <!--
   729  What other approaches did you consider, and why did you rule them out? These do
   730  not need to be as detailed as the proposal, but should include enough
   731  information to express the idea and why it was not acceptable.
   732  -->