# KEP-168-2: Pending-workloads-visibility

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
  - [Cons of the current solution](#cons-of-the-current-solution)
    - [Size of the queue](#size-of-the-queue)
    - [Consistency across all LocalQueues](#consistency-across-all-localqueues)
    - [Expanding API in the future](#expanding-api-in-the-future)
    - [Delay](#delay)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories](#user-stories)
    - [Story 1](#story-1)
    - [Story 2](#story-2)
    - [Story 3](#story-3)
  - [Risks and Mitigations](#risks-and-mitigations)
    - [DDoS](#ddos)
    - [Payload size](#payload-size)
- [Design Details](#design-details)
  - [API Details](#api-details)
  - [API endpoints:](#api-endpoints)
    - [List pending workloads in ClusterQueue](#list-pending-workloads-in-clusterqueue)
    - [List pending workloads in LocalQueue](#list-pending-workloads-in-localqueue)
  - [API Objects:](#api-objects)
  - [Future extensions](#future-extensions)
  - [Test Plan](#test-plan)
    - [Overview](#overview)
    - [Unit Tests](#unit-tests)
    - [Integration tests](#integration-tests)
    - [E2E tests](#e2e-tests)
  - [Graduation Criteria](#graduation-criteria)
    - [GA](#ga)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
  - [Alternative approaches](#alternative-approaches)
    - [Approach described in <a href="https://github.com/kubernetes-sigs/kueue/tree/main/keps/168-pending-workloads-visibility">KEP#168</a>](#approach-described-in-kep168)
    - [Extend API using CRDs](#extend-api-using-crds)
  - [Alternatives within the proposal](#alternatives-within-the-proposal)
    - [apiserver-builder library](#apiserver-builder-library)
<!-- /toc -->

## Summary

This KEP proposes a new API that lets users fetch, on demand, information about pending workloads in both ClusterQueues and LocalQueues. Users will be able to look up the position of a specific workload in both types of queues and list the pending workloads in a specific queue.

## Motivation

As presented in [KEP#168](https://github.com/kubernetes-sigs/kueue/tree/main/keps/168-pending-workloads-visibility), there is currently a proposal for a mechanism that supports fetching the order of pending workloads, but it has several drawbacks. This proposal addresses all of them.

### Cons of the current solution

#### Size of the queue

There are a few scalability concerns. The first one is that the number of fetched pending workloads is limited by the etcd object size limit. By default, a user is able to fetch only the 10 workloads at the head of a queue. This number can be increased up to 4000, but at a performance cost.

#### Consistency across all LocalQueues

Another scalability drawback is that a Kueue setup with many LocalQueues is very likely to hit Kueue's QPS limit. In a setup where multiple LocalQueues point to the same ClusterQueue, Kueue needs to send updates to all of those LocalQueues to update their status and keep the workload positional information up-to-date. This consumes `QPS`, which can block other requests. Although we could use a client separate from the default one, it would not completely resolve the scalability issues.

#### Expanding API in the future

Moreover, there are some functional issues with the current approach. It does not expose any information about pending workloads except for `name`, `namespace`, and position in a queue (by listing workloads in order). Adding new fields would reduce the maximum number of pending workloads that can be fetched, because of the etcd object size limit.

#### Delay

Additionally, in the previous proposal, Kueue updates the most prioritized workloads every 5 seconds. The interval is configurable, but since computing the most prioritized workloads can be expensive, it cannot be significantly reduced. As a result, users can observe outdated information, which might be inconvenient.

### Goals

- Support listing pending workloads at positions X to Y in a ClusterQueue, no matter the size of the queue, and without delay,
- Support listing pending workloads at positions X to Y in a LocalQueue, no matter the size of the queue, and without delay,
- Provide consistent data across all the LocalQueues without hitting `QPS` limits.

### Non-Goals

- Provide ETA (Estimated Time of Arrival) for a workload,
- Provide information on whether a workload is admissible,
- Provide information about the requested resources for a workload.

## Proposal

Add a new API that exposes the information about pending workloads that is relevant to their position in the queue, along with the position itself. There are two such endpoints:
1. List the pending workloads in a ClusterQueue,
2. List the pending workloads in a LocalQueue.

To expose the API endpoints, we introduce a new Extension API server.

### User Stories

#### Story 1

As a user of Kueue with LocalQueue visibility only, I would like to know the position in the ClusterQueue of a workload that I've just submitted, no matter how big the queue is. Knowing the position, and assuming stable velocity in the ClusterQueue, would allow me to estimate the arrival time of my workload.

Provided by the [LocalQueue endpoint](#list-pending-workloads-in-localqueue).

#### Story 2

As an administrator of Kueue with ClusterQueue visibility, I would like to be able to directly check and compare the positions of pending workloads in the queue, no matter its size. It is important that data across all LocalQueues is consistent, and that no two workloads have the same position in the ClusterQueue. This will help me answer users' questions about their workloads.

Provided by the [ClusterQueue endpoint](#list-pending-workloads-in-clusterqueue).

#### Story 3

As a developer who uses Kueue, I would like to be able to monitor the state of my ClusterQueue/LocalQueue using dashboards. I need a mechanism that allows me to build them easily.

Provided by the [ClusterQueue endpoint](#list-pending-workloads-in-clusterqueue) and the [LocalQueue endpoint](#list-pending-workloads-in-localqueue).

### Risks and Mitigations

#### DDoS

One risk we foresee is that the server may be exposed to DDoS attacks. A potential attacker may flood the server with requests, which would result in constantly locking the Kueue Manager. To mitigate this risk, we plan to rely on throttling, so that even under numerous requests the Kueue Manager remains functional. The first approach we propose is to use the [API server Priority and Fairness (P&F) mechanism](https://kubernetes.io/docs/concepts/cluster-administration/flow-control/). Additionally, based on user feedback, we may consider an additional caching mechanism inside Kueue.

#### Payload size

Another risk we took into account is that the payload could become too large with 100k pending workloads. However, even in the worst case (which we consider rather unrealistic, since it would mean every string field is filled with 256 characters), a single pending workload entry takes about 1.4 kB. With 100k pending workloads, the response takes 140 MB, which is still reasonable compared to `metrics-server` payloads. Hence, we believe it should not be a concern. This is also mitigated by the [query parameters](#api-details) introduced below.

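The arithmetic behind these numbers can be reproduced with a minimal sketch. The 1.4 kB per-entry figure is the worst-case estimate quoted above, not a measurement, and the function name is hypothetical.

```go
package main

import "fmt"

// bytesPerEntry is the worst-case size of one serialized pending workload
// quoted in the text (about 1.4 kB); it is an estimate, not a measurement.
const bytesPerEntry = 1400

// estimatePayloadBytes returns the worst-case response size for n entries.
func estimatePayloadBytes(n int) int {
	return n * bytesPerEntry
}

func main() {
	fmt.Println(estimatePayloadBytes(100_000)) // whole queue of 100k workloads
	fmt.Println(estimatePayloadBytes(1_000))   // one page at the default limit of 1000
}
```

Under the same assumption, a single page at the default `limit` of 1000 stays around 1.4 MB.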
## Design Details

The proposal introduces a new server running in the Kueue pod. It computes responses from the current state of the Kueue Manager without any additional request overhead. The server uses the [K8s API Aggregation Layer](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/apiserver-aggregation/) mechanism. The same mechanism is used by the [metrics-server](https://github.com/kubernetes-sigs/metrics-server). There will be no additional etcd objects and no need to use existing ones. No additional requests to sync information across LocalQueues will be required.

Similarly to the `metrics-server`, the server will be implemented with the [apiserver library](https://github.com/kubernetes/apiserver), which provides authentication and authorization.

All computation will be done on demand, without additional reconcile loops.

The server guarantees that:
- Pending workloads are returned according to their actual status without significant delay. This includes adding and removing/admitting new workloads with various priorities,
- Adding workloads to one LocalQueue results in position changes for workloads submitted to other LocalQueues,
- Data is consistent across all LocalQueues,
- A user with only LocalQueue visibility cannot access the list of pending workloads for a ClusterQueue.

### API Details

We introduce a new API that will extend the existing one.

There will be separate endpoints exposing the information about pending workloads for LocalQueues and ClusterQueues. Each endpoint exposes information about a pending workload, such as:
- workload's position in a ClusterQueue,
- workload's position in a LocalQueue,
- workload's priority,
- creation timestamp.

The API does not allow for the modification of any objects.

Regular users will have access only to the LocalQueues they are assigned to. However, they will be able to fetch information about the global position of a workload in a ClusterQueue, without any details about workloads in different LocalQueues.

Administrators will have access to all the data at the ClusterQueue level. They will be able to view all the workloads, no matter which LocalQueues the workloads are assigned to.

The API also allows users to fetch information about a part of the Cluster/LocalQueue, from position X to Y. There are two query parameters to do so:
- `offset` indicates the position of the first fetched workload - default: `0`
- `limit` indicates the maximum number of workloads to be fetched - default: `1000`

Thanks to these parameters, our server also supports pagination.

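As a minimal sketch of the `offset`/`limit` semantics described above, the following function selects one page from an already ordered list. The function name and the choice to return an empty page for out-of-range values (rather than an error) are assumptions of this example, not behaviour stated by the proposal.

```go
package main

import "fmt"

// paginate returns at most limit items starting at the zero-based offset.
// Negative or out-of-range values yield an empty page; that behaviour is an
// assumption of this sketch.
func paginate(items []string, offset, limit int) []string {
	if offset < 0 || limit <= 0 || offset >= len(items) {
		return []string{}
	}
	end := len(items)
	if limit < end-offset {
		end = offset + limit
	}
	return items[offset:end]
}

func main() {
	queue := []string{"wl-0", "wl-1", "wl-2", "wl-3", "wl-4"}
	fmt.Println(paginate(queue, 0, 2)) // first page: [wl-0 wl-1]
	fmt.Println(paginate(queue, 4, 2)) // last, partial page: [wl-4]
	fmt.Println(paginate(queue, 9, 2)) // past the end: []
}
```

A client can walk the whole queue by increasing `offset` by `limit` until it receives a page shorter than `limit`.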
### API endpoints:

We introduce a new API group `visibility.kueue.x-k8s.io` that aggregates the following endpoints:

#### List pending workloads in ClusterQueue

```
GET /apis/visibility.kueue.x-k8s.io/VERSION/clusterqueues/CQ_NAME/pendingworkloads?offset=0&limit=1000
```

#### List pending workloads in LocalQueue

```
GET /apis/visibility.kueue.x-k8s.io/VERSION/namespaces/LQ_NAMESPACE/localqueues/LQ_NAME/pendingworkloads?offset=0&limit=1000
```

These endpoints can be accessed using the `kubectl get --raw <ENDPOINT_PATH>` command.

Another way to access the API is to use a client generated with `k8s.io/code-generator`, similarly to the core Kueue API.

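For scripting against the raw endpoints, the paths can be assembled programmatically. The sketch below uses only the Go standard library. The helper names are hypothetical, and the `v1alpha1` version string and queue names in `main` are placeholders, since the text leaves `VERSION` unspecified.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// group is the API group named in this proposal.
const group = "visibility.kueue.x-k8s.io"

// pageQuery encodes the offset and limit query parameters.
// url.Values.Encode sorts keys, so limit is emitted before offset.
func pageQuery(offset, limit int) string {
	q := url.Values{}
	q.Set("offset", strconv.Itoa(offset))
	q.Set("limit", strconv.Itoa(limit))
	return q.Encode()
}

// clusterQueuePath builds the path listing pending workloads in a ClusterQueue.
func clusterQueuePath(version, cq string, offset, limit int) string {
	return fmt.Sprintf("/apis/%s/%s/clusterqueues/%s/pendingworkloads?%s",
		group, version, url.PathEscape(cq), pageQuery(offset, limit))
}

// localQueuePath builds the path listing pending workloads in a LocalQueue.
func localQueuePath(version, ns, lq string, offset, limit int) string {
	return fmt.Sprintf("/apis/%s/%s/namespaces/%s/localqueues/%s/pendingworkloads?%s",
		group, version, url.PathEscape(ns), url.PathEscape(lq), pageQuery(offset, limit))
}

func main() {
	// "v1alpha1" and the queue names are placeholders for illustration only.
	fmt.Println(clusterQueuePath("v1alpha1", "cluster-queue", 0, 1000))
	fmt.Println(localQueuePath("v1alpha1", "default", "user-queue", 0, 1000))
}
```

Each printed path can be passed as `<ENDPOINT_PATH>` to `kubectl get --raw`.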
### API Objects:

```
// PendingWorkload is a user-facing representation of a pending workload in both
// LocalQueues and ClusterQueues that summarizes the information necessary from
// the admission order perspective.
type PendingWorkload struct {
	TypeMeta   TypeMeta
	ObjectMeta ObjectMeta

	// LocalQueueName is the name of the LocalQueue the workload was submitted to.
	LocalQueueName string
	// PositionInClusterQueue is the workload's position in the ClusterQueue.
	PositionInClusterQueue int32
	// PositionInLocalQueue is the workload's position in its LocalQueue.
	PositionInLocalQueue int32
	// Priority is the workload's priority.
	Priority int32
}

// PendingWorkloadsSummary contains a list of pending workloads in the context
// of the query (within LocalQueue or ClusterQueue).
type PendingWorkloadsSummary struct {
	TypeMeta TypeMeta
	ListMeta ListMeta

	Items []PendingWorkload
}
```

A user can easily identify the Job that owns a pending workload. To enable this, the API uses the `OwnerReferences` field of `ObjectMeta` to indicate the owner, typically the Job created by the user.

### Future extensions

The introduced API uses the subresource mechanism. This means that in the future it can easily be extended with additional endpoints, e.g. for admitted workloads. Such an endpoint could potentially look like this:

```
GET /apis/visibility.kueue.x-k8s.io/VERSION/clusterqueues/CQ_NAME/admitted_workloads?offset=0&limit=1000
```

### Test Plan

[X] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

#### Overview

Our main focus is integration tests, as most of the added code is responsible for integrating with Kueue and RBAC roles.

#### Unit Tests

We plan on adding unit tests that cover getting a list of pending workloads at the Kueue Manager level.

- `pkg/visibility`: `30 Oct 2023` - `0%`

#### Integration tests

Integration tests should check that our server works correctly according to the assumptions we mentioned:
- Pending workloads are returned according to their actual status without delay,
- Adding workloads to one LocalQueue results in position changes for workloads submitted to other LocalQueues,
- Data is consistent across all LocalQueues,
- A user with only LocalQueue visibility cannot access the list of pending workloads for a ClusterQueue.

#### E2E tests

We plan on adding sanity e2e tests and RBAC e2e tests. The e2e RBAC tests should cover the following scenarios:
- ClusterQueues can only be accessed by admin users,
- LocalQueues can only be accessed by users with visibility into the corresponding namespaces.

### Graduation Criteria

First iteration (0.6):
- Release the new API in alpha. This allows us to adjust the API according to users' and reviewers' feedback,
- Release it with a feature gate.

Second iteration (0.7):
- Release the API in beta and guarantee backwards compatibility,
- Reconsider introducing a throttling mechanism based on user and review feedback,
- Consider introducing FlowScheme and PriorityLevelConfiguration to allow admins to easily tune API priorities.

#### GA

The feature can graduate to GA after addressing feedback for at least 1 release. We will then drop the feature gate.

## Implementation History

## Drawbacks

## Alternatives

### Alternative approaches

#### Approach described in [KEP#168](https://github.com/kubernetes-sigs/kueue/tree/main/keps/168-pending-workloads-visibility)

Use the status fields of ClusterQueues and LocalQueues.

**Pros:**
- Partially already implemented

**Cons:**
- Described above and in the KEP

#### Extend API using CRDs

Extract information about the order of pending workloads to a separate CRD object.

**Pros:**
- Easy to set up

**Cons:**
- Does not address scalability concerns
- Does not provide the position in the queue for an arbitrary workload, if it's not at the head of the queue

### Alternatives within the proposal

#### apiserver-builder library

There is an alternative to the [apiserver library](https://github.com/kubernetes/apiserver) called [apiserver-builder](https://github.com/kubernetes-sigs/apiserver-builder-alpha). It seemed promising, as it could potentially speed up development. However, after researching this library we had concerns about its maintenance. The outdated dependencies and the lack of recent commits or pull requests indicated that the project might be abandoned. We contacted the last maintainer of the project, who confirmed that there is no planned effort to maintain it. He also confirmed our concern that, due to the old dependencies, there might be compatibility issues if we wanted to use the latest k8s release.

**Pros:**
- Faster development

**Cons:**
- Library is not maintained
- Possible compatibility issues due to old dependencies
   320  
   321  **Cons:**
   322  - Library is not maintained
   323  - Possible compatibility issues due to old dependencies