sigs.k8s.io/kueue@v0.6.2/keps/487-kubectl-plugin/README.md

     1  # KEP-487: Kubectl plugin for listing objects
     2  
     3  <!--
     4  This is the title of your KEP. Keep it short, simple, and descriptive. A good
     5  title can help communicate what the KEP is and should be considered as part of
     6  any review.
     7  -->
     8  
     9  <!--
    10  A table of contents is helpful for quickly jumping to sections of a KEP and for
    11  highlighting any additional information provided beyond the standard KEP
    12  template.
    13  
    14  Ensure the TOC is wrapped with
  <code>&lt;!-- toc --&gt;&lt;!-- /toc --&gt;</code>
    16  tags, and then generate with `hack/update-toc.sh`.
    17  -->
    18  
    19  <!-- toc -->
    20  - [Summary](#summary)
    21  - [Motivation](#motivation)
    22    - [Goals](#goals)
    23    - [Non-Goals](#non-goals)
    24  - [Proposal](#proposal)
    25    - [User Stories (Optional)](#user-stories-optional)
    26      - [Story 1](#story-1)
    27      - [Story 2](#story-2)
    28    - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
    29    - [Risks and Mitigations](#risks-and-mitigations)
    30  - [Design Details](#design-details)
    31    - [Summary](#summary-1)
    32    - [Version](#version)
    33    - [Listing Jobs](#listing-jobs)
    34    - [Listing Workloads](#listing-workloads)
    35    - [Describe Workloads](#describe-workloads)
    36    - [Watch Workload](#watch-workload)
    37    - [Queues](#queues)
    38    - [Possible command extensions](#possible-command-extensions)
    39      - [Cancel Workloads](#cancel-workloads)
    40      - [Submit](#submit)
    41      - [Update](#update)
    42    - [Resources](#resources)
    43    - [Test Plan](#test-plan)
    44        - [Prerequisite testing updates](#prerequisite-testing-updates)
    45      - [Unit Tests](#unit-tests)
    46      - [Integration tests](#integration-tests)
    47    - [Graduation Criteria](#graduation-criteria)
    48  - [Implementation History](#implementation-history)
    49  - [Drawbacks](#drawbacks)
    50  - [Alternatives](#alternatives)
    51  <!-- /toc -->
    52  
    53  ## Summary
    54  
We would like to develop a kubectl plugin for Kueue that serves as a command-line workload management tool.
It should be able to manage and list workloads and ClusterQueues with comprehensive information.
    57  
    58  
    59  <!--
    60  This section is incredibly important for producing high-quality, user-focused
    61  documentation such as release notes or a development roadmap. It should be
    62  possible to collect this information before implementation begins, in order to
    63  avoid requiring implementors to split their attention between writing release
    64  notes and implementing the feature itself. KEP editors and SIG Docs
    65  should help to ensure that the tone and content of the `Summary` section is
    66  useful for a wide audience.
    67  
    68  A good summary is probably at least a paragraph in length.
    69  
    70  Both in this section and below, follow the guidelines of the [documentation
    71  style guide]. In particular, wrap lines to a reasonable length, to make it
    72  easier for reviewers to cite specific portions, and to minimize diff churn on
    73  updates.
    74  
    75  [documentation style guide]: https://github.com/kubernetes/community/blob/master/contributors/guide/style-guide.md
    76  -->
    77  
    78  ## Motivation
    79  
The basic information provided by `kubectl get` does not include details such as whether a workload is admitted or why it is pending.
Because this information is spread across compound objects, retrieving it requires several different queries to the API.
Similarly, `kubectl` cannot tell the status of a ClusterQueue that is misconfigured, nor does it
provide details of the accompanying Workload when listing a job. We want to expose this information
to the user in a way that feels comfortable and familiar.
    85  
    86  
    87  <!--
    88  This section is for explicitly listing the motivation, goals, and non-goals of
    89  this KEP.  Describe why the change is important and the benefits to users. The
    90  motivation section can optionally provide links to [experience reports] to
    91  demonstrate the interest in a KEP within the wider Kubernetes community.
    92  
    93  [experience reports]: https://github.com/golang/go/wiki/ExperienceReports
    94  -->
    95  
    96  ### Goals
    97  
    98  A successful plugin should be able to answer the following questions:
    99  
1. What workloads are there in the user-queue?
2. What workloads are there in this specific namespace?
3. What workloads are there in this specific state?
4. Why is my workload pending?
5. Was there an error admitting my workload?
6. Is a ClusterQueue misconfigured, or is there some other issue? (This may be more of an admin command.)
7. All of the above, but instead of a table I want JSON/YAML.
   107  
As part of this scoped work, the plugin should also be added to the [krew plugins package manager](https://krew.sigs.k8s.io/plugins/).
   109  
   110  <!--
   111  List the specific goals of the KEP. What is it trying to achieve? How will we
   112  know that this has succeeded?
   113  -->
   114  
   115  ### Non-Goals
   116  
   117  - Serve as a duplicate implementation of information available via kubectl already
   118  - Streaming logs
   119  
   120  <!--
   121  What is out of scope for this KEP? Listing non-goals helps to focus discussion
   122  and make progress.
   123  -->
   124  
   125  ## Proposal
   126  
We propose creating a command-line tool that can serve as a kubectl plugin and exposes this missing information. We also propose a design that mimics the tools users already know from other workload managers across HPC and cloud, to ease adoption of both the tool and the approach of submitting workloads to Kubernetes. The main command-line interactions will take the following shape:
   128  
```bash
# As a kubectl plugin
kubectl kueue <options> <command> <args>

# As a standalone command-line tool
kueue <options> <command> <args>
```
   136  
For comparison, the Armada team developed a [general client](https://github.com/armadaproject/armada/tree/master/cmd/armadactl) that uses this strategy: the binary can be renamed and folded into kubectl. This is an ideal approach because it presents an interface to Kueue that feels more like a standalone job manager and does not force users to live entirely in Kubernetes abstractions.
   138  
   139  
   140  <!--
   141  This is where we get down to the specifics of what the proposal actually is.
   142  This should have enough detail that reviewers can understand exactly what
   143  you're proposing, but should not include things like API designs or
   144  implementation. What is the desired outcome and how do we measure success?.
   145  The "Design Details" section below is for the real
   146  nitty-gritty.
   147  -->
   148  
   149  ### User Stories (Optional)
   150  
   151  <!--
   152  Detail the things that people will be able to do if this KEP is implemented.
   153  Include as much detail as possible so that people can understand the "how" of
   154  the system. The goal here is to make this feel real for users without getting
   155  bogged down.
   156  -->
   157  
   158  #### Story 1
   159  
   160  As a user submitting workloads, I want to easily see the status of an entire Workload,
   161  or check on why my workload is not being admitted. I can do this with the new proposed plugin.
   162  
   163  ```bash
   164  kueue describe taco-123
   165  ```
   166  
   167  #### Story 2
   168  
   169  As a user coming from High Performance Computing (HPC), I am not comfortable with using
   170  `kubectl` and don't want to learn an entirely new means to interact with workloads.
   171  The proposed plugin makes the transition much easier for me. The example below
   172  compares the proposed list command for this plugin against popular HPC resource
   173  managers.
   174  
   175  ```bash
   176  # Kubernetes Kueue
   177  kueue workloads <queue-name>
   178  
   179  # SLURM
   180  squeue --partition <queue-name>
   181  
   182  # Flux Framework
   183  flux jobs -a --queue=<queue-name>
   184  
   185  # HTCondor
   186  condor_q
   187  ```
   188  
While this is just a sampling, the high-level idea is that HPC users
are comfortable with and accustomed to command-line tools for listing and otherwise
interacting with jobs. Having a similar tool for Kubernetes, and specifically for
submitting jobs, will help bridge the gap between the two communities.
   193  
   194  ### Notes/Constraints/Caveats (Optional)
   195  
   196  <!--
   197  What are the caveats to the proposal?
   198  What are some important details that didn't come across above?
   199  Go in to as much detail as necessary here.
   200  This might be a good place to talk about core concepts and how they relate.
   201  -->
   202  
   203  ### Risks and Mitigations
   204  
Keeping the plugin in sync with the latest version of Kueue will add
maintainer responsibilities.
   207  
   208  <!--
   209  What are the risks of this proposal, and how do we mitigate? Think broadly.
   210  For example, consider both security and how this will impact the larger
   211  Kubernetes ecosystem.
   212  
   213  How will security be reviewed, and by whom?
   214  
   215  How will UX be reviewed, and by whom?
   216  
   217  Consider including folks who also work outside the SIG or subproject.
   218  -->
   219  
   220  ## Design Details
   221  
   222  <!--
   223  This section should contain enough information that the specifics of your
   224  change are understandable. This may include API specs (though not always
   225  required) or even code snippets. If there's any ambiguity about HOW your
   226  proposal will be implemented, this is the place to discuss them.
   227  -->
   228  
   229  ### Summary
   230  
Kueuectl is a command-line tool for interacting with Kueue. It can serve as a standalone tool or be renamed/installed to act as a kubectl plugin. Using `kueuectl`, a user can manage workloads. A shorter, easier-to-type alternative name might be `kueue`, which I'll use for the remainder of this document. Thus, the main client takes the following format:
   232  
   233  ```bash
   234  kueue [subcommand] [flags]
   235  ```
   236  
   237  As a kubectl plugin, we would install it named as `kubectl-kueue` and then the interaction would be:
   238  
   239  ```bash
   240  kubectl kueue [subcommand] [flags]
   241  ```
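kubectl discovers plugins by looking for executables on `PATH` whose names start with `kubectl-`, so installing the tool as a plugin can be as small as a symlink. A minimal sketch, assuming the standalone binary is called `kueue`; the helper name `install_as_plugin` and the example paths are illustrative, not part of the proposal:

```bash
# install_as_plugin BINARY BINDIR
# Expose a standalone binary as `kubectl kueue` by linking it under the
# name kubectl-kueue. BINDIR must be on PATH for kubectl to find it.
install_as_plugin() {
  binary=$1
  bindir=$2
  mkdir -p "$bindir" || return 1
  ln -sf "$binary" "$bindir/kubectl-kueue"
}

# Example (paths are assumptions):
#   install_as_plugin /usr/local/bin/kueue "$HOME/.local/bin"
#   kubectl plugin list
```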
   242  
   243  We can start with basic query of workload metadata and status, and move toward a tool that can further create / delete or otherwise interact with workloads. This design document will proceed with proposed interactions and example tables, which would be printed in the terminal.
   244  
   245  
   246  ### Version
   247  
   248  ```bash
   249  kueue version
   250  ```
   251  
   252  ### Listing Jobs
   253  
A user may want to list workloads by job name. For this, an API group and kind are needed.
   255  The following should both work to produce the same result:
   256  
   257  ```bash
   258  kueue jobs --kind MPIJob <name>
   259  kueue jobs --kind kubeflow.org/MPIJob <name>
   260  ```
   261  
   262  | Namespace | Name     | API Group         | Kind        | Command  | Pods | Time   |     State |
   263  |-----------|----------|-------------------|-------------|----------|------|--------|-----------|
   264  | insects   | ant-123  | kubeflow.org      | MPIJob      | echo     | 2    | 4.323s | Completed | 
   265  
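Accepting both forms means the `--kind` value has to be split into an optional API group and a kind. A minimal sketch using POSIX parameter expansion; `split_kind` is a hypothetical helper, and how a bare kind such as `MPIJob` would be resolved to its group (for example through API discovery) is left open here:

```bash
# split_kind VALUE
# Print "<group> <kind>" for VALUE; the group is "-" when VALUE has no slash.
split_kind() {
  case $1 in
    */*) printf '%s %s\n' "${1%/*}" "${1##*/}" ;;
    *)   printf '%s %s\n' "-" "$1" ;;
  esac
}

split_kind kubeflow.org/MPIJob   # prints: kubeflow.org MPIJob
split_kind MPIJob                # prints: - MPIJob
```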
   266  
   267  ### Listing Workloads
   268  
A user might also want to see "all" workloads. We will need to decide if "all" means all namespaces, or (akin to `kubectl`) just those in the default namespace.
   270  
   271  ```bash
   272  kueue workloads
   273  ```
   274  
| Namespace | Name     | API Group          | Kind        | Pods | Time   | State     | Queue      |
|-----------|----------|--------------------|-------------|------|--------|-----------|------------|
| default   | taco-345 | flux-framework.org | MiniCluster | 3    | 2.322s | Running   | user-queue |
| default   | taco-123 | flux-framework.org | MiniCluster | 2    |        | Pending   | user-queue |
| insects   | ant-123  | kubeflow.org       | MPIJob      | 2    | 4.323s | Completed | user-queue |
   280  
Adding the name of a queue will filter the list down to it:
   282  
   283  ```bash
   284  # Generic command
   285  kueue workloads -q <queue-name>
   286  
   287  # As an example
   288  kueue workloads -q user-queue
   289  ```
   290  
   291  | Namespace | Name     | API Group         | Kind        | Pods | Time   |     State |
   292  |-----------|----------|-------------------|-------------|------|--------|-----------|
   293  |   default | taco-345 | flux-framework.org| MiniCluster | 3    | 2.322s |   Running | 
   294  |   default | taco-123 | flux-framework.org| MiniCluster | 2    |        |   Pending | 
   295  |   insects | ant-123  | kubeflow.org      | MPIJob      | 2    | 4.323s | Completed | 
   296  
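For reference, roughly the same filter can be assembled today from `kubectl` and `awk`, which is the kind of plumbing the plugin would hide. A sketch, assuming the Kueue `Workload` resource (`workloads.kueue.x-k8s.io`) and its `.spec.queueName` field; the function name `workloads_in_queue` is illustrative:

```bash
# workloads_in_queue QUEUE
# Print "<namespace> <name> <queue>" for every Workload whose
# .spec.queueName equals QUEUE, across all namespaces.
workloads_in_queue() {
  kubectl get workloads.kueue.x-k8s.io --all-namespaces --no-headers \
    -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,QUEUE:.spec.queueName' |
    awk -v q="$1" '$3 == q { print $1, $2, $3 }'
}

# Example:
#   workloads_in_queue user-queue
```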
For namespaces, kueue will follow the same practices as kubectl, using "default" as the default and otherwise
expecting a `-n` or `--namespace` argument for a custom namespace.
   299  
   300  ```bash
   301  kueue workloads --namespace insects
   302  ```
   303  If a user wants to set a default namespace, 
   304  this [should be supported akin to kubectl](https://www.cloudytuts.com/tutorials/kubernetes/how-to-set-default-kubernetes-namespace/):
   305  
   306  ```bash
   307  kueue set default-namespace <namespace>
   308  kueue set default-namespace insects
   309  kueue workloads
   310  ```
   311  
   312  | Namespace | Name     | API Group         | Kind        | Pods | Time   |     State | Queue |
   313  |-----------|----------|-------------------|-------------|------|--------|-----------|-------|
| insects   | ant-123  | kubeflow.org      | MPIJob      | 2    | 4.323s | Completed | user-queue |
   315  
   316  
We can also ask for a specific workload based on a job name. For example, if I have just submitted a job and know its name, this would be intuitive to type.
   318  
   319  ```bash
   320  # kueue workloads <name>
   321  kueue workloads job/taco-123
   322  ```
   323  
   324  | Namespace| Name      | API Group         | Kind        | Pods | Time   |     State | Queue |
   325  |----------|-----------|-------------------|-------------|------|--------|-----------|-------|
   326  | default  | taco-123  | flux-framework.org| MiniCluster | 2    |        |   Pending | user-queue | 
   327  
   328  
We likely also want to filter by state (or by other attributes, which are still to be decided):
   330  
   331  ```bash
   332  kueue workloads --state Pending
   333  ```
   334  
   335  | Namespace | Name     | API Group         | Kind        | Pods  | Time   |     State | Queue |
   336  |-----------|----------|-------------------|-------------|-------|--------|-----------|-------|
   337  | default   | taco-123 | flux-framework.org| MiniCluster |  2    |        |   Pending | user-queue | 
   338  
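Workload status is expressed through conditions rather than a single state value, so a `--state` filter presumably has to derive the displayed state. A minimal sketch of one possible mapping, assuming the `Admitted` and `Finished` condition types and the state names used in the tables above; `workload_state` and the precedence order are illustrative, not a decided design:

```bash
# workload_state ADMITTED FINISHED
# Map the status ("True", "False", or empty) of the Admitted and Finished
# conditions to a display state. Finished takes precedence over Admitted.
workload_state() {
  if [ "$2" = "True" ]; then
    echo Completed
  elif [ "$1" = "True" ]; then
    echo Running
  else
    echo Pending
  fi
}

workload_state True ""      # prints: Running
workload_state True True    # prints: Completed
workload_state "" ""        # prints: Pending
```

Note that an admitted workload is not necessarily running yet; the `Running` label here only mirrors the example tables.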
   339  
   340  ### Describe Workloads
   341  
Describe is intended to show more detailed information about one or more workloads. Akin to `kubectl describe`, we would stack them one on top of the other. Unlike kubectl, I think we should have the `-o json/yaml` options here (it never made sense to me that kubectl uses describe for richer metadata, but those output formats are only available with "get").
   343  
   344  ```bash
   345  kueue describe taco-123
   346  kueue describe taco-123 taco-345
   347  ```
   348  ```console
   349  Feature Set A:
   350    Field A            Field B     Field C
   351    --------           --------    ------
   352    cpu                650m (8%)   0 (0%)
   353    memory             100Mi (0%)  0 (0%)
   354  Events:
   355    Field A    Field B              Field C    
   356    ----       -------              ------- ... 
   357    Normal     Starting             Pancakes.
   358  ```
   359  
   360  While the describe tables do not include images or commands, the detailed view should.
Note that I will likely develop this further once I start working on the tool itself and get a sense of all the attributes available for workloads. For now I'm providing a generic template in anticipation of that. The output above will include all metadata that the workload offers, plus the additional features requested in the original prompt: the reasons for a workload being pending or misconfigured.
   362  
   363  The above should also provide different output formats:
   364  
   365  ```bash
   366  kueue describe taco-123 -o json
   367  kueue describe taco-123 -o yaml
   368  ```
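The "why is my workload pending" detail is expected to come from the Workload's status conditions. As a sketch of the underlying query, assuming a `QuotaReserved` condition whose `message` carries the explanation (which condition the plugin would actually surface is an assumption here, as is the helper name `pending_reason`):

```bash
# pending_reason NAMESPACE WORKLOAD
# Print the message of the QuotaReserved condition, if any.
pending_reason() {
  kubectl get workloads.kueue.x-k8s.io "$2" -n "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="QuotaReserved")].message}'
}

# Example:
#   pending_reason default taco-123
```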
   369  
   370  ### Watch Workload
   371  
   372  After submitting a workload, it's nice to be able to watch / stream logs. That should be easy to do.
   373  
   374  ```bash
   375  kueue watch taco-123
   376  ```
   377  ```console
   378  ... makin' tacos!
   379  ... taco 1 is cooked!
   380  ... taco 2 is cooking.
   381  ```
   382  
   383  Or a user may want to watch or stream events:
   384  
```bash
kueue watch taco-123 --events
```
   387  ```console
   388  <timestamp> <event1>
   389  <timestamp> <event2>
   390  ...
   391  ```
   392  
   393  
   394  ### Queues
   395  
I haven't used queues extensively so this likely needs to be expanded, but akin to listing workloads, we probably want to list queues. For all queues, this could be:
   397  
   398  ```bash
   399  kueue queues
   400  ```
   401  
   402  | Name | Admitted | Pending | State |
   403  |------|-----------|---------|------|
   404  | user-queue | 1 | 1 | Operational |
   405  
   406  To filter to a specific queue:
   407  
   408  ```bash
   409  kueue queues user-queue
   410  ```
   411  
We likely want to be able to provide more metadata, and could either add a `describe-queue` subcommand or have the second example above show the more verbose information. I prefer the first idea. We also likely want to expand these subcommands to include the details the original prompt called for; I'm not familiar enough with Kueue yet to add them. This command should also support YAML/JSON output.
   413  
   414  ```bash
   415  kueue queues -o json
   416  ```
   417  
   418  
   419  ### Possible command extensions
   420  
In the future, if we can make this a full-fledged client for interacting with workloads, we could consider the following commands.
   422  Note that these are not proposed to be in the first stage of this design document.
   423  
   424  #### Cancel Workloads
   425  
A request to cancel would be akin to deleting the custom resource. A "cancel" is more intuitive and natural than a delete request for this use case.
   427  This implementation will be tricky because we need to make the request to the underlying controller.
   428  
   429  
   430  ```bash
   431  kueue cancel taco-123
   432  ```
   433  
   434  It might also be useful to request a cancel all, limited to the permission that the user has, a namespace, or other filter. This likely needs a confirmation.
   435  
   436  ```bash
   437  kueue cancel --all
   438  > Are you sure you want to cancel all workloads y/n?
   439  ```
   440  
   441  Or without the prompt:
   442  
   443  ```bash
   444  kueue cancel --all --force
   445  ```
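The confirmation flow described above could look like the following sketch. `confirm_cancel_all` is a hypothetical helper; it treats only `y` or `Y` as consent and skips the prompt when `--force` is passed:

```bash
# confirm_cancel_all [--force]
# Return 0 when the cancellation may proceed, 1 otherwise.
confirm_cancel_all() {
  if [ "${1:-}" = "--force" ]; then
    return 0
  fi
  # Prompt on stderr so that stdout stays clean for piping.
  printf 'Are you sure you want to cancel all workloads y/n? ' >&2
  read -r answer || return 1
  case $answer in
    y|Y) return 0 ;;
    *)   return 1 ;;
  esac
}
```

Treating end-of-input as "no" keeps non-interactive runs from cancelling anything unless `--force` is given.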
   446  
   447  Or within a specific filter:
   448  
   449  ```bash
   450  kueue cancel --all --namespace insects
   451  ```
   452  
   453  To support the multi-tenancy use case, filters for each of local-queue, cluster-queue and cohort will be allowed.
   454  
   455  
   456  #### Submit
   457  
```bash
# Submit, either a yaml as is
kueue submit workload.yaml

# or a simpler abstraction that uses some kind of default or template
# This would actually be really cool if we could map a community-developed job or workload spec (that works for other tools) into kueue
kueue submit <something else>
```
   466  
   467  #### Update
   468  
   469  A workload can be updated by its creator, and this happens via kueue's internals on the level of the workload controller and scheduler.
As an example, here is what updating an attribute on a workload pod might look like:
   471  
   472  ```bash
   473  # Note this format follows how helm sets variables
   474  kueue update set path.to.attribute=thing
   475  
# This does not, but a user may want to remove an attribute entirely (instead of trying to set it to a null default)
   477  kueue update remove path.to.attribute
   478  ```
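One way to ground the `set` and `remove` forms is to translate them into JSON Patch operations, which `kubectl patch --type=json` already accepts. A sketch under several assumptions: `to_json_patch` is a hypothetical helper, values are always emitted as JSON strings, and attribute names containing `.`, `/`, `~`, or quotes are not handled:

```bash
# to_json_patch set ATTR=VALUE | to_json_patch remove ATTR
# Print a one-element JSON Patch document for the dotted attribute path.
# The "add" operation creates or replaces the member; its parent must exist.
to_json_patch() {
  op=$1
  spec=$2
  case $op in
    set)
      attr=${spec%%=*}      # text before the first "="
      value=${spec#*=}      # text after the first "="
      pointer=/$(printf '%s' "$attr" | tr '.' '/')
      printf '[{"op":"add","path":"%s","value":"%s"}]\n' "$pointer" "$value"
      ;;
    remove)
      pointer=/$(printf '%s' "$spec" | tr '.' '/')
      printf '[{"op":"remove","path":"%s"}]\n' "$pointer"
      ;;
    *)
      return 1
      ;;
  esac
}

to_json_patch set path.to.attribute=thing
# prints: [{"op":"add","path":"/path/to/attribute","value":"thing"}]
```

The result could then be passed along as `kubectl patch <resource> <name> --type=json -p "$(to_json_patch set path.to.attribute=thing)"`.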
   479  
Different kinds of updates will need to be defined. To namespace the update commands by the kind of object being updated, we could take either of the following
client approaches:
   482  
   483  ```bash
   484  kueue update pods path.to.attribute=thing
   485  kueue update-pods path.to.attribute=thing
   486  
   487  kueue update cohort ...
   488  kueue update-cohort 
   489  
   490  kueue update cluster-queue
   491  kueue update-cluster-queue
   492  ```
   493  
   494  ### Resources
   495  
Get the resources requested or allocated for a workload (useful for debugging):
   497  
   498  ```bash
   499  kueue resource taco-123
   500  ```
   501  
   502  | Name | Quantity |
   503  |------|-----------|
   504  | cpu | 4 |
   505  | memory | 2Gi |
   506  
   507  ```bash
   508  kueue resource taco-123 -o yaml
   509  ```
   510  
   511  This subcommand could provide other types of resources - we should consider this!
   512  
   513  
   514  ### Test Plan
   515  
   516  <!--
   517  **Note:** *Not required until targeted at a release.*
   518  The goal is to ensure that we don't accept enhancements with inadequate testing.
   519  
   520  All code is expected to have adequate tests (eventually with coverage
   521  expectations). Please adhere to the [Kubernetes testing guidelines][testing-guidelines]
   522  when drafting this test plan.
   523  
   524  [testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md
   525  -->
   526  
   527  The kueue plugin should be tested for functionality alongside the current head of
   528  the repository.
   529  
   530  [x] I understand the owners of the involved components may require updates to
   531  existing tests to make this code solid enough prior to committing the changes necessary
   532  to implement this enhancement.
   533  
   534  ##### Prerequisite testing updates
   535  
   536  <!--
   537  Based on reviewers feedback describe what additional tests need to be added prior
   538  implementing this enhancement to ensure the enhancements have also solid foundations.
   539  -->
   540  
   541  
   542  #### Unit Tests
   543  
   544  <!--
   545  In principle every added code should have complete unit test coverage, so providing
   546  the exact set of tests will not bring additional value.
   547  However, if complete unit test coverage is not possible, explain the reason of it
   548  together with explanation why this is acceptable.
   549  -->
   550  
   551  <!--
   552  Additionally, try to enumerate the core package you will be touching
   553  to implement this enhancement and provide the current unit coverage for those
   554  in the form of:
   555  - <package>: <date> - <current test coverage>
   556  
   557  This can inform certain test coverage improvements that we want to do before
   558  extending the production code to implement this enhancement.
   559  -->
   560  
   561  #### Integration tests
   562  
   563  
   564  <!--
   565  Describe what tests will be added to ensure proper quality of the enhancement.
   566  
   567  After the implementation PR is merged, add the names of the tests here.
   568  -->
   569  
   570  ### Graduation Criteria
   571  
   572  <!--
   573  
   574  Clearly define what it means for the feature to be implemented and
   575  considered stable.
   576  
   577  If the feature you are introducing has high complexity, consider adding graduation
   578  milestones with these graduation criteria:
   579  - [Maturity levels (`alpha`, `beta`, `stable`)][maturity-levels]
   580  - [Feature gate][feature gate] lifecycle
   581  - [Deprecation policy][deprecation-policy]
   582  
   583  [feature gate]: https://git.k8s.io/community/contributors/devel/sig-architecture/feature-gates.md
   584  [maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions
   585  [deprecation-policy]: https://kubernetes.io/docs/reference/using-api/deprecation-policy/
   586  -->
   587  
   588  N/A
   589  
   590  ## Implementation History
   591  
   592  <!--
   593  Major milestones in the lifecycle of a KEP should be tracked in this section.
   594  Major milestones might include:
   595  - the `Summary` and `Motivation` sections being merged, signaling SIG acceptance
   596  - the `Proposal` section being merged, signaling agreement on a proposed design
   597  - the date implementation started
   598  - the first Kubernetes release where an initial version of the KEP was available
   599  - the version of Kubernetes where the KEP graduated to general availability
   600  - when the KEP was retired or superseded
   601  -->
   602  
   603  ## Drawbacks
   604  
   605  ## Alternatives
   606  
   607  <!--
   608  What other approaches did you consider, and why did you rule them out? These do
   609  not need to be as detailed as the proposal, but should include enough
   610  information to express the idea and why it was not acceptable.
   611  -->
   612  
   613  Require a user to write their own interactions with Kueue via API.
   614  
   615  **Reasons for discarding/deferring**
   616  
This is a complex thing to do, and should not be required of a user.