# KEP-487: Kubectl plugin for listing objects

<!--
This is the title of your KEP. Keep it short, simple, and descriptive. A good
title can help communicate what the KEP is and should be considered as part of
any review.
-->

<!--
A table of contents is helpful for quickly jumping to sections of a KEP and for
highlighting any additional information provided beyond the standard KEP
template.

Ensure the TOC is wrapped with
  <code>&lt;!-- toc --&rt;&lt;!-- /toc --&rt;</code>
tags, and then generate with `hack/update-toc.sh`.
-->

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories (Optional)](#user-stories-optional)
    - [Story 1](#story-1)
    - [Story 2](#story-2)
  - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Summary](#summary-1)
  - [Version](#version)
  - [Listing Jobs](#listing-jobs)
  - [Listing Workloads](#listing-workloads)
  - [Describe Workloads](#describe-workloads)
  - [Watch Workload](#watch-workload)
  - [Queues](#queues)
  - [Possible command extensions](#possible-command-extensions)
    - [Cancel Workloads](#cancel-workloads)
    - [Submit](#submit)
    - [Update](#update)
  - [Resources](#resources)
  - [Test Plan](#test-plan)
      - [Prerequisite testing updates](#prerequisite-testing-updates)
    - [Unit Tests](#unit-tests)
    - [Integration tests](#integration-tests)
  - [Graduation Criteria](#graduation-criteria)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
<!-- /toc -->

## Summary

We would like to develop a Kubectl plugin for Kueue that serves as a command-line workload management tool.
It should be able to manage and list workloads and ClusterQueues with comprehensive information.


<!--
This section is incredibly important for producing high-quality, user-focused
documentation such as release notes or a development roadmap. It should be
possible to collect this information before implementation begins, in order to
avoid requiring implementors to split their attention between writing release
notes and implementing the feature itself. KEP editors and SIG Docs
should help to ensure that the tone and content of the `Summary` section is
useful for a wide audience.

A good summary is probably at least a paragraph in length.

Both in this section and below, follow the guidelines of the [documentation
style guide]. In particular, wrap lines to a reasonable length, to make it
easier for reviewers to cite specific portions, and to minimize diff churn on
updates.

[documentation style guide]: https://github.com/kubernetes/community/blob/master/contributors/guide/style-guide.md
-->

## Motivation

The basic information provided by `kubectl get` does not include details such as whether a workload is admitted or why it's pending.
As this information is available in compound objects, we need to perform different queries to the API to get it.
Similarly, `kubectl` cannot tell the status of a ClusterQueue that is misconfigured, nor does it
provide details of an accompanying Workload listed alongside a job. We want to expose this information
to the user, and in a way that feels comfortable and familiar.


<!--
This section is for explicitly listing the motivation, goals, and non-goals of
this KEP. Describe why the change is important and the benefits to users. The
motivation section can optionally provide links to [experience reports] to
demonstrate the interest in a KEP within the wider Kubernetes community.

[experience reports]: https://github.com/golang/go/wiki/ExperienceReports
-->

### Goals

A successful plugin should be able to answer the following questions:

1. What workloads are there in the user-queue?
2. What workloads are there in this specific namespace?
3. What workloads are there in this specific state?
4. Why is my workload pending?
5. Was there an error admitting my workload?
6. Is a ClusterQueue misconfigured, or is there some other issue? (This is more of an admin command?)
7. All of the above, but instead of a table I want JSON/YAML.

As part of this scoped work, the plugin should also be added to the [krew plugins package manager](https://krew.sigs.k8s.io/plugins/).

<!--
List the specific goals of the KEP. What is it trying to achieve? How will we
know that this has succeeded?
-->

### Non-Goals

- Serve as a duplicate implementation of information available via kubectl already
- Streaming logs

<!--
What is out of scope for this KEP? Listing non-goals helps to focus discussion
and make progress.
-->

## Proposal

We propose creating a command-line tool that can serve as a Kubectl plugin and exposes this missing information. We also propose a design that mimics existing tools for other workload managers across HPC and cloud that users are already comfortable with, to ease adoption of both the tool and the approach of submitting workloads to Kubernetes. The main command-line interactions will take the following shape:

```bash
# As a Kubectl plugin
kubectl kueue <options> <command> <args>

# As a standalone command-line tool
kueue <options> <command> <args>
```

For comparison, the Armada team developed a [general client](https://github.com/armadaproject/armada/tree/master/cmd/armadactl) that uses this strategy of being renamed and folded into kubectl.
This is an ideal approach because it presents an interface to Kueue that feels more like a standalone job manager, and does not force users to live entirely in Kubernetes abstractions.


<!--
This is where we get down to the specifics of what the proposal actually is.
This should have enough detail that reviewers can understand exactly what
you're proposing, but should not include things like API designs or
implementation. What is the desired outcome and how do we measure success?.
The "Design Details" section below is for the real
nitty-gritty.
-->

### User Stories (Optional)

<!--
Detail the things that people will be able to do if this KEP is implemented.
Include as much detail as possible so that people can understand the "how" of
the system. The goal here is to make this feel real for users without getting
bogged down.
-->

#### Story 1

As a user submitting workloads, I want to easily see the status of an entire Workload,
or check on why my workload is not being admitted. I can do this with the new proposed plugin.

```bash
kueue describe taco-123
```

#### Story 2

As a user coming from High Performance Computing (HPC), I am not comfortable with using
`kubectl` and don't want to learn an entirely new means to interact with workloads.
The proposed plugin makes the transition much easier for me. The example below
compares the proposed list command for this plugin against popular HPC resource
managers.

```bash
# Kubernetes Kueue
kueue workloads <queue-name>

# SLURM
squeue --partition <queue-name>

# Flux Framework
flux jobs -a --queue=<queue-name>

# HTCondor
condor_q
```

While this is just a sampling, the high-level idea is that high performance computing users
are comfortable with and accustomed to having command-line tools to list and otherwise
interact with jobs. Having a similar tool for Kubernetes, and specifically for
submitting jobs, will help to bridge the two communities.

### Notes/Constraints/Caveats (Optional)

<!--
What are the caveats to the proposal?
What are some important details that didn't come across above?
Go in to as much detail as necessary here.
This might be a good place to talk about core concepts and how they relate.
-->

### Risks and Mitigations

Keeping the plugin in sync with the latest version of Kueue will add
maintainer responsibilities.

<!--
What are the risks of this proposal, and how do we mitigate? Think broadly.
For example, consider both security and how this will impact the larger
Kubernetes ecosystem.

How will security be reviewed, and by whom?

How will UX be reviewed, and by whom?

Consider including folks who also work outside the SIG or subproject.
-->

## Design Details

<!--
This section should contain enough information that the specifics of your
change are understandable. This may include API specs (though not always
required) or even code snippets. If there's any ambiguity about HOW your
proposal will be implemented, this is the place to discuss them.
-->

### Summary

Kueuectl is a command-line tool for interacting with Kueue, and can serve as a standalone tool or be renamed/installed to act as a kubectl plugin. Using `kueuectl` a user can manage workloads.
An alternative name that is shorter and easier to type might be `kueue`, which I'll use for the remainder of this document. Thus, the main client takes the following format:

```bash
kueue [subcommand] [flags]
```

As a kubectl plugin, we would install it under the name `kubectl-kueue`, and the interaction would then be:

```bash
kubectl kueue [subcommand] [flags]
```

We can start with basic queries of workload metadata and status, and move toward a tool that can also create, delete, or otherwise interact with workloads. This design document will proceed with proposed interactions and example tables, which would be printed in the terminal.


### Version

```bash
kueue version
```

### Listing Jobs

A user may want to list workloads based on job name. This requires an API group and kind.
The following should both work to produce the same result:

```bash
kueue jobs --kind MPIJob <name>
kueue jobs --kind kubeflow.org/MPIJob <name>
```

| Namespace | Name     | API Group    | Kind   | Command | Pods | Time   | State     |
|-----------|----------|--------------|--------|---------|------|--------|-----------|
| insects   | ant-123  | kubeflow.org | MPIJob | echo    | 2    | 4.323s | Completed |


### Listing Workloads

A user might also want to see "all" workloads. We will need to decide if "all" means all namespaces, or (akin to `kubectl`) just those in the default namespace.

```bash
kueue workloads
```

| Namespace | Name     | API Group          | Kind        | Pods | Time   | State     | Queue      |
|-----------|----------|--------------------|-------------|------|--------|-----------|------------|
| default   | taco-345 | flux-framework.org | MiniCluster | 3    | 2.322s | Running   | user-queue |
| default   | taco-123 | flux-framework.org | MiniCluster | 2    |        | Pending   | user-queue |
| insects   | ant-123  | kubeflow.org       | MPIJob      | 2    | 4.323s | Completed | user-queue |

Adding the name of a queue will filter down to it:

```bash
# Generic command
kueue workloads -q <queue-name>

# As an example
kueue workloads -q user-queue
```

| Namespace | Name     | API Group          | Kind        | Pods | Time   | State     |
|-----------|----------|--------------------|-------------|------|--------|-----------|
| default   | taco-345 | flux-framework.org | MiniCluster | 3    | 2.322s | Running   |
| default   | taco-123 | flux-framework.org | MiniCluster | 2    |        | Pending   |
| insects   | ant-123  | kubeflow.org       | MPIJob      | 2    | 4.323s | Completed |

For namespaces, kueue will use the same practices as kubectl, using "default" as the default, and otherwise
expecting a `-n` or `--namespace` argument for a custom namespace.

```bash
kueue workloads --namespace insects
```

If a user wants to set a default namespace,
this [should be supported akin to kubectl](https://www.cloudytuts.com/tutorials/kubernetes/how-to-set-default-kubernetes-namespace/):

```bash
kueue set default-namespace <namespace>
kueue set default-namespace insects
kueue workloads
```

| Namespace | Name     | API Group    | Kind   | Pods | Time   | State     | Queue      |
|-----------|----------|--------------|--------|------|--------|-----------|------------|
| insects   | ant-123  | kubeflow.org | MPIJob | 2    | 4.323s | Completed | user-queue |


We can also ask for a specific workload based on a job name.
For example, if I just submitted a job and know its name, this would be intuitive to type.

```bash
# kueue workloads <name>
kueue workloads job/taco-123
```

| Namespace | Name     | API Group          | Kind        | Pods | Time | State   | Queue      |
|-----------|----------|--------------------|-------------|------|------|---------|------------|
| default   | taco-123 | flux-framework.org | MiniCluster | 2    |      | Pending | user-queue |


We likely also want to filter by state (or by other attributes; which ones is still to be determined).

```bash
kueue workloads --state Pending
```

| Namespace | Name     | API Group          | Kind        | Pods | Time | State   | Queue      |
|-----------|----------|--------------------|-------------|------|------|---------|------------|
| default   | taco-123 | flux-framework.org | MiniCluster | 2    |      | Pending | user-queue |


### Describe Workloads

Describe is intended to show more detailed information about one or more workloads. Akin to `kubectl describe`, we would stack them one on top of the other. Unlike kubectl, I think we should have the `-o json/yaml` options here (it never made sense to me that kubectl uses describe for richer metadata, but those output formats are only available with "get").

```bash
kueue describe taco-123
kueue describe taco-123 taco-345
```
```console
Feature Set A:
  Field A    Field B     Field C
  --------   --------    ------
  cpu        650m (8%)   0 (0%)
  memory     100Mi (0%)  0 (0%)
Events:
  Field A  Field B   Field C
  ----     -------   -------   ...
  Normal   Starting  Pancakes.
```

While the describe tables do not include images or commands, the detailed view should.
Note that I will likely develop this when I dig into working on the tool itself and get a sense of all the attributes available for workloads. Right now I'm providing a generic template anticipating that.
The above will include all metadata that the workload offers, plus the additional details requested in the original prompt, such as reasons for pending or misconfiguration.

The above should also provide different output formats:

```bash
kueue describe taco-123 -o json
kueue describe taco-123 -o yaml
```

### Watch Workload

After submitting a workload, it's nice to be able to watch / stream logs. That should be easy to do.

```bash
kueue watch taco-123
```
```console
... makin' tacos!
... taco 1 is cooked!
... taco 2 is cooking.
```

Or a user may want to watch or stream events:

```bash
kueue watch taco-123 --events
```
```console
<timestamp> <event1>
<timestamp> <event2>
...
```


### Queues

I haven't used queues extensively so this likely needs to be expanded, but akin to listing workloads, we probably want to list queues. For all queues, this could be:

```bash
kueue queues
```

| Name       | Admitted | Pending | State       |
|------------|----------|---------|-------------|
| user-queue | 1        | 1       | Operational |

To filter to a specific queue:

```bash
kueue queues user-queue
```

We likely want to be able to provide more metadata, and either we could add a `describe-queue` subcommand, or have the second example above show the more verbose information. I think I like the first idea. We also likely want to expand these subcommands to include the details that the original prompt called for. I'm not familiar enough with Kueue yet to add them. This command should also support YAML/JSON.

```bash
kueue queues -o json
```


### Possible command extensions

In the future, if we can make this a full-fledged client for interacting with workloads, we could consider the following commands.
Note that these are not proposed to be in the first stage of this design document.

#### Cancel Workloads

A request to cancel would be akin to deleting the custom resource. A "cancel" is more intuitive / natural than a delete request for this use case.
This implementation will be tricky because we need to make the request to the underlying controller.


```bash
kueue cancel taco-123
```

It might also be useful to request a cancel of all workloads, limited by the permissions that the user has, a namespace, or another filter. This likely needs a confirmation.

```bash
kueue cancel --all
> Are you sure you want to cancel all workloads y/n?
```

Or without the prompt:

```bash
kueue cancel --all --force
```

Or within a specific filter:

```bash
kueue cancel --all --namespace insects
```

To support the multi-tenancy use case, filters for each of local-queue, cluster-queue and cohort will be allowed.


#### Submit

```bash
# Submit, either a yaml as is
kueue submit workload.yaml

# or a simpler abstraction that uses some kind of default or template
# This would actually be really cool if we could map a community developed job or workload spec (that works for other tools) into kueue
kueue submit <something else>
```

#### Update

A workload can be updated by its creator, and this happens via kueue's internals on the level of the workload controller and scheduler.
As an example, here is what updating an attribute on a workload pod might look like:

```bash
# Note this format follows how helm sets variables
kueue update set path.to.attribute=thing

# This does not follow helm, but may be wanted to remove an attribute entirely (instead of trying to set the default null type)
kueue update remove path.to.attribute
```

Different kinds of updates will need to be defined.
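
The `path.to.attribute=thing` form above could be handled along these lines. This is a minimal Python sketch operating on a plain nested dictionary, assuming dots are the only path separator and values stay strings. Helm's real `--set` parsing also handles lists, escaping, and typed values, none of which is modeled here. The function names `set_path` and `remove_path` are hypothetical and not part of Kueue.

```python
def set_path(obj, assignment):
    """Apply 'path.to.attribute=value' to a nested dict, creating parents as needed."""
    path, sep, value = assignment.partition("=")
    if not sep or not path:
        raise ValueError(f"expected path=value, got {assignment!r}")
    keys = path.split(".")
    node = obj
    for key in keys[:-1]:
        child = node.get(key)
        if not isinstance(child, dict):
            # Missing or scalar parent: replace it with a mapping so we can descend.
            child = {}
            node[key] = child
        node = child
    node[keys[-1]] = value  # values are kept as strings in this sketch
    return obj


def remove_path(obj, path):
    """Delete the attribute at 'path.to.attribute'; a missing path is a no-op."""
    keys = path.split(".")
    node = obj
    for key in keys[:-1]:
        node = node.get(key)
        if not isinstance(node, dict):
            return obj
    node.pop(keys[-1], None)
    return obj


# Hypothetical workload fragment, used only to exercise the two helpers.
workload = {"spec": {"priority": "1"}}
set_path(workload, "spec.queueName=user-queue")
remove_path(workload, "spec.priority")
print(workload)
```

Treating a removal of a missing path as a no-op keeps the command idempotent, which seems preferable for a CLI, but that is a design choice still to be decided.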
To allow for an update namespace, we could take either of the following
client approaches:

```bash
kueue update pods path.to.attribute=thing
kueue update-pods path.to.attribute=thing

kueue update cohort ...
kueue update-cohort

kueue update cluster-queue
kueue update-cluster-queue
```

### Resources

Get resources requested or allocated for a workload (can be used to debug).

```bash
kueue resource taco-123
```

| Name   | Quantity |
|--------|----------|
| cpu    | 4        |
| memory | 2Gi      |

```bash
kueue resource taco-123 -o yaml
```

This subcommand could provide other types of resources - we should consider this!


### Test Plan

<!--
**Note:** *Not required until targeted at a release.*
The goal is to ensure that we don't accept enhancements with inadequate testing.

All code is expected to have adequate tests (eventually with coverage
expectations). Please adhere to the [Kubernetes testing guidelines][testing-guidelines]
when drafting this test plan.

[testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md
-->

The kueue plugin should be tested for functionality alongside the current head of
the repository.

[x] I understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

##### Prerequisite testing updates

<!--
Based on reviewers feedback describe what additional tests need to be added prior
implementing this enhancement to ensure the enhancements have also solid foundations.
-->


#### Unit Tests

<!--
In principle every added code should have complete unit test coverage, so providing
the exact set of tests will not bring additional value.
However, if complete unit test coverage is not possible, explain the reason of it
together with explanation why this is acceptable.
-->

<!--
Additionally, try to enumerate the core package you will be touching
to implement this enhancement and provide the current unit coverage for those
in the form of:
- <package>: <date> - <current test coverage>

This can inform certain test coverage improvements that we want to do before
extending the production code to implement this enhancement.
-->

#### Integration tests


<!--
Describe what tests will be added to ensure proper quality of the enhancement.

After the implementation PR is merged, add the names of the tests here.
-->

### Graduation Criteria

<!--

Clearly define what it means for the feature to be implemented and
considered stable.

If the feature you are introducing has high complexity, consider adding graduation
milestones with these graduation criteria:
- [Maturity levels (`alpha`, `beta`, `stable`)][maturity-levels]
- [Feature gate][feature gate] lifecycle
- [Deprecation policy][deprecation-policy]

[feature gate]: https://git.k8s.io/community/contributors/devel/sig-architecture/feature-gates.md
[maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions
[deprecation-policy]: https://kubernetes.io/docs/reference/using-api/deprecation-policy/
-->

N/A

## Implementation History

<!--
Major milestones in the lifecycle of a KEP should be tracked in this section.
Major milestones might include:
- the `Summary` and `Motivation` sections being merged, signaling SIG acceptance
- the `Proposal` section being merged, signaling agreement on a proposed design
- the date implementation started
- the first Kubernetes release where an initial version of the KEP was available
- the version of Kubernetes where the KEP graduated to general availability
- when the KEP was retired or superseded
-->

## Drawbacks

## Alternatives

<!--
What other approaches did you consider, and why did you rule them out? These do
not need to be as detailed as the proposal, but should include enough
information to express the idea and why it was not acceptable.
-->

Require a user to write their own interactions with Kueue via the API.

**Reasons for discarding/deferring**

This is a complex thing to do, and should not be required of a user.
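
To illustrate the complexity, here is a minimal Python sketch of the kind of script a user would otherwise write by hand to answer "why is my workload pending?". It assumes the Workload has already been fetched as JSON (for example with `kubectl get workload <name> -o json`) and that pending reasons are reported through a `QuotaReserved` entry in `status.conditions`. The sample payload, its message text, and the helper names `find_condition` and `explain_pending` are illustrative assumptions, not part of Kueue.

```python
import json

# Illustrative, trimmed Workload payload (an assumption, not real cluster output).
SAMPLE_WORKLOAD = """
{
  "metadata": {"name": "taco-123", "namespace": "default"},
  "spec": {"queueName": "user-queue"},
  "status": {
    "conditions": [
      {
        "type": "QuotaReserved",
        "status": "False",
        "reason": "Pending",
        "message": "insufficient quota in ClusterQueue"
      }
    ]
  }
}
"""


def find_condition(workload, cond_type):
    """Return the first status condition of the given type, or None."""
    for cond in workload.get("status", {}).get("conditions", []):
        if cond.get("type") == cond_type:
            return cond
    return None


def explain_pending(workload):
    """Summarize in one line whether the workload holds quota and, if not, why."""
    name = workload["metadata"]["name"]
    cond = find_condition(workload, "QuotaReserved")
    if cond is None:
        return f"{name}: no QuotaReserved condition reported yet"
    if cond.get("status") == "True":
        return f"{name}: quota reserved"
    reason = cond.get("reason", "Unknown")
    message = cond.get("message", "")
    return f"{name}: pending ({reason}): {message}"


if __name__ == "__main__":
    print(explain_pending(json.loads(SAMPLE_WORKLOAD)))
```

Even this small script covers only one of the questions listed under Goals, and every user would need to maintain their own copy as the API evolves.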