# KEP-976: Plain Pods

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories (Optional)](#user-stories-optional)
    - [Story 1](#story-1)
    - [Story 2](#story-2)
    - [Story 3](#story-3)
  - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
    - [Skipping Pods belonging to queued objects](#skipping-pods-belonging-to-queued-objects)
    - [Pods replaced on failure](#pods-replaced-on-failure)
    - [Controllers creating too many Pods](#controllers-creating-too-many-pods)
  - [Risks and Mitigations](#risks-and-mitigations)
    - [Increased memory usage](#increased-memory-usage)
    - [Limited size for annotation values](#limited-size-for-annotation-values)
- [Design Details](#design-details)
  - [Gating Pod Scheduling](#gating-pod-scheduling)
    - [Pods subject to queueing](#pods-subject-to-queueing)
  - [Constructing Workload objects](#constructing-workload-objects)
    - [Single Pods](#single-pods)
    - [Groups of Pods created beforehand](#groups-of-pods-created-beforehand)
    - [Groups of pods where driver generates workers](#groups-of-pods-where-driver-generates-workers)
  - [Tracking admitted and finished Pods](#tracking-admitted-and-finished-pods)
  - [Retrying Failed Pods](#retrying-failed-pods)
  - [Dynamically reclaiming Quota](#dynamically-reclaiming-quota)
  - [Metrics](#metrics)
  - [Test Plan](#test-plan)
    - [Prerequisite testing updates](#prerequisite-testing-updates)
    - [Unit Tests](#unit-tests)
    - [Integration tests](#integration-tests)
  - [Graduation Criteria](#graduation-criteria)
    - [Beta](#beta)
    - [GA](#ga)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
  - [Users create a Workload object beforehand](#users-create-a-workload-object-beforehand)
<!-- /toc -->

## Summary

Some batch applications create plain Pods directly, as opposed to managing the Pods through the Job
API or a CRD that supports [suspend](https://kubernetes.io/docs/concepts/workloads/controllers/job/#suspending-a-job) semantics.
This KEP proposes mechanisms to queue plain Pods through Kueue, individually or in groups,
leveraging [pod scheduling gates](https://kubernetes.io/docs/concepts/scheduling-eviction/pod-scheduling-readiness/).

## Motivation

Some batch systems or AI/ML frameworks create plain Pods to represent jobs or tasks of a job.
Currently, Kueue relies on the Job API or CRDs that support suspend semantics
to control whether the Pods of a job can exist and can be scheduled to Nodes.

While it is sometimes possible to wrap Pods in a CRD or migrate to the Job API, doing so could be
costly for framework or platform developers.
In some scenarios, the framework doesn't know how many Pods belong to a single job. In more extreme
cases, Pods are created dynamically once the first Pod starts running. These are sometimes known
as elastic jobs.

Pod scheduling gates, a recent enhancement to Kubernetes introduced as Alpha in 1.26 and as Beta in
1.27, allow an external controller to prevent kube-scheduler from scheduling Pods. Kueue can make
use of this API to implement queueing semantics for Pods.

### Goals

- Support queueing of individual Pods.
- Support queueing of groups of Pods of fixed size, identified by a common label.
- Opt Pods from specific namespaces into or out of queueing.
- Support for [dynamic reclaiming of quota](https://kueue.sigs.k8s.io/docs/concepts/workload/#dynamic-reclaim)
  for succeeded Pods.

### Non-Goals

- Support for [partial-admission](https://github.com/kubernetes-sigs/kueue/issues/420).

Since all Pods are already created, an implementation of partial admission would imply the
deletion of some Pods. It is not clear whether this matches users' expectations, as opposed to
support for elastic groups.

- Support elastic groups of Pods, where the number of Pods changes after the job started.

While these jobs are one of the motivations for this KEP, the current proposal doesn't support
them. They can be addressed in follow-up KEPs.

- Support for advanced Pod retry policies.

Kueue shouldn't re-implement core functionality that is already available in the Job API.
In particular, Kueue does not re-create failed Pods.
More specifically, in the case of re-admission after preemption, it does not
re-create the Pods it deleted.

- Tracking usage of Pods that were not queued through Kueue.

## Proposal

<!--
This is where we get down to the specifics of what the proposal actually is.
This should have enough detail that reviewers can understand exactly what
you're proposing, but should not include things like API designs or
implementation. What is the desired outcome and how do we measure success?.
The "Design Details" section below is for the real
nitty-gritty.
-->

### User Stories (Optional)

#### Story 1

As a platform developer, I can queue plain Pods, or Pods owned by an object not integrated with
Kueue, by simply adding a queue name to the Pods through a label.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: foo
  namespace: pod-namespace
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  containers:
  - name: job
    image: hello-world
    resources:
      requests:
        cpu: 1m
```

#### Story 2

As a platform developer, I can queue groups of Pods that might or might not have the same shape
(Pod specs).
In addition to the queue name, I can specify how many Pods belong to the group.
The Pods of a job following a driver-workers paradigm would look as follows:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: job-driver
  labels:
    kueue.x-k8s.io/queue-name: user-queue
    kueue.x-k8s.io/pod-group-name: pod-group
  annotations:
    kueue.x-k8s.io/pod-group-total-count: "3"
spec:
  containers:
  - name: job
    image: driver
    resources:
      requests:
        cpu: 1m
---
apiVersion: v1
kind: Pod
metadata:
  name: job-worker-0
  labels:
    kueue.x-k8s.io/queue-name: user-queue
    kueue.x-k8s.io/pod-group-name: pod-group
  annotations:
    kueue.x-k8s.io/pod-group-total-count: "3"
spec:
  containers:
  - name: job
    image: worker
    args: ["--index", "0"]
    resources:
      requests:
        cpu: 1m
        vendor.com/gpu: 1
---
apiVersion: v1
kind: Pod
metadata:
  name: job-worker-1
  labels:
    kueue.x-k8s.io/queue-name: user-queue
    kueue.x-k8s.io/pod-group-name: pod-group
  annotations:
    kueue.x-k8s.io/pod-group-total-count: "3"
spec:
  containers:
  - name: job
    image: worker
    args: ["--index", "1"]
    resources:
      requests:
        cpu: 1m
        vendor.com/gpu: 1
```

#### Story 3

Motivation: in frameworks like Spark, worker Pods are only created by the driver Pod after it
starts running. As such, the worker Pod specs cannot be predicted beforehand. Even though the job
could be considered elastic, users generally wouldn't want to start a Spark driver if no workers
would fit.

As a Spark user, I can queue the driver Pod while providing the expected shape of the worker Pods.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: job-driver
  labels:
    kueue.x-k8s.io/queue-name: user-queue
    kueue.x-k8s.io/pod-group-name: pod-group
  annotations:
    # If the template is left empty, it means that it will match the spec of this pod.
    kueue.x-k8s.io/pod-group-sets: |-
      [
        {
          name: driver,
          count: 1,
        },
        {
          name: workers,
          count: 10,
          template:
            spec:
              containers:
              - name: worker
                resources:
                  requests:
                    cpu: 1m
        }
      ]
spec:
  containers:
  - name: job
    image: hello-world
    resources:
      requests:
        cpu: 1m
---
apiVersion: v1
kind: Pod
metadata:
  name: job-worker-1
  labels:
    kueue.x-k8s.io/queue-name: user-queue
    kueue.x-k8s.io/pod-group-name: pod-group
    kueue.x-k8s.io/pod-group-role: worker
spec:
  containers:
  - name: job
    image: hello-world
    resources:
      requests:
        cpu: 1m
```

### Notes/Constraints/Caveats (Optional)

#### Skipping Pods belonging to queued objects

Pods owned by jobs managed by Kueue should not be subject to extra management.
These Pods can be identified based on the ownerReference. For these Pods:
- The webhook should not add a scheduling gate.
- The Pod reconciler should not create a corresponding Workload object.

Note that sometimes the Pods might not be directly owned by a known job object. Here are some
special cases:
- MPIJob: The launcher Pod is created through a batch/Job, which is also known to Kueue, so
  it's not an issue.
- JobSet: Also creates Jobs, so not problematic.
- RayJob: Pods are owned by a RayCluster object, which we don't currently support.
  This could be hardcoded as a known parent, or we could use label selectors for:

  ```yaml
  app.kubernetes.io/created-by: kuberay-operator
  app.kubernetes.io/name: kuberay
  ```

#### Pods replaced on failure

It is possible that users of plain Pods have a controller for them to handle failures and
re-creations. These Pods should be able to use the quota that was already assigned to the Workload.

Because Kueue can't know whether Pods will be recreated, it will hold the entirety of the
quota until it can determine that the whole Workload finished (all Pods are terminated).
In other words, Kueue won't support [dynamically reclaiming quota](https://github.com/kubernetes-sigs/kueue/issues/78)
for plain Pods.

#### Controllers creating too many Pods

Due to the declarative nature of Kubernetes, it is possible that controllers face race conditions
when creating Pods, leading to the accidental creation of more Pods than declared in the group
size, especially when reacting to failed Pods.

The Pod group reconciler will react to excess Pods by deleting the ones that were created last.

### Risks and Mitigations

#### Increased memory usage

In order to support plain Pods, we need to start watching all Pods, even if they are not supposed
to be managed by Kueue. This will increase the memory usage of Kueue just to maintain the
informers.

We can use the following mitigations:

1. Drop the unused managedFields field from the Pod spec, like kube-scheduler does
   (https://github.com/kubernetes/kubernetes/pull/119556).
2. Apply a selector in the informer to only keep the Pods that have the
   `kueue.x-k8s.io/managed: true` label.
3. Users can configure the webhook to only apply to certain namespaces. By default, the webhook
   won't apply to the kube-system and kueue-system namespaces.

#### Limited size for annotation values

[Story 3](#story-3) can be limited by the annotation size limit (256 KiB across all annotation
values). There isn't much we can do other than documenting the limitation. We can also suggest
that users only list the fields relevant to scheduling, as documented in
[Groups of Pods created beforehand](#groups-of-pods-created-beforehand):
- node affinity and selectors
- pod affinity
- tolerations
- topology spread constraints
- container requests
- pod overhead

## Design Details

### Gating Pod Scheduling

Pods subject to queueing should be prevented from scheduling until Kueue has admitted them in a
specific flavor.

Kubernetes 1.27 and newer provide the [scheduling readiness](https://kubernetes.io/docs/concepts/scheduling-eviction/pod-scheduling-readiness/)
mechanism to prevent kube-scheduler from assigning Nodes to Pods.

A Kueue webhook will inject the following into [Pods subject to queueing](#pods-subject-to-queueing):
- A scheduling gate `kueue.x-k8s.io/admission` to prevent the Pod from scheduling.
- A label `kueue.x-k8s.io/managed: true` so that users can easily identify Pods that are/were
  managed by Kueue.
- A finalizer `kueue.x-k8s.io/managed` in order to reliably track Pod terminations.

A Pod reconciler will be responsible for removing the `kueue.x-k8s.io/admission` gate. If the Pods
have other gates, they will remain Pending, but will be considered active from Kueue's perspective.
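For illustration, after the webhook mutation the Pod from [story 1](#story-1) would look roughly as
follows (a sketch; only the fields touched by the webhook are shown in addition to the user's input):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: foo
  namespace: pod-namespace
  labels:
    kueue.x-k8s.io/queue-name: user-queue
    kueue.x-k8s.io/managed: "true"
  finalizers:
  - kueue.x-k8s.io/managed
spec:
  schedulingGates:
  - name: kueue.x-k8s.io/admission
  containers:
  - name: job
    image: hello-world
    resources:
      requests:
        cpu: 1m
```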
#### Pods subject to queueing

Not all Pods in a cluster should be subject to queueing.
In particular, the following Pods should be excluded from getting the scheduling gate or label:

1. Pods owned by other job APIs managed by Kueue.

   They can be identified by the ownerReference, based on the list of enabled integrations.

   In some scenarios, users might have custom job objects that own Pods through an indirect object.
   In these cases, it might be simpler to identify the Pods through a label selector.

2. Pods belonging to specific namespaces (such as kube-system or kueue-system).

The namespaces and Pod selectors are defined in `Configuration.Integrations`.
For a Pod to qualify for queueing by Kueue, it needs to satisfy both the namespace and Pod selector.

```golang
type Integrations struct {
	Frameworks []string
	PodOptions *PodIntegrationOptions
}

type PodIntegrationOptions struct {
	NamespaceSelector *metav1.LabelSelector
	PodSelector       *metav1.LabelSelector
}
```

When empty, Kueue uses the following NamespaceSelector internally:

```yaml
matchExpressions:
- key: kubernetes.io/metadata.name
  operator: NotIn
  values: [kube-system, kueue-system]
```

### Constructing Workload objects

Once the webhook has marked Pods subject to queueing with the `kueue.x-k8s.io/managed: true` label,
the Pod reconciler can create the corresponding Workload object to feed the Kueue admission logic.

The Workload will be owned by all the Pods. Once all the Pods that own the Workload are deleted
(and their finalizers are removed), the Workload will be automatically cleaned up.

If individual Pods in the group fail and a replacement Pod comes in, the replacement Pod will be
added as an owner of the Workload as well.

#### Single Pods

The simplest case we want to support is single-Pod jobs. These Pods only have the label
`kueue.x-k8s.io/queue-name`, indicating the local queue where they will be queued.

The Workload for the Pod in [story 1](#story-1) would look as follows:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: Workload
metadata:
  name: pod-foo
  namespace: pod-namespace
spec:
  queueName: user-queue
  podSets:
  - count: 1
    name: main  # this name is irrelevant.
    template:
      spec:
        containers:
        - name: job
          image: hello-world
          resources:
            requests:
              cpu: 1m
```

#### Groups of Pods created beforehand

When the Pods of a group have different shapes, we need to group them into buckets of similar
specs in order to create a Workload object.

To fully identify the group of Pods, the Pods need the following:
- the label `kueue.x-k8s.io/pod-group-name`, as a unique identifier for the group. This should
  be a valid CRD name.
- the annotation `kueue.x-k8s.io/pod-group-total-count` to indicate how many Pods to expect in
  the group.

The Pod reconciler groups the Pods into similar buckets by only looking at the fields that are
relevant to admission, scheduling and/or autoscaling.
This list might need to be updated for Kubernetes versions that add new fields relevant to
scheduling. The fields to keep are:
- In `metadata`: `labels` (ignoring labels with the `kueue.x-k8s.io/` prefix)
- In `spec`:
  - In `initContainers` and `containers`: `image`, `requests` and `ports`.
  - `nodeSelector`
  - `affinity`
  - `tolerations`
  - `runtimeClassName`
  - `priority`
  - `preemptionPolicy`
  - `topologySpreadConstraints`
  - `overhead`
  - `resourceClaims`

Note that fields like `env` and `command` can sometimes change among the Pods of a group and
don't influence scheduling, so they are safe to skip. `volumes` can influence scheduling, but
they can be parameterized, like in StatefulSets, so we will ignore them for now.

A sha256 of the remaining Pod spec will be used as the name of a Workload podSet. The count for the
podSet will be the number of Pods that match the same sha256. The hash will be calculated by the
webhook and stored as an annotation: `kueue.x-k8s.io/role-hash`.

We can only build the Workload object once we observe the number of Pods defined by the
`kueue.x-k8s.io/pod-group-total-count` annotation.
If there are more Pending, Running or Succeeded Pods than the annotation declares, the reconciler
deletes the Pods with the highest `creationTimestamp` and removes their finalizers, prior to
creating the Workload object.
Similarly, once the group has been admitted, the reconciler will detect and delete any extra Pods
per role.

If Pods with the same `pod-group-name` have different values for the `pod-group-total-count`
annotation, the reconciler will not create a Workload object and it will emit an event for the Pod
indicating the reason.

The Workload for the Pods in [story 2](#story-2) would look as follows:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: Workload
metadata:
  name: pod-group
  namespace: pod-namespace
spec:
  queueName: user-queue
  podSets:
  - count: 1
    name: driver
    template:
      spec:
        containers:
        - name: job
          image: driver
          resources:
            requests:
              cpu: 1m
  - count: 2
    name: worker
    template:
      spec:
        containers:
        - name: job
          image: worker
          resources:
            requests:
              cpu: 1m
              vendor.com/gpu: 1
```

**Caveats:**

If the number of different sha256s obtained from the group of Pods is greater than 8,
Workload creation will fail.
This generally shouldn't be a problem, unless multiple Pods (that should be considered the same
from an admission perspective) have different label values or reference different volume claims.

Based on user feedback, we can consider excluding certain labels and volumes, or making this
configurable.

#### Groups of pods where driver generates workers

When most Pods of a group are only created after a subset of them start running, users need to
provide the shapes of the remaining Pods beforehand.

Users can provide the shapes of the remaining roles in an annotation
`kueue.x-k8s.io/pod-group-sets`, taking a yaml/json with the same structure as the Workload
podSets. The template for the initial Pods can be left empty, as it can be populated by Kueue.
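As a minimal sketch of this defaulting (hypothetical helper name `buildPodSets`; the actual
implementation may differ), the reconciler could reuse the annotated Pod's own spec for any role
whose template was left empty:

```golang
package pod

import (
	corev1 "k8s.io/api/core/v1"
	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// buildPodSets turns the decoded pod-group-sets annotation into Workload podSets,
// reusing the annotated Pod's own spec for any entry whose template was left empty.
func buildPodSets(p *corev1.Pod, sets []kueue.PodSet) []kueue.PodSet {
	for i := range sets {
		if len(sets[i].Template.Spec.Containers) == 0 {
			// An empty template means this role matches the spec of the annotated Pod.
			sets[i].Template.Spec = *p.Spec.DeepCopy()
		}
	}
	return sets
}
```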
The Workload for the Pods in [story 3](#story-3) would look as follows:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: Workload
metadata:
  name: pod-group
  namespace: pod-namespace
spec:
  queueName: user-queue
  podSets:
  - count: 1
    name: driver
    template:
      spec:
        containers:
        - name: job
          image: hello-world
          resources:
            requests:
              cpu: 1m
  - count: 10
    name: worker
    template:
      spec:
        containers:
        - name: job
          image: hello-world
          resources:
            requests:
              cpu: 1m
```

### Tracking admitted and finished Pods

Pods need to have finalizers so that we can reliably track how many of them run to completion and
be able to determine when the Workload is Finished.

The Pod reconciler will run in a "composable" mode: a mode where a Workload is composed of multiple
objects. The `jobframework.Reconciler` will be reworked to accommodate this.

After a Workload is admitted, each Pod that owns the Workload enters the reconciliation loop.
The reconciliation loop collects all the Pods that are not Failed and constructs an in-memory
Workload. If there is an existing Workload in the cache and it has smaller Pod counters than the
in-memory Workload, then it is considered a mismatch and the Workload is evicted.

In the Pod-group reconciler:
1. If the Pod is not terminated and doesn't have a deletionTimestamp,
   create a Workload for the Pod group if one does not exist.
2. Remove Pod finalizers if:
   - The Pod is terminated and the Workload is finished or has a deletion timestamp.
   - The Pod Failed and a valid replacement Pod was created for it.
3. Build the in-memory Workload. If its podSet counters are greater than those of the stored
   Workload, then evict the Workload.
4. For gated Pods:
   - remove the gate, set the nodeSelector.
5. If the number of succeeded Pods is equal to the admission count, mark the Workload as Finished
   and remove the finalizers from the Pods.

### Retrying Failed Pods

The Pod group will generally only be considered finished if all the Pods finish with a Succeeded
phase.
This allows the user to send replacement Pods when a Pod in the group fails or if the group is
preempted. The replacement Pods can have any name, but they must point to the same Pod group.
Once a replacement Pod is created and Kueue has added it as an owner of the Workload, the
Failed Pod will be finalized. If multiple Pods have Failed, a new Pod is assumed to replace
the Pod that failed first.

To declare that a group is failed, a user can execute one of the following actions:
1. Issue a Delete for the Workload object. The controller would terminate all running Pods and
   clean up Pod finalizers.
2. Add the annotation `kueue.x-k8s.io/retriable-in-group: false` to any Pod in the group.
   The annotation can be added to an existing Pod or set on creation.

Kueue will consider a group finished if there are no Running or Pending Pods, and at
least one terminated Pod (Failed or Succeeded) has the `retriable-in-group: false` annotation.

### Dynamically reclaiming Quota

Succeeded Pods will not be considered replaceable. In other words, the quota
from Succeeded Pods will be released by filling [reclaimablePods](https://kueue.sigs.k8s.io/docs/concepts/workload/#dynamic-reclaim)
in the Workload status.
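For example, if one of the worker Pods from [story 2](#story-2) succeeds while the rest of the
group keeps running, the reconciler could report something like the following in the Workload
status (a sketch of the relevant fields only):

```yaml
status:
  reclaimablePods:
  - name: worker
    count: 1
```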
### Metrics

In addition to the existing metrics for workloads, it could be beneficial to track gated and
ungated Pods:

- `pods_gated_total`: Tracks the number of Pods that get the scheduling gate.
- `pods_ungated_total`: Tracks the number of Pods that get the scheduling gate removed.
- `pods_rejected_total`: Tracks the number of Pods that were rejected because there was an excess
  number of Pods compared to the annotations.

### Test Plan

<!--
**Note:** *Not required until targeted at a release.*
The goal is to ensure that we don't accept enhancements with inadequate testing.

All code is expected to have adequate tests (eventually with coverage
expectations). Please adhere to the [Kubernetes testing guidelines][testing-guidelines]
when drafting this test plan.

[testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md
-->

[x] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

#### Prerequisite testing updates

The unit coverage of `workload_controller.go` needs significant improvement.

#### Unit Tests

Current coverage of packages that will be affected:

- `pkg/controller/jobframework/reconciler.go`: `2023-08-14` - `60.9%`
- `pkg/controller/core/workload_controller.go`: `2023-08-14` - `7%`
- `pkg/metrics`: `2023-08-14` - `97%`
- `main.go`: `2023-08-14` - `16.4%`

#### Integration tests

The integration tests should cover the following scenarios:

- Basic webhook test
- Single Pod queued, admitted and finished.
- Multiple Pods created beforehand:
  - queued and admitted
  - recreated failed Pods can use the same quota
  - group finished when all Pods finish Successfully
  - group finished when a Pod with the `retriable-in-group: false` annotation finishes
  - group preempted and resumed
  - excess Pods before admission: youngest Pods are deleted
  - excess Pods after admission: youngest Pods per role are deleted
- Driver Pod creates workers:
  - queued and admitted
  - worker Pods beyond the count are rejected (deleted)
  - Workload finished when all Pods finish
  - preemption deletes all Pods for the Workload

### Graduation Criteria

#### Beta

The feature will first be released with a Beta maturity level. The feature will not be guarded by a
feature gate. However, as opposed to the rest of the integrations, it will not be enabled by
default: users have to explicitly enable the Pod integration through the configuration API.

#### GA

The feature can graduate to GA after addressing feedback for at least 3 consecutive releases.

## Implementation History

<!--
Major milestones in the lifecycle of a KEP should be tracked in this section.
Major milestones might include:
- the `Summary` and `Motivation` sections being merged, signaling SIG acceptance
- the `Proposal` section being merged, signaling agreement on a proposed design
- the date implementation started
- the first Kubernetes release where an initial version of the KEP was available
- the version of Kubernetes where the KEP graduated to general availability
- when the KEP was retired or superseded
-->

- Sep 29th: Implemented single Pod support (story 1) [#1103](https://github.com/kubernetes-sigs/kueue/pulls/1103).
- Nov 24th: Implemented support for groups of Pods (story 2) [#1319](https://github.com/kubernetes-sigs/kueue/pulls/1319).

## Drawbacks

The proposed labels and annotations for groups of Pods can be complex to build manually.
However, we expect that a job dispatcher or client would create the Pods, not end users directly.

For more complex scenarios, users should consider using a CRD to manage their Pods and integrate
the CRD with Kueue.

## Alternatives

### Users create a Workload object beforehand

An alternative to the multiple annotations in the Pods would be for users to create a Workload
object before creating the Pods. The Pods would then only need one annotation referencing the
Workload name.

While this would be a clean approach, this proposal targets users that don't have a CRD
wrapping their Pods, and adding one would be a bigger effort than adding annotations. That effort
could be comparable to migrating from plain Pods to the Job API, which is already supported.

We could reconsider this based on user feedback.