# KEP-976: Plain Pods

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories (Optional)](#user-stories-optional)
    - [Story 1](#story-1)
    - [Story 2](#story-2)
    - [Story 3](#story-3)
  - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
    - [Skipping Pods belonging to queued objects](#skipping-pods-belonging-to-queued-objects)
    - [Pods replaced on failure](#pods-replaced-on-failure)
    - [Controllers creating too many Pods](#controllers-creating-too-many-pods)
  - [Risks and Mitigations](#risks-and-mitigations)
    - [Increased memory usage](#increased-memory-usage)
    - [Limited size for annotation values](#limited-size-for-annotation-values)
- [Design Details](#design-details)
  - [Gating Pod Scheduling](#gating-pod-scheduling)
    - [Pods subject to queueing](#pods-subject-to-queueing)
  - [Constructing Workload objects](#constructing-workload-objects)
    - [Single Pods](#single-pods)
    - [Groups of Pods created beforehand](#groups-of-pods-created-beforehand)
    - [Groups of pods where driver generates workers](#groups-of-pods-where-driver-generates-workers)
  - [Tracking admitted and finished Pods](#tracking-admitted-and-finished-pods)
  - [Retrying Failed Pods](#retrying-failed-pods)
  - [Dynamically reclaiming Quota](#dynamically-reclaiming-quota)
  - [Metrics](#metrics)
  - [Test Plan](#test-plan)
    - [Prerequisite testing updates](#prerequisite-testing-updates)
    - [Unit Tests](#unit-tests)
    - [Integration tests](#integration-tests)
  - [Graduation Criteria](#graduation-criteria)
    - [Beta](#beta)
    - [GA](#ga)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
  - [Users create a Workload object beforehand](#users-create-a-workload-object-beforehand)
<!-- /toc -->

## Summary

Some batch applications create plain Pods directly, as opposed to managing the Pods through the Job
API or a CRD that supports [suspend](https://kubernetes.io/docs/concepts/workloads/controllers/job/#suspending-a-job) semantics.
This KEP proposes mechanisms to queue plain Pods through Kueue, individually or in groups,
leveraging [pod scheduling gates](https://kubernetes.io/docs/concepts/scheduling-eviction/pod-scheduling-readiness/).

## Motivation

Some batch systems or AI/ML frameworks create plain Pods to represent jobs or tasks of a job.
Currently, Kueue relies on the Job API or CRDs that support suspend semantics
to control whether the Pods of a job can exist and can be scheduled to Nodes.

While it is sometimes possible to wrap Pods in a CRD or migrate to the Job API, doing so could be
costly for framework or platform developers.
In some scenarios, the framework doesn't know how many Pods belong to a single job. In more extreme
cases, Pods are created dynamically once the first Pod starts running. These are sometimes known
as elastic jobs.

Pod scheduling gates, a recent enhancement to Kubernetes introduced as Alpha in 1.26 and as Beta in
1.27, allow an external controller to prevent kube-scheduler from scheduling Pods. Kueue can make
use of this API to implement queueing semantics for Pods.

### Goals

- Support queueing of individual Pods.
- Support queueing of groups of Pods of fixed size, identified by a common label.
- Opt Pods from specific namespaces into or out of queueing.
- Support for [dynamic reclaiming of quota](https://kueue.sigs.k8s.io/docs/concepts/workload/#dynamic-reclaim)
  for succeeded Pods.

### Non-Goals

- Support for [partial-admission](https://github.com/kubernetes-sigs/kueue/issues/420).

Since all Pods are already created, an implementation of partial admission would imply the
deletion of some Pods. It is not clear whether this matches users' expectations, as opposed to
support for elastic groups.

- Support elastic groups of Pods, where the number of Pods changes after the job started.

While these jobs are one of the motivations for this KEP, the current proposal doesn't support
them. They can be addressed in follow-up KEPs.

- Support for advanced Pod retry policies.

Kueue shouldn't re-implement core functionality that is already available in the Job API.
In particular, Kueue does not re-create failed Pods.
More specifically, in the case of re-admission after preemption, it does not
re-create the Pods it deleted.

- Tracking usage of Pods that were not queued through Kueue.

## Proposal

<!--
This is where we get down to the specifics of what the proposal actually is.
This should have enough detail that reviewers can understand exactly what
you're proposing, but should not include things like API designs or
implementation. What is the desired outcome and how do we measure success?.
The "Design Details" section below is for the real
nitty-gritty.
-->

### User Stories (Optional)

#### Story 1

As a platform developer, I can queue plain Pods, or Pods owned by an object not integrated with
Kueue, by simply adding a queue name to the Pods through a label.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: foo
  namespace: pod-namespace
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  containers:
  - name: job
    image: hello-world
    resources:
      requests:
        cpu: 1m
```

#### Story 2

As a platform developer, I can queue groups of Pods that might or might not have the same shape
(Pod specs).
In addition to the queue name, I can specify how many Pods belong to the group.
The Pods of a job following a driver-workers paradigm would look as follows:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: job-driver
  labels:
    kueue.x-k8s.io/queue-name: user-queue
    kueue.x-k8s.io/pod-group-name: pod-group
  annotations:
    kueue.x-k8s.io/pod-group-total-count: "3"
spec:
  containers:
  - name: job
    image: driver
    resources:
      requests:
        cpu: 1m
---
apiVersion: v1
kind: Pod
metadata:
  name: job-worker-0
  labels:
    kueue.x-k8s.io/queue-name: user-queue
    kueue.x-k8s.io/pod-group-name: pod-group
  annotations:
    kueue.x-k8s.io/pod-group-total-count: "3"
spec:
  containers:
  - name: job
    image: worker
    args: ["--index", "0"]
    resources:
      requests:
        cpu: 1m
        vendor.com/gpu: 1
---
apiVersion: v1
kind: Pod
metadata:
  name: job-worker-1
  labels:
    kueue.x-k8s.io/queue-name: user-queue
    kueue.x-k8s.io/pod-group-name: pod-group
  annotations:
    kueue.x-k8s.io/pod-group-total-count: "3"
spec:
  containers:
  - name: job
    image: worker
    args: ["--index", "1"]
    resources:
      requests:
        cpu: 1m
        vendor.com/gpu: 1
```

#### Story 3

Motivation: in frameworks like Spark, worker Pods are only created by the driver Pod after it
starts running. As such, the worker Pod specs cannot be predicted beforehand. Even though the job
could be considered elastic, users generally wouldn't want to start a Spark driver if no workers
would fit.

As a Spark user, I can queue the driver Pod while providing the expected shape of the worker Pods.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: job-driver
  labels:
    kueue.x-k8s.io/queue-name: user-queue
    kueue.x-k8s.io/pod-group-name: pod-group
  annotations:
    # If the template is left empty, it means that it will match the spec of this pod.
    kueue.x-k8s.io/pod-group-sets: |-
      [
        {
          name: driver,
          count: 1,
        },
        {
          name: workers,
          count: 10,
          template:
            spec:
              containers:
              - name: worker
                resources:
                  requests:
                    cpu: 1m
        }
      ]
spec:
  containers:
  - name: job
    image: hello-world
    resources:
      requests:
        cpu: 1m
---
apiVersion: v1
kind: Pod
metadata:
  name: job-worker-1
  labels:
    kueue.x-k8s.io/queue-name: user-queue
    kueue.x-k8s.io/pod-group-name: pod-group
    kueue.x-k8s.io/pod-group-role: worker
spec:
  containers:
  - name: job
    image: hello-world
    resources:
      requests:
        cpu: 1m
```

### Notes/Constraints/Caveats (Optional)

#### Skipping Pods belonging to queued objects

Pods owned by jobs managed by Kueue should not be subject to extra management.
These Pods can be identified based on the ownerReference. For these Pods:
- The webhook should not add a scheduling gate.
- The Pod reconciler should not create a corresponding Workload object.

Note that sometimes the Pods might not be directly owned by a known job object. Here are some
special cases:
- MPIJob: The launcher Pod is created through a batch/Job, which is also known to Kueue, so
  it's not an issue.
- JobSet: Also creates Jobs, so not problematic.
- RayJob: Pods are owned by a RayCluster object, which we don't currently support.
  This could be hardcoded as a known parent, or we could use label selectors for:

  ```yaml
  app.kubernetes.io/created-by: kuberay-operator
  app.kubernetes.io/name: kuberay
  ```

#### Pods replaced on failure

It is possible that users of plain Pods have a controller for them to handle failures and
re-creations. These Pods should be able to use the quota that was already assigned to the Workload.

Because Kueue can't know whether Pods will be recreated, it will hold the entirety of the
quota until it can determine that the whole Workload finished (all Pods are terminated).
In other words, Kueue won't support [dynamically reclaiming quota](https://github.com/kubernetes-sigs/kueue/issues/78)
for plain Pods.

#### Controllers creating too many Pods

Due to the declarative nature of Kubernetes, it is possible that controllers face race conditions
when creating Pods, leading to the accidental creation of more Pods than declared in the group
size, especially when reacting to failed Pods.

The Pod group reconciler will react to excess Pods by deleting the ones that were created last.

### Risks and Mitigations

#### Increased memory usage

In order to support plain Pods, we need to start watching all Pods, even if they are not supposed
to be managed by Kueue. This will increase the memory usage of Kueue just to maintain the
informers.

We can use the following mitigations:

1. Drop the unused managedFields field from the Pod spec, like kube-scheduler does
   (https://github.com/kubernetes/kubernetes/pull/119556).
2. Apply a selector in the informer to only keep the Pods that have the
   `kueue.x-k8s.io/managed: true` label.
3. Users can configure the webhook to only apply to certain namespaces. By default, the webhook
   won't apply to the kube-system and kueue-system namespaces.

#### Limited size for annotation values

[Story 3](#story-3) can be limited by the annotation size limit (256 KiB across all annotation
values). There isn't much we can do other than documenting the limitation. We can also suggest
that users only list the fields relevant to scheduling, as documented in
[Groups of Pods created beforehand](#groups-of-pods-created-beforehand):
- node affinity and selectors
- pod affinity
- tolerations
- topology spread constraints
- container requests
- pod overhead

## Design Details

### Gating Pod Scheduling

Pods subject to queueing should be prevented from scheduling until Kueue has admitted them in a
specific flavor.

Kubernetes 1.27 and newer provide the [scheduling readiness](https://kubernetes.io/docs/concepts/scheduling-eviction/pod-scheduling-readiness/)
mechanism to prevent kube-scheduler from assigning Nodes to Pods.

A Kueue webhook will inject the following into [Pods subject to queueing](#pods-subject-to-queueing):
- A scheduling gate `kueue.x-k8s.io/admission` to prevent the Pod from scheduling.
- A label `kueue.x-k8s.io/managed: true` so that users can easily identify Pods that are/were
  managed by Kueue.
- A finalizer `kueue.x-k8s.io/managed` in order to reliably track Pod terminations.

A Pod reconciler will be responsible for removing the `kueue.x-k8s.io/admission` gate. If the Pods
have other gates, they will remain Pending, but will be considered active from Kueue's perspective.
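For illustration, after the webhook mutation the Pod from [story 1](#story-1) would look roughly as
follows (a sketch; only the fields touched by the webhook are shown in addition to the user's input):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: foo
  namespace: pod-namespace
  labels:
    kueue.x-k8s.io/queue-name: user-queue
    kueue.x-k8s.io/managed: "true"
  finalizers:
  - kueue.x-k8s.io/managed
spec:
  schedulingGates:
  - name: kueue.x-k8s.io/admission
  containers:
  - name: job
    image: hello-world
    resources:
      requests:
        cpu: 1m
```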
#### Pods subject to queueing

Not all Pods in a cluster should be subject to queueing.
In particular, the following Pods should be excluded from getting the scheduling gate or label:

1. Pods owned by other job APIs managed by Kueue.

   They can be identified by the ownerReference, based on the list of enabled integrations.

   In some scenarios, users might have custom job objects that own Pods through an indirect object.
   In these cases, it might be simpler to identify the Pods through a label selector.

2. Pods belonging to specific namespaces (such as kube-system or kueue-system).

The namespaces and Pod selectors are defined in `Configuration.Integrations`.
For a Pod to qualify for queueing by Kueue, it needs to satisfy both the namespace and Pod selector.

```golang
type Integrations struct {
	Frameworks []string
	PodOptions *PodIntegrationOptions
}

type PodIntegrationOptions struct {
	NamespaceSelector *metav1.LabelSelector
	PodSelector       *metav1.LabelSelector
}
```

When empty, Kueue uses the following NamespaceSelector internally:

```yaml
matchExpressions:
- key: kubernetes.io/metadata.name
  operator: NotIn
  values: [kube-system, kueue-system]
```

### Constructing Workload objects

Once the webhook has marked Pods subject to queueing with the `kueue.x-k8s.io/managed: true` label,
the Pod reconciler can create the corresponding Workload object to feed the Kueue admission logic.

The Workload will be owned by all the Pods. Once all the Pods that own the Workload are deleted
(and their finalizers are removed), the Workload will be automatically cleaned up.

If individual Pods in the group fail and a replacement Pod comes in, the replacement Pod will be
added as an owner of the Workload as well.

#### Single Pods

The simplest case we want to support is single-Pod jobs. These Pods only have the label
`kueue.x-k8s.io/queue-name`, indicating the local queue where they will be queued.

The Workload for the Pod in [story 1](#story-1) would look as follows:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: Workload
metadata:
  name: pod-foo
  namespace: pod-namespace
spec:
  queueName: user-queue
  podSets:
  - count: 1
    name: main  # this name is irrelevant.
    template:
      spec:
        containers:
        - name: job
          image: hello-world
          resources:
            requests:
              cpu: 1m
```

#### Groups of Pods created beforehand

When the Pods of a group have different shapes, we need to group them into buckets of similar
specs in order to create a Workload object.

To fully identify the group of Pods, the Pods need the following:
- the label `kueue.x-k8s.io/pod-group-name`, as a unique identifier for the group. This should
  be a valid CRD name.
- the annotation `kueue.x-k8s.io/pod-group-total-count` to indicate how many Pods to expect in
  the group.

The Pod reconciler groups the Pods into similar buckets by only looking at the fields that are
relevant to admission, scheduling and/or autoscaling.
This list might need to be updated for Kubernetes versions that add new fields relevant to
scheduling. The fields to keep are:
- In `metadata`: `labels` (ignoring labels with the `kueue.x-k8s.io/` prefix)
- In `spec`:
  - In `initContainers` and `containers`: `image`, `requests` and `ports`.
  - `nodeSelector`
  - `affinity`
  - `tolerations`
  - `runtimeClassName`
  - `priority`
  - `preemptionPolicy`
  - `topologySpreadConstraints`
  - `overhead`
  - `resourceClaims`

Note that fields like `env` and `command` can sometimes change among the Pods of a group and
don't influence scheduling, so they are safe to skip. `volumes` can influence scheduling, but
they can be parameterized, like in StatefulSets, so we will ignore them for now.

A sha256 of the remaining Pod spec will be used as the name of a Workload podSet. The count for the
podSet will be the number of Pods that match the same sha256. The hash will be calculated by the
webhook and stored as an annotation: `kueue.x-k8s.io/role-hash`.

We can only build the Workload object once we observe the number of Pods defined by the
`kueue.x-k8s.io/pod-group-total-count` annotation.
If there are more Pending, Running or Succeeded Pods than the annotation declares, the reconciler
deletes the Pods with the highest `creationTimestamp` and removes their finalizers, prior to
creating the Workload object.
Similarly, once the group has been admitted, the reconciler will detect and delete any extra Pods
per role.

If Pods with the same `pod-group-name` have different values for the `pod-group-total-count`
annotation, the reconciler will not create a Workload object and it will emit an event for the Pod
indicating the reason.

The Workload for the Pods in [story 2](#story-2) would look as follows:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: Workload
metadata:
  name: pod-group
  namespace: pod-namespace
spec:
  queueName: user-queue
  podSets:
  - count: 1
    name: driver
    template:
      spec:
        containers:
        - name: job
          image: driver
          resources:
            requests:
              cpu: 1m
  - count: 2
    name: worker
    template:
      spec:
        containers:
        - name: job
          image: worker
          resources:
            requests:
              cpu: 1m
              vendor.com/gpu: 1
```

**Caveats:**

If the number of different sha256s obtained from the group of Pods is greater than 8,
Workload creation will fail.
This generally shouldn't be a problem, unless multiple Pods (that should be considered the same
from an admission perspective) have different label values or reference different volume claims.

Based on user feedback, we can consider excluding certain labels and volumes, or making this
configurable.

#### Groups of pods where driver generates workers

When most Pods of a group are only created after a subset of them start running, users need to
provide the shapes of the remaining Pods beforehand.

Users can provide the shapes of the remaining roles in an annotation
`kueue.x-k8s.io/pod-group-sets`, taking a yaml/json with the same structure as the Workload
podSets. The template for the initial Pods can be left empty, as it can be populated by Kueue.
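As a minimal sketch of this defaulting (hypothetical helper name `buildPodSets`; the actual
implementation may differ), the reconciler could reuse the annotated Pod's own spec for any role
whose template was left empty:

```golang
package pod

import (
	corev1 "k8s.io/api/core/v1"
	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// buildPodSets turns the decoded pod-group-sets annotation into Workload podSets,
// reusing the annotated Pod's own spec for any entry whose template was left empty.
func buildPodSets(p *corev1.Pod, sets []kueue.PodSet) []kueue.PodSet {
	for i := range sets {
		if len(sets[i].Template.Spec.Containers) == 0 {
			// An empty template means this role matches the spec of the annotated Pod.
			sets[i].Template.Spec = *p.Spec.DeepCopy()
		}
	}
	return sets
}
```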
The Workload for the Pods in [story 3](#story-3) would look as follows:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: Workload
metadata:
  name: pod-group
  namespace: pod-namespace
spec:
  queueName: user-queue
  podSets:
  - count: 1
    name: driver
    template:
      spec:
        containers:
        - name: job
          image: hello-world
          resources:
            requests:
              cpu: 1m
  - count: 10
    name: worker
    template:
      spec:
        containers:
        - name: job
          image: hello-world
          resources:
            requests:
              cpu: 1m
```

### Tracking admitted and finished Pods

Pods need to have finalizers so that we can reliably track how many of them run to completion and
be able to determine when the Workload is Finished.

The Pod reconciler will run in a "composable" mode: a mode where a Workload is composed of multiple
objects. The `jobframework.Reconciler` will be reworked to accommodate this.

After a Workload is admitted, each Pod that owns the Workload enters the reconciliation loop.
The reconciliation loop collects all the Pods that are not Failed and constructs an in-memory
Workload. If there is an existing Workload in the cache and it has smaller Pod counters than the
in-memory Workload, then it is considered a mismatch and the Workload is evicted.

In the Pod-group reconciler:
1. If the Pod is not terminated and doesn't have a deletionTimestamp,
   create a Workload for the Pod group if one does not exist.
2. Remove Pod finalizers if:
   - The Pod is terminated and the Workload is finished or has a deletion timestamp.
   - The Pod Failed and a valid replacement Pod was created for it.
3. Build the in-memory Workload. If its podSet counters are greater than those of the stored
   Workload, then evict the Workload.
4. For gated Pods:
   - remove the gate, set the nodeSelector.
5. If the number of succeeded Pods is equal to the admission count, mark the Workload as Finished
   and remove the finalizers from the Pods.

### Retrying Failed Pods

The Pod group will generally only be considered finished if all the Pods finish with a Succeeded
phase.
This allows the user to send replacement Pods when a Pod in the group fails or if the group is
preempted. The replacement Pods can have any name, but they must point to the same Pod group.
Once a replacement Pod is created and Kueue has added it as an owner of the Workload, the
Failed Pod will be finalized. If multiple Pods have Failed, a new Pod is assumed to replace
the Pod that failed first.

To declare that a group is failed, a user can execute one of the following actions:
1. Issue a Delete for the Workload object. The controller would terminate all running Pods and
   clean up Pod finalizers.
2. Add the annotation `kueue.x-k8s.io/retriable-in-group: false` to any Pod in the group.
   The annotation can be added to an existing Pod or set on creation.

Kueue will consider a group finished if there are no Running or Pending Pods, and at
least one terminated Pod (Failed or Succeeded) has the `retriable-in-group: false` annotation.

### Dynamically reclaiming Quota

Succeeded Pods will not be considered replaceable. In other words, the quota
from Succeeded Pods will be released by filling [reclaimablePods](https://kueue.sigs.k8s.io/docs/concepts/workload/#dynamic-reclaim)
in the Workload status.
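For example, if one of the worker Pods from [story 2](#story-2) succeeds while the rest of the
group keeps running, the reconciler could report something like the following in the Workload
status (a sketch of the relevant fields only):

```yaml
status:
  reclaimablePods:
  - name: worker
    count: 1
```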
### Metrics

In addition to the existing metrics for workloads, it could be beneficial to track gated and
ungated Pods:

- `pods_gated_total`: Tracks the number of Pods that get the scheduling gate.
- `pods_ungated_total`: Tracks the number of Pods that get the scheduling gate removed.
- `pods_rejected_total`: Tracks the number of Pods that were rejected because there was an excess
  number of Pods compared to the annotations.

### Test Plan

<!--
**Note:** *Not required until targeted at a release.*
The goal is to ensure that we don't accept enhancements with inadequate testing.

All code is expected to have adequate tests (eventually with coverage
expectations). Please adhere to the [Kubernetes testing guidelines][testing-guidelines]
when drafting this test plan.

[testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md
-->

[x] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

#### Prerequisite testing updates

The unit coverage of `workload_controller.go` needs significant improvement.

#### Unit Tests

Current coverage of packages that will be affected:

- `pkg/controller/jobframework/reconciler.go`: `2023-08-14` - `60.9%`
- `pkg/controller/core/workload_controller.go`: `2023-08-14` - `7%`
- `pkg/metrics`: `2023-08-14` - `97%`
- `main.go`: `2023-08-14` - `16.4%`

#### Integration tests

The integration tests should cover the following scenarios:

- Basic webhook test
- Single Pod queued, admitted and finished.
- Multiple Pods created beforehand:
  - queued and admitted
  - recreated failed Pods can use the same quota
  - group finished when all Pods finish Successfully
  - group finished when a Pod with the `retriable-in-group: false` annotation finishes
  - group preempted and resumed
  - excess Pods before admission: youngest Pods are deleted
  - excess Pods after admission: youngest Pods per role are deleted
- Driver Pod creates workers:
  - queued and admitted
  - worker Pods beyond the count are rejected (deleted)
  - Workload finished when all Pods finish
  - preemption deletes all Pods for the Workload

### Graduation Criteria

#### Beta

The feature will first be released with a Beta maturity level. The feature will not be guarded by a
feature gate. However, as opposed to the rest of the integrations, it will not be enabled by
default: users have to explicitly enable the Pod integration through the configuration API.

#### GA

The feature can graduate to GA after addressing feedback for at least 3 consecutive releases.

## Implementation History

<!--
Major milestones in the lifecycle of a KEP should be tracked in this section.
Major milestones might include:
- the `Summary` and `Motivation` sections being merged, signaling SIG acceptance
- the `Proposal` section being merged, signaling agreement on a proposed design
- the date implementation started
- the first Kubernetes release where an initial version of the KEP was available
- the version of Kubernetes where the KEP graduated to general availability
- when the KEP was retired or superseded
-->

- Sep 29th: Implemented single Pod support (story 1) [#1103](https://github.com/kubernetes-sigs/kueue/pulls/1103).
- Nov 24th: Implemented support for groups of Pods (story 2) [#1319](https://github.com/kubernetes-sigs/kueue/pulls/1319).

## Drawbacks

The proposed labels and annotations for groups of Pods can be complex to build manually.
However, we expect that a job dispatcher or client would create the Pods, not end users directly.

For more complex scenarios, users should consider using a CRD to manage their Pods and integrate
the CRD with Kueue.

## Alternatives

### Users create a Workload object beforehand

An alternative to the multiple annotations in the Pods would be for users to create a Workload
object before creating the Pods. The Pods would then only need one annotation referencing the
Workload name.

While this would be a clean approach, this proposal targets users that don't have a CRD
wrapping their Pods, and adding one would be a bigger effort than adding annotations. That effort
could be comparable to migrating from plain Pods to the Job API, which is already supported.

We could reconsider this based on user feedback.