sigs.k8s.io/kueue@v0.6.2/keps/369-job-interface/README.md (about)

# KEP-369: Add job interface to help integrate Kueue with other job-like applications
     2  
     3  <!--
     4  This is the title of your KEP. Keep it short, simple, and descriptive. A good
     5  title can help communicate what the KEP is and should be considered as part of
     6  any review.
     7  -->
     8  
     9  <!--
    10  A table of contents is helpful for quickly jumping to sections of a KEP and for
    11  highlighting any additional information provided beyond the standard KEP
    12  template.
    13  
    14  Ensure the TOC is wrapped with
  <code>&lt;!-- toc --&gt;&lt;!-- /toc --&gt;</code>
    16  tags, and then generate with `hack/update-toc.sh`.
    17  -->
    18  
    19  <!-- toc -->
    20  - [Summary](#summary)
    21  - [Motivation](#motivation)
    22    - [Goals](#goals)
    23    - [Non-Goals](#non-goals)
    24  - [Proposal](#proposal)
    25    - [User Stories (Optional)](#user-stories-optional)
    26      - [Story 1](#story-1)
    27    - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
    28    - [Risks and Mitigations](#risks-and-mitigations)
    29  - [Design Details](#design-details)
    30    - [Job Interface](#job-interface)
    31    - [Kubernetes Job implementation details](#kubernetes-job-implementation-details)
    32    - [Working Flow in pseudo-code](#working-flow-in-pseudo-code)
    33    - [Test Plan](#test-plan)
    34        - [Prerequisite testing updates](#prerequisite-testing-updates)
    35      - [Unit Tests](#unit-tests)
    36      - [Integration tests](#integration-tests)
    37    - [Graduation Criteria](#graduation-criteria)
    38  - [Implementation History](#implementation-history)
    39  - [Drawbacks](#drawbacks)
    40  - [Alternatives](#alternatives)
    41  <!-- /toc -->
    42  
    43  ## Summary
    44  
    45  <!--
    46  This section is incredibly important for producing high-quality, user-focused
    47  documentation such as release notes or a development roadmap. It should be
    48  possible to collect this information before implementation begins, in order to
    49  avoid requiring implementors to split their attention between writing release
    50  notes and implementing the feature itself. KEP editors and SIG Docs
    51  should help to ensure that the tone and content of the `Summary` section is
    52  useful for a wide audience.
    53  
    54  A good summary is probably at least a paragraph in length.
    55  
    56  Both in this section and below, follow the guidelines of the [documentation
    57  style guide]. In particular, wrap lines to a reasonable length, to make it
    58  easier for reviewers to cite specific portions, and to minimize diff churn on
    59  updates.
    60  
    61  [documentation style guide]: https://github.com/kubernetes/community/blob/master/contributors/guide/style-guide.md
    62  -->
    63  
Define a set of methods (a Go interface) that describes the default behaviors of a job.
In addition, provide a full controller and one example implementation (based on the Kubernetes Job) for developers
to follow when building their own controllers.
This will help the community integrate with Kueue more easily.
    68  
    69  ## Motivation
    70  
    71  <!--
    72  This section is for explicitly listing the motivation, goals, and non-goals of
    73  this KEP.  Describe why the change is important and the benefits to users. The
    74  motivation section can optionally provide links to [experience reports] to
    75  demonstrate the interest in a KEP within the wider Kubernetes community.
    76  
    77  [experience reports]: https://github.com/golang/go/wiki/ExperienceReports
    78  -->
    79  
Since day 0, Kueue has natively supported the Kubernetes Job by leveraging its [_suspend_](https://kubernetes.io/docs/concepts/workloads/controllers/job/#suspending-a-job) capability.
This lets us build a multi-tenant job queueing system in Kubernetes, which is attractive to other job-like applications,
such as MPIJob, that lack queueing capabilities.
    83  
The good news is that Kueue is extensible and simple to integrate with through an intermediate object, which Kueue calls _Workload_.
All we need to do is build a controller that reconciles the workload with the job itself.
    86  
The difficulty is that developers who are familiar with job-like applications may have little knowledge of Kueue's
implementation details and may not know where to start when building the controller. If we provide an
interface that defines the default behaviors of a job, and offer the Kubernetes Job integration as a standard template,
it will do them a great favor.
    91  
    92  ### Goals
    93  
    94  <!--
    95  List the specific goals of the KEP. What is it trying to achieve? How will we
    96  know that this has succeeded?
    97  -->
    98  
    99  - Define an interface which shapes the default behaviors of Job
   100  - Provide a full controller implementation which can be reused for different jobs
   101  - Make Kubernetes Job an implementation template of the interface
   102  
   103  ### Non-Goals
   104  
   105  <!--
   106  What is out of scope for this KEP? Listing non-goals helps to focus discussion
   107  and make progress.
   108  -->
   109  
   110  - Integrate any job-like applications
   111  
   112  ## Proposal
   113  
   114  <!--
   115  This is where we get down to the specifics of what the proposal actually is.
   116  This should have enough detail that reviewers can understand exactly what
   117  you're proposing, but should not include things like API designs or
   118  implementation. What is the desired outcome and how do we measure success?.
   119  The "Design Details" section below is for the real
   120  nitty-gritty.
   121  -->
   122  
   123  ### User Stories (Optional)
   124  
   125  <!--
   126  Detail the things that people will be able to do if this KEP is implemented.
   127  Include as much detail as possible so that people can understand the "how" of
   128  the system. The goal here is to make this feel real for users without getting
   129  bogged down.
   130  -->
   131  
   132  #### Story 1
   133  
   134  We collected feedback from the community about how to fully integrate Kueue with MPIJob,
   135  see [#499](https://github.com/kubernetes-sigs/kueue/issues/499).
   136  
   137  ### Notes/Constraints/Caveats (Optional)
   138  
   139  <!--
   140  What are the caveats to the proposal?
   141  What are some important details that didn't come across above?
   142  Go in to as much detail as necessary here.
   143  This might be a good place to talk about core concepts and how they relate.
   144  -->
   145  
- The job interface defined here is a guide for developers building their own controllers.
It is a hard constraint if they wish to use the provided controller, but they can always write their controllers from scratch.
- This will increase code complexity, because the original Jobs are wrapped.
   149  
   150  ### Risks and Mitigations
   151  
   152  <!--
   153  What are the risks of this proposal, and how do we mitigate? Think broadly.
   154  For example, consider both security and how this will impact the larger
   155  Kubernetes ecosystem.
   156  
   157  How will security be reviewed, and by whom?
   158  
   159  How will UX be reviewed, and by whom?
   160  
   161  Consider including folks who also work outside the SIG or subproject.
   162  -->
   163  
Providing a full controller may cause the interface to change more frequently.
We can keep the interface as small as possible to mitigate this.
   166  
   167  ## Design Details
   168  
   169  <!--
   170  This section should contain enough information that the specifics of your
   171  change are understandable. This may include API specs (though not always
   172  required) or even code snippets. If there's any ambiguity about HOW your
   173  proposal will be implemented, this is the place to discuss them.
   174  -->
   175  
   176  ### Job Interface
   177  
We will define a new interface named `GenericJob`, which should be implemented by custom job-like applications:
   179  
```golang
type GenericJob interface {
  // Object returns the job instance.
  Object() client.Object
  // IsSuspended returns whether the job is suspended or not.
  IsSuspended() bool
  // Suspend will suspend the job.
  Suspend() error
  // UnSuspend will unsuspend the job.
  UnSuspend() error
  // InjectNodeSelectors will inject the node selectors extracted from the workload into the job.
  InjectNodeSelectors(nodeSelectors []map[string]string) error
  // RestoreNodeSelectors will restore the original node selectors of the job.
  RestoreNodeSelectors(nodeSelectors []map[string]string) error
  // Finished reports whether the job has completed or failed.
  Finished() (finished bool)
  // PodSets returns the podSets corresponding to the job.
  PodSets() []kueue.PodSet
  // EquivalentToWorkload validates whether the workload is semantically equal to the job.
  EquivalentToWorkload(wl kueue.Workload) bool
  // PriorityClass returns the job's priority class name.
  PriorityClass() string
  // QueueName returns the name of the queue the job is enqueued in.
  QueueName() string
  // IgnoreFromQueueing returns whether the job should be ignored in queueing, e.g. when it lacks a queueName.
  IgnoreFromQueueing() bool
  // PodsReady reports whether all pods derived from the job are ready.
  PodsReady() bool
}
```
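
As a rough illustration of the `InjectNodeSelectors`/`RestoreNodeSelectors` pair, the sketch below uses a stand-in job type instead of a real Kubernetes object. It assumes one selector map per pod set, that injection merges the admitted selectors into the existing ones, and that restoration overwrites them with the originals. The `toyJob` type, the `checkLen` helper and these semantics are assumptions made for illustration; they are not defined by this KEP.

```golang
package main

import "fmt"

// toyJob is a hypothetical stand-in for a job-like object. It is not a Kueue
// or Kubernetes type. Each entry holds the node selector of one pod set.
type toyJob struct {
	nodeSelectors []map[string]string
}

// checkLen verifies that one selector map was supplied per pod set.
func (j *toyJob) checkLen(nodeSelectors []map[string]string) error {
	if len(nodeSelectors) != len(j.nodeSelectors) {
		return fmt.Errorf("expected %d selector maps, got %d", len(j.nodeSelectors), len(nodeSelectors))
	}
	return nil
}

// InjectNodeSelectors merges the given selectors into the job (assumed semantics).
func (j *toyJob) InjectNodeSelectors(nodeSelectors []map[string]string) error {
	if err := j.checkLen(nodeSelectors); err != nil {
		return err
	}
	for i, sel := range nodeSelectors {
		if j.nodeSelectors[i] == nil {
			j.nodeSelectors[i] = map[string]string{}
		}
		for k, v := range sel {
			j.nodeSelectors[i][k] = v
		}
	}
	return nil
}

// RestoreNodeSelectors replaces the job's selectors with copies of the
// original ones (assumed semantics).
func (j *toyJob) RestoreNodeSelectors(nodeSelectors []map[string]string) error {
	if err := j.checkLen(nodeSelectors); err != nil {
		return err
	}
	for i, sel := range nodeSelectors {
		restored := make(map[string]string, len(sel))
		for k, v := range sel {
			restored[k] = v
		}
		j.nodeSelectors[i] = restored
	}
	return nil
}

func main() {
	original := []map[string]string{{"disk": "ssd"}}
	job := &toyJob{nodeSelectors: []map[string]string{{"disk": "ssd"}}}

	// On admission, add the selectors chosen for the workload.
	if err := job.InjectNodeSelectors([]map[string]string{{"zone": "a"}}); err != nil {
		fmt.Println("inject failed:", err)
	}
	fmt.Println(job.nodeSelectors[0]) // map[disk:ssd zone:a]

	// On suspension, put the original selectors back.
	if err := job.RestoreNodeSelectors(original); err != nil {
		fmt.Println("restore failed:", err)
	}
	fmt.Println(job.nodeSelectors[0]) // map[disk:ssd]
}
```

Passing the original selectors back into `RestoreNodeSelectors` keeps the job type stateless in this sketch: the caller, not the job, remembers what has to be restored.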
   209  
   210  ### Kubernetes Job implementation details
   211  
We'll wrap `batchv1.Job` in a `BatchJob` type that implements the `GenericJob` interface.
   213  
```golang
type BatchJob struct {
  batchv1.Job
}

var _ GenericJob = &BatchJob{}
```
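
To make the wrapping pattern more concrete, here is a minimal sketch that runs without any Kubernetes dependencies. The `Job` and `JobSpec` types are trimmed, hypothetical stand-ins for `batchv1.Job` (only the optional `Suspend` pointer is mirrored), and `GenericJob` is reduced to three of the methods proposed above. The method bodies are one possible implementation, not necessarily the one Kueue ships.

```golang
package main

import "fmt"

// JobSpec and Job are hypothetical stand-ins for the batchv1 types; only the
// optional Suspend field is kept.
type JobSpec struct {
	Suspend *bool
}

type Job struct {
	Name string
	Spec JobSpec
}

// GenericJob is a subset of the interface proposed in this KEP.
type GenericJob interface {
	IsSuspended() bool
	Suspend() error
	UnSuspend() error
}

// BatchJob embeds Job, so the fields of Job stay directly accessible.
type BatchJob struct {
	Job
}

// Compile-time check that *BatchJob satisfies GenericJob.
var _ GenericJob = &BatchJob{}

// IsSuspended treats an unset Suspend field as "not suspended".
func (b *BatchJob) IsSuspended() bool {
	return b.Spec.Suspend != nil && *b.Spec.Suspend
}

// Suspend only mutates the in-memory object; persisting it is left to the caller.
func (b *BatchJob) Suspend() error {
	suspend := true
	b.Spec.Suspend = &suspend
	return nil
}

// UnSuspend clears the suspension on the in-memory object.
func (b *BatchJob) UnSuspend() error {
	suspend := false
	b.Spec.Suspend = &suspend
	return nil
}

func main() {
	var job GenericJob = &BatchJob{Job: Job{Name: "sample"}}
	fmt.Println(job.IsSuspended()) // false
	_ = job.Suspend()
	fmt.Println(job.IsSuspended()) // true
	_ = job.UnSuspend()
	fmt.Println(job.IsSuspended()) // false
}
```

The `var _ GenericJob = &BatchJob{}` line costs nothing at runtime; it only makes the build fail if a method of the interface is missing from the wrapper.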
   221  
In addition, we'll provide a full controller for developers to follow; all they need to do is implement the _GenericJob_ interface.
   223  
```golang
type reconcileOptions struct {
  client                     client.Client
  scheme                     *runtime.Scheme
  record                     record.EventRecorder
  manageJobsWithoutQueueName bool
  waitForPodsReady           bool
}

func GenericReconcile(ctx context.Context, req ctrl.Request, job GenericJob, opts reconcileOptions) (ctrl.Result, error) {
  // generic logic here
  return ctrl.Result{}, nil
}

// Taking batchv1.Job as an example, all we need to do is call GenericReconcile().
// r.opts is assumed to hold the reconcileOptions of this reconciler.
func (r *JobReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
  var batchJob BatchJob
  return GenericReconcile(ctx, req, &batchJob, r.opts)
}
```
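
The delegation can be tried out without controller-runtime using the sketch below. Everything in it is a simplified stand-in: `request` replaces `ctrl.Request`, the `jobs` map replaces reads from the API server, the context argument is dropped, and `genericReconcile` returns a string describing the outcome instead of a `ctrl.Result`. It also assumes that `manageJobsWithoutQueueName` decides whether a job without a queue name is ignored, which is how the option name reads but is not spelled out in this KEP.

```golang
package main

import "fmt"

// GenericJob is reduced to the single method this sketch needs.
type GenericJob interface {
	QueueName() string
}

// options is a trimmed stand-in for reconcileOptions.
type options struct {
	manageJobsWithoutQueueName bool
}

// genericReconcile holds the shared logic and only sees the interface.
func genericReconcile(job GenericJob, opts options) string {
	if job.QueueName() == "" && !opts.manageJobsWithoutQueueName {
		return "ignored"
	}
	return "managed"
}

// batchJob is a hypothetical job wrapper that carries only a queue name.
type batchJob struct {
	queue string
}

func (j *batchJob) QueueName() string { return j.queue }

// request is a stand-in for ctrl.Request.
type request struct {
	name string
}

// jobReconciler is the thin, job-specific part: it loads the object and
// hands it to the generic logic.
type jobReconciler struct {
	opts options
	jobs map[string]*batchJob // stand-in for reading from the API server
}

func (r *jobReconciler) Reconcile(req request) (string, error) {
	job, ok := r.jobs[req.name]
	if !ok {
		return "", fmt.Errorf("job %q not found", req.name)
	}
	return genericReconcile(job, r.opts), nil
}

func main() {
	r := &jobReconciler{jobs: map[string]*batchJob{
		"queued":   {queue: "team-a"},
		"unqueued": {},
	}}
	for _, name := range []string{"queued", "unqueued"} {
		result, err := r.Reconcile(request{name: name})
		fmt.Println(name, result, err)
	}
	// Prints:
	// queued managed <nil>
	// unqueued ignored <nil>
}
```

A second integration would add its own wrapper type and reconciler struct while reusing `genericReconcile` unchanged, which is the reuse this KEP is after.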
   245  
   246  ### Working Flow in pseudo-code
   247  
```golang
GenericReconcile:
    // Ignore unmanaged jobs, e.g. those lacking a queueName.
    if job.IgnoreFromQueueing():
        return

    // Ensure there's only one corresponding workload and
    // return the matched workload; it could be nil.
    workload = EnsureOneWorkload()

    // Handle the case where the job is finished.
    if job.Finished():
        // Mark the workload as finished if it is not already.
        SetWorkloadCondition()
        return

    // Handle the case where the workload is nil.
    if workload == nil:
        // If the workload is nil, the job should be suspended.
        if !job.IsSuspended():
            // When stopping the job, we'll call Suspend(), RestoreNodeSelectors() etc.,
            // and update the job with the client.
            StopJob()

        // When creating the workload, we'll call PodSets(), QueueName(), PriorityClass() etc.
        // to fill in the workload.
        workload = CreateWorkload()
        // Create the constructed workload with the client.
        // ...

    // Handle the case where the job is suspended.
    if job.IsSuspended():
        // If the job is suspended but the workload is admitted,
        // we should start the job.
        if workload.Spec.Admission != nil:
            // When starting the job, we'll call UnSuspend(), InjectNodeSelectors() etc.
            StartJob()
            return

        // If the job is suspended but its queueName has changed,
        // we should update the workload's queueName.
        // ...

    // Handle the case where the job is unsuspended.

    // If the job is unsuspended but the workload is not admitted,
    // we should suspend the job.
    if workload.Spec.Admission == nil:
        StopJob()
        return

    // Handle other logic, like all-or-nothing scheduling.
    // ...
```
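
The branch order of the pseudo-code above can be checked with a small decision function. This is a sketch under simplifying assumptions: `jobState` and `workloadState` are hypothetical snapshots rather than real objects, every branch is assumed to end the reconcile pass (the pseudo-code elides some of the returns), and the function only names the step that would run.

```golang
package main

import "fmt"

// jobState is a hypothetical snapshot of what the GenericJob methods report.
type jobState struct {
	ignored   bool // IgnoreFromQueueing()
	finished  bool // Finished()
	suspended bool // IsSuspended()
}

// workloadState is a hypothetical snapshot of the matching Workload.
type workloadState struct {
	admitted bool // stands for workload.Spec.Admission != nil
}

// nextAction follows the branch order of the pseudo-code and returns the name
// of the step that would run. A nil workload means none exists yet.
func nextAction(job jobState, wl *workloadState) string {
	switch {
	case job.ignored:
		return "Ignore"
	case job.finished:
		return "SetWorkloadCondition"
	case wl == nil && !job.suspended:
		return "StopJob+CreateWorkload"
	case wl == nil:
		return "CreateWorkload"
	case job.suspended && wl.admitted:
		return "StartJob"
	case job.suspended:
		return "KeepSuspended"
	case !wl.admitted:
		return "StopJob"
	default:
		return "OtherLogic"
	}
}

func main() {
	// A suspended job whose workload has been admitted gets started.
	fmt.Println(nextAction(jobState{suspended: true}, &workloadState{admitted: true})) // StartJob
	// A running job whose workload is no longer admitted gets stopped.
	fmt.Println(nextAction(jobState{}, &workloadState{})) // StopJob
	// A new, already running job without a workload is stopped and enqueued.
	fmt.Println(nextAction(jobState{}, nil)) // StopJob+CreateWorkload
}
```

Writing the flow as a pure function of the two snapshots makes the ordering easy to unit test, for example that a finished job is never restarted even when its workload is still admitted.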
   302  
   303  ### Test Plan
   304  
   305  <!--
   306  **Note:** *Not required until targeted at a release.*
   307  The goal is to ensure that we don't accept enhancements with inadequate testing.
   308  
   309  All code is expected to have adequate tests (eventually with coverage
   310  expectations). Please adhere to the [Kubernetes testing guidelines][testing-guidelines]
   311  when drafting this test plan.
   312  
   313  [testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md
   314  -->
   315  
   316  [x] I/we understand the owners of the involved components may require updates to
   317  existing tests to make this code solid enough prior to committing the changes necessary
   318  to implement this enhancement.
   319  
   320  ##### Prerequisite testing updates
   321  
   322  <!--
   323  Based on reviewers feedback describe what additional tests need to be added prior
   324  implementing this enhancement to ensure the enhancements have also solid foundations.
   325  -->
   326  
   327  No.
   328  
   329  #### Unit Tests
   330  
   331  <!--
   332  In principle every added code should have complete unit test coverage, so providing
   333  the exact set of tests will not bring additional value.
   334  However, if complete unit test coverage is not possible, explain the reason of it
   335  together with explanation why this is acceptable.
   336  -->
   337  
   338  <!--
   339  Additionally, try to enumerate the core package you will be touching
   340  to implement this enhancement and provide the current unit coverage for those
   341  in the form of:
   342  - <package>: <date> - <current test coverage>
   343  
   344  This can inform certain test coverage improvements that we want to do before
   345  extending the production code to implement this enhancement.
   346  -->
   347  
- `pkg/controller/workload/job`: `2023.01.30` - `5.5%` (as reported by the Go tooling)
   349  
   350  #### Integration tests
   351  
   352  <!--
   353  Describe what tests will be added to ensure proper quality of the enhancement.
   354  
   355  After the implementation PR is merged, add the names of the tests here.
   356  -->
   357  
This is mostly a refactor of the current implementation, so in principle no additional
integration tests are needed.
   360  
   361  ### Graduation Criteria
   362  
   363  <!--
   364  
   365  Clearly define what it means for the feature to be implemented and
   366  considered stable.
   367  
   368  If the feature you are introducing has high complexity, consider adding graduation
   369  milestones with these graduation criteria:
   370  - [Maturity levels (`alpha`, `beta`, `stable`)][maturity-levels]
   371  - [Feature gate][feature gate] lifecycle
   372  - [Deprecation policy][deprecation-policy]
   373  
   374  [feature gate]: https://git.k8s.io/community/contributors/devel/sig-architecture/feature-gates.md
   375  [maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions
   376  [deprecation-policy]: https://kubernetes.io/docs/reference/using-api/deprecation-policy/
   377  -->
   378  
   379  ## Implementation History
   380  
   381  <!--
   382  Major milestones in the lifecycle of a KEP should be tracked in this section.
   383  Major milestones might include:
   384  - the `Summary` and `Motivation` sections being merged, signaling SIG acceptance
   385  - the `Proposal` section being merged, signaling agreement on a proposed design
   386  - the date implementation started
   387  - the first Kubernetes release where an initial version of the KEP was available
   388  - the version of Kubernetes where the KEP graduated to general availability
   389  - when the KEP was retired or superseded
   390  -->
   391  
- 2023.01.09: KEP proposed for review, including motivation, proposal, risks,
and test plan.
   394  
   395  ## Drawbacks
   396  
   397  <!--
   398  Why should this KEP _not_ be implemented?
   399  -->
   400  
- It will increase maintenance costs, for example whenever the interface changes,
so we should keep such changes to a minimum.
   403  
   404  ## Alternatives
   405  
   406  <!--
   407  What other approaches did you consider, and why did you rule them out? These do
   408  not need to be as detailed as the proposal, but should include enough
   409  information to express the idea and why it was not acceptable.
   410  -->
   411  
Each job-like application implements its own controller from scratch.