# KEP-369: Add job interface to help integrating Kueue with other job-like applications

<!--
This is the title of your KEP. Keep it short, simple, and descriptive. A good
title can help communicate what the KEP is and should be considered as part of
any review.
-->

<!--
A table of contents is helpful for quickly jumping to sections of a KEP and for
highlighting any additional information provided beyond the standard KEP
template.

Ensure the TOC is wrapped with
  <code>&lt;!-- toc --&rt;&lt;!-- /toc --&rt;</code>
tags, and then generate with `hack/update-toc.sh`.
-->

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories (Optional)](#user-stories-optional)
    - [Story 1](#story-1)
  - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Job Interface](#job-interface)
  - [Kubernetes Job implementation details](#kubernetes-job-implementation-details)
  - [Working Flow in pseudo-code](#working-flow-in-pseudo-code)
  - [Test Plan](#test-plan)
      - [Prerequisite testing updates](#prerequisite-testing-updates)
    - [Unit Tests](#unit-tests)
    - [Integration tests](#integration-tests)
  - [Graduation Criteria](#graduation-criteria)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
<!-- /toc -->

## Summary

<!--
This section is incredibly important for producing high-quality, user-focused
documentation such as release notes or a development roadmap.
It should be
possible to collect this information before implementation begins, in order to
avoid requiring implementors to split their attention between writing release
notes and implementing the feature itself. KEP editors and SIG Docs
should help to ensure that the tone and content of the `Summary` section is
useful for a wide audience.

A good summary is probably at least a paragraph in length.

Both in this section and below, follow the guidelines of the [documentation
style guide]. In particular, wrap lines to a reasonable length, to make it
easier for reviewers to cite specific portions, and to minimize diff churn on
updates.

[documentation style guide]: https://github.com/kubernetes/community/blob/master/contributors/guide/style-guide.md
-->

Define a set of methods (a Go interface) that shapes the default behaviors of a job.
In addition, provide a full controller and one example implementation (based on Kubernetes Job) for developers
to follow when building their own controllers.
This will help the community integrate with Kueue more easily.

## Motivation

<!--
This section is for explicitly listing the motivation, goals, and non-goals of
this KEP.  Describe why the change is important and the benefits to users. The
motivation section can optionally provide links to [experience reports] to
demonstrate the interest in a KEP within the wider Kubernetes community.

[experience reports]: https://github.com/golang/go/wiki/ExperienceReports
-->

From day 0, Kueue has natively supported Kubernetes Job by leveraging its [_suspend_](https://kubernetes.io/docs/concepts/workloads/controllers/job/#suspending-a-job) capability.
This helps us build a multi-tenant job queueing system in Kubernetes, which is attractive to other job-like applications,
such as MPIJob, that lack queueing capabilities.

The good news is that Kueue is extensible and simple to integrate with through an intermediate object, named _Workload_ in Kueue.
All we need to do is build a controller that reconciles the workload and the job itself.

The difficulty is that developers who are familiar with job-like applications may have little knowledge of
Kueue's implementation details and may not know where to start when building the controller. If we provide an
interface that defines the default behaviors of the Job, and offer the Kubernetes Job integration as a standard template,
it will help them considerably.

### Goals

<!--
List the specific goals of the KEP. What is it trying to achieve? How will we
know that this has succeeded?
-->

- Define an interface which shapes the default behaviors of Job
- Provide a full controller implementation which can be reused for different jobs
- Make Kubernetes Job an implementation template of the interface

### Non-Goals

<!--
What is out of scope for this KEP? Listing non-goals helps to focus discussion
and make progress.
-->

- Integrate any job-like applications

## Proposal

<!--
This is where we get down to the specifics of what the proposal actually is.
This should have enough detail that reviewers can understand exactly what
you're proposing, but should not include things like API designs or
implementation. What is the desired outcome and how do we measure success?
The "Design Details" section below is for the real
nitty-gritty.
-->

### User Stories (Optional)

<!--
Detail the things that people will be able to do if this KEP is implemented.
Include as much detail as possible so that people can understand the "how" of
the system. The goal here is to make this feel real for users without getting
bogged down.
-->

#### Story 1

We collected feedback from the community about how to fully integrate Kueue with MPIJob,
see [#499](https://github.com/kubernetes-sigs/kueue/issues/499).

### Notes/Constraints/Caveats (Optional)

<!--
What are the caveats to the proposal?
What are some important details that didn't come across above?
Go in to as much detail as necessary here.
This might be a good place to talk about core concepts and how they relate.
-->

- The Job interface defined here is a guide for developers building their own controllers.
It's a hard constraint if they wish to use the provided controller, but they can always write their controllers from scratch.
- This will increase code complexity by wrapping the original Jobs.

### Risks and Mitigations

<!--
What are the risks of this proposal, and how do we mitigate? Think broadly.
For example, consider both security and how this will impact the larger
Kubernetes ecosystem.

How will security be reviewed, and by whom?

How will UX be reviewed, and by whom?

Consider including folks who also work outside the SIG or subproject.
-->

Providing a full controller may cause the interface to change more frequently.
We can mitigate this by keeping the interface as small as possible.

## Design Details

<!--
This section should contain enough information that the specifics of your
change are understandable. This may include API specs (though not always
required) or even code snippets. If there's any ambiguity about HOW your
proposal will be implemented, this is the place to discuss them.
-->

### Job Interface

We will define a new interface named GenericJob, which should be implemented by custom job-like applications:

```golang
type GenericJob interface {
	// Object returns the job instance.
	Object() client.Object
	// IsSuspended returns whether the job is suspended or not.
	IsSuspended() bool
	// Suspend will suspend the job.
	Suspend() error
	// UnSuspend will unsuspend the job.
	UnSuspend() error
	// InjectNodeSelectors will inject the node selectors extracted from the workload into the job.
	InjectNodeSelectors(nodeSelectors []map[string]string) error
	// RestoreNodeSelectors will restore the original node selectors of the job.
	RestoreNodeSelectors(nodeSelectors []map[string]string) error
	// Finished reports whether the job has completed or failed.
	Finished() (finished bool)
	// PodSets returns the podSets corresponding to the job.
	PodSets() []kueue.PodSet
	// EquivalentToWorkload validates whether the workload is semantically equal to the job.
	EquivalentToWorkload(wl kueue.Workload) bool
	// PriorityClass returns the job's priority class name.
	PriorityClass() string
	// QueueName returns the name of the queue the job is enqueued in.
	QueueName() string
	// IgnoreFromQueueing returns whether the job should be ignored in queueing, e.g. when it lacks a queueName.
	IgnoreFromQueueing() bool
	// PodsReady reports whether all pods derived from the job are ready.
	PodsReady() bool
}
```

### Kubernetes Job implementation details

We'll wrap batchv1.Job in `BatchJob`, which implements the GenericJob interface.

```golang
type BatchJob struct {
	batchv1.Job
}

var _ GenericJob = &BatchJob{}
```

In addition, we'll provide a full controller for developers to follow; all they need to do is implement the _GenericJob_ interface.
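
To give a rough feel for what implementing the interface involves, the sketch below adapts a stand-in job type to a trimmed subset of the methods. Everything in it is an assumption made for illustration: `suspendable` is a hypothetical subset of GenericJob, `fakeJob` is a hypothetical stand-in for batchv1.Job, and the merge and restore semantics of the node selector methods are one plausible reading of the interface comments, not the actual Kueue implementation.

```go
package main

import "fmt"

// suspendable is a trimmed, hypothetical subset of GenericJob.
type suspendable interface {
	IsSuspended() bool
	Suspend() error
	UnSuspend() error
	InjectNodeSelectors(nodeSelectors []map[string]string) error
	RestoreNodeSelectors(nodeSelectors []map[string]string) error
}

// fakeJob is a hypothetical stand-in for batchv1.Job.
// It keeps one node selector map per pod set.
type fakeJob struct {
	suspend       bool
	nodeSelectors []map[string]string
}

// fakeJobAdapter plays the role that BatchJob plays for batchv1.Job.
type fakeJobAdapter struct {
	job *fakeJob
}

// Compile-time check that the adapter satisfies the interface.
var _ suspendable = &fakeJobAdapter{}

func (a *fakeJobAdapter) IsSuspended() bool { return a.job.suspend }

func (a *fakeJobAdapter) Suspend() error {
	a.job.suspend = true
	return nil
}

func (a *fakeJobAdapter) UnSuspend() error {
	a.job.suspend = false
	return nil
}

// InjectNodeSelectors merges the given selectors into the job, one map per pod set.
func (a *fakeJobAdapter) InjectNodeSelectors(nodeSelectors []map[string]string) error {
	if len(nodeSelectors) != len(a.job.nodeSelectors) {
		return fmt.Errorf("expected %d node selectors, got %d", len(a.job.nodeSelectors), len(nodeSelectors))
	}
	for i, selector := range nodeSelectors {
		if a.job.nodeSelectors[i] == nil {
			a.job.nodeSelectors[i] = map[string]string{}
		}
		for k, v := range selector {
			a.job.nodeSelectors[i][k] = v
		}
	}
	return nil
}

// RestoreNodeSelectors replaces the job's selectors with copies of the given originals.
func (a *fakeJobAdapter) RestoreNodeSelectors(nodeSelectors []map[string]string) error {
	if len(nodeSelectors) != len(a.job.nodeSelectors) {
		return fmt.Errorf("expected %d node selectors, got %d", len(a.job.nodeSelectors), len(nodeSelectors))
	}
	for i, selector := range nodeSelectors {
		restored := make(map[string]string, len(selector))
		for k, v := range selector {
			restored[k] = v
		}
		a.job.nodeSelectors[i] = restored
	}
	return nil
}

func main() {
	adapter := &fakeJobAdapter{job: &fakeJob{
		suspend:       true,
		nodeSelectors: []map[string]string{{"disk": "ssd"}},
	}}
	// Starting a job: inject the selectors first, then unsuspend.
	if err := adapter.InjectNodeSelectors([]map[string]string{{"zone": "a"}}); err != nil {
		fmt.Println("inject failed:", err)
		return
	}
	if err := adapter.UnSuspend(); err != nil {
		fmt.Println("unsuspend failed:", err)
		return
	}
	fmt.Println(adapter.IsSuspended(), adapter.job.nodeSelectors)
}
```

The `var _ suspendable = &fakeJobAdapter{}` line mirrors the `var _ GenericJob = &BatchJob{}` assertion above: a missing or mistyped method fails at compile time rather than when the controller first calls it.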

```golang
type reconcileOptions struct {
	client                     client.Client
	scheme                     *runtime.Scheme
	record                     record.EventRecorder
	manageJobsWithoutQueueName bool
	waitForPodsReady           bool
}

func GenericReconcile(ctx context.Context, req ctrl.Request, job GenericJob, opts reconcileOptions) (ctrl.Result, error) {
	// generic logic here
	return ctrl.Result{}, nil
}

// Taking batchv1.Job as an example, all we need to do is call GenericReconcile().
// r.opts is assumed here to hold the reconciler's reconcileOptions.
func (r *JobReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var batchJob BatchJob
	return GenericReconcile(ctx, req, &batchJob, r.opts)
}
```

### Working Flow in pseudo-code

```golang
GenericReconcile:
    // Ignore unmanaged jobs, e.g. those lacking a queueName.
    if job.IgnoreFromQueueing():
        return

    // Ensure there's only one corresponding workload and
    // return the matched workload; it could be nil.
    workload = EnsureOneWorkload()

    // Handling a finished job.
    if job.Finished():
        // Mark the workload as finished if it is not already.
        SetWorkloadCondition()
        return

    // Handling a nil workload.
    if workload == nil:
        // If there is no workload, the job must be suspended.
        if !job.IsSuspended():
            // When stopping the job, we'll call Suspend(), RestoreNodeSelectors() etc.,
            // and update the job with the client.
            StopJob()

        // When creating the workload, we'll call PodSets(), QueueName(), PodsCount() etc.
        // to fill in the workload.
        workload = CreateWorkload()
        // Create the constructed workload with the client.
        // ...

    // Handling a suspended job.
    if job.IsSuspended():
        // If the job is suspended but the workload is admitted,
        // we should start the job.
        if workload.Spec.Admission != nil:
            // When starting the job, we'll call UnSuspend(), InjectNodeSelectors() etc.
            StartJob()
            return

        // If the job is suspended but its queueName has changed,
        // we should update the workload's queueName.
        // ...

    // Handling an unsuspended job.

    // If the job is unsuspended but the workload is not admitted,
    // we should suspend the job.
    if workload.Spec.Admission == nil:
        StopJob()
        return

    // Process other logic, such as all-or-nothing scheduling.
    // ...
```

### Test Plan

<!--
**Note:** *Not required until targeted at a release.*
The goal is to ensure that we don't accept enhancements with inadequate testing.

All code is expected to have adequate tests (eventually with coverage
expectations). Please adhere to the [Kubernetes testing guidelines][testing-guidelines]
when drafting this test plan.

[testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md
-->

[x] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

##### Prerequisite testing updates

<!--
Based on reviewers feedback describe what additional tests need to be added prior
implementing this enhancement to ensure the enhancements have also solid foundations.
-->

No.

#### Unit Tests

<!--
In principle every added code should have complete unit test coverage, so providing
the exact set of tests will not bring additional value.
However, if complete unit test coverage is not possible, explain the reason of it
together with explanation why this is acceptable.
-->

<!--
Additionally, try to enumerate the core package you will be touching
to implement this enhancement and provide the current unit coverage for those
in the form of:
- <package>: <date> - <current test coverage>

This can inform certain test coverage improvements that we want to do before
extending the production code to implement this enhancement.
-->

- `pkg/controller/workload/job`: `2023.01.30` - `5.5%` (as reported by the go tool)

#### Integration tests

<!--
Describe what tests will be added to ensure proper quality of the enhancement.

After the implementation PR is merged, add the names of the tests here.
-->

This is mostly a refactor of the current implementation, so in theory no additional
integration tests are needed.

### Graduation Criteria

<!--

Clearly define what it means for the feature to be implemented and
considered stable.

If the feature you are introducing has high complexity, consider adding graduation
milestones with these graduation criteria:
- [Maturity levels (`alpha`, `beta`, `stable`)][maturity-levels]
- [Feature gate][feature gate] lifecycle
- [Deprecation policy][deprecation-policy]

[feature gate]: https://git.k8s.io/community/contributors/devel/sig-architecture/feature-gates.md
[maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions
[deprecation-policy]: https://kubernetes.io/docs/reference/using-api/deprecation-policy/
-->

## Implementation History

<!--
Major milestones in the lifecycle of a KEP should be tracked in this section.
Major milestones might include:
- the `Summary` and `Motivation` sections being merged, signaling SIG acceptance
- the `Proposal` section being merged, signaling agreement on a proposed design
- the date implementation started
- the first Kubernetes release where an initial version of the KEP was available
- the version of Kubernetes where the KEP graduated to general availability
- when the KEP was retired or superseded
-->

- 2023.01.09: KEP proposed for review, including motivation, proposal, risks,
test plan.

## Drawbacks

<!--
Why should this KEP _not_ be implemented?
-->

- It will increase maintenance costs, for example whenever we change the interface,
so we should minimize such changes.

## Alternatives

<!--
What other approaches did you consider, and why did you rule them out? These do
not need to be as detailed as the proposal, but should include enough
information to express the idea and why it was not acceptable.
-->

Each job integration implements its own controller from scratch.