# Delay Pod Creation

@k82cn; Jan 7, 2019

## Table of Contents

* [Delay Pod Creation](#delay-pod-creation)
  * [Table of Contents](#table-of-contents)
  * [Motivation](#motivation)
  * [Function Detail](#function-detail)
    * [State](#state)
    * [Action](#action)
    * [Admission Webhook](#admission-webhook)
  * [Feature interaction](#feature-interaction)
    * [Queue](#queue)
    * [Quota](#quota)
    * [Operator/Controller](#operatorcontroller)
  * [Others](#others)
    * [Compatibility](#compatibility)
  * [Roadmap](#roadmap)
  * [Reference](#reference)

Created by [gh-md-toc](https://github.com/ekalinin/github-markdown-toc)

## Motivation

In a batch system, there are always several pending jobs because of limited resources and throughput.
Unlike other Kubernetes workload types, e.g. Deployment and DaemonSet, it is better to delay pod creation for
batch workloads to reduce apiserver pressure and speed up scheduling (e.g. fewer pending pods to consider).
This document introduces several enhancements to delay pod creation.

## Function Detail

### State

A new state, named `InQueue`, will be introduced to denote the phase in which jobs are ready to be allocated.
With `InQueue` added, the state transition map is updated as follows.

| From          | To             | Reason  |
|---------------|----------------|---------|
| Pending       | InQueue        | When it's ready to allocate resources to the job |
| InQueue       | Pending        | When there are not enough resources anymore |
| InQueue       | Running        | When all `spec.minMember` pods are running |

`InQueue` is a new state between `Pending` and `Running`; it lets operators/controllers start to
create pods. If the job meets errors, e.g. unschedulable, it rolls back to `Pending` instead of `InQueue` to
avoid a retry loop.

### Action

Currently, `kube-batch` supports several actions, e.g.
`allocate` and `preempt`, but all of those actions are executed
based on pending pods. To support the `InQueue` state, a new action, named `enqueue`, will be introduced.

By default, the `enqueue` action handles `PodGroup`s with an FCFS policy: it goes through all `PodGroup`s
(ordered by creation timestamp) and updates a `PodGroup`'s phase to `InQueue` if:

* there are enough idle resources for the `spec.minResources` of the `PodGroup`
* there is enough quota for the `spec.minResources` of the `PodGroup`

Because `kube-batch` handles `PodGroup`s by `spec.minResources`, the operator/controller may create more `Pod`s than
`spec.minResources` covers; in that case, the `preempt` action will be enhanced to evict overused `PodGroup`s and release
resources.

### Admission Webhook

To guarantee the transaction of `spec.minResources`, a new `MutatingAdmissionWebhook`, named `PodGroupMinResources`,
is introduced. `PodGroupMinResources` makes sure that:

* the sum of all PodGroups' `spec.minResources` in a namespace is not more than `Quota`
* if resources are reserved by `spec.minResources`, the resources cannot be used by others

Generally, it is better to let the total `Quota` be more than the available resources in the cluster, as some pods may be
unschedulable because of the scheduler's algorithm, e.g. predicates.

## Feature interaction

### Queue

Resources are shared between `Queue`s according to an algorithm, e.g. proportion by default. If the resources cannot be
fully used because of fragmentation, the `backfill` action will help with that. If a `Queue` uses more resources than it
deserves, the `reclaim` action will help to balance resources. Currently, a Pod cannot be evicted if the eviction would
break `spec.minMember`; this will be enhanced for job-level eviction.

### Quota

To delay pod creation, both `kube-batch` and `PodGroupMinResources` will watch `ResourceQuota` to decide which
`PodGroup` should be enqueued first.
The decision may be invalid because of a race condition, e.g. other
controllers creating Pods. In that case, `PodGroupMinResources` will reject `PodGroup` creation and keep the `InQueue`
state until `kube-batch` transitions it back to `Pending`. To avoid the race condition, it is better to let `kube-batch`
manage the number of `Pod`s and resources (e.g. CPU, memory) instead of `Quota`.

### Operator/Controller

The Operator/Controller should follow the above "protocol" to work together with the scheduler. A new component,
named `PodGroupController`, will be introduced later to enforce this protocol if necessary.

## Others

### Compatibility

To support this new feature, a new state and a new action are introduced; when the new `enqueue` action is
disabled in the configuration, the behaviour stays the same as before.

## Roadmap

* `InQueue` phase and `enqueue` action (v0.5+)
* Admission Controller (v0.6+)

## Reference

* [Coscheduling](https://github.com/kubernetes/enhancements/pull/639)
* [Delay Pod creation](https://github.com/kubernetes-sigs/kube-batch/issues/539)
* [PodGroup Status](https://github.com/kubernetes-sigs/kube-batch/blob/master/doc/design/podgroup-status.md)
* [Support 'spec.TotalResources' in PodGroup](https://github.com/kubernetes-sigs/kube-batch/issues/401)
* [Dynamic Admission Control](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#write-an-admission-webhook-server)
* [Add support for podGroup number limits for one queue](https://github.com/kubernetes-sigs/kube-batch/issues/452)
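
As an illustration of the `enqueue` action described in the Action section, the FCFS pass over `PodGroup`s might look roughly like the sketch below. It is not `kube-batch`'s implementation: `Resources`, `PodGroup`, `enqueue`, and the per-namespace quota map are hypothetical names, resources are reduced to CPU millicores and memory in MiB, and the sketch assumes that a group which does not fit is skipped rather than blocking the groups behind it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    """Hypothetical resource vector: CPU in millicores, memory in MiB."""
    cpu: int = 0
    memory: int = 0

    def fits_in(self, other: "Resources") -> bool:
        return self.cpu <= other.cpu and self.memory <= other.memory

    def minus(self, other: "Resources") -> "Resources":
        return Resources(self.cpu - other.cpu, self.memory - other.memory)


@dataclass
class PodGroup:
    """Hypothetical stand-in for a PodGroup; only the fields the sketch needs."""
    name: str
    namespace: str
    created: int  # creation timestamp; any orderable value works
    min_resources: Resources
    phase: str = "Pending"


def enqueue(pod_groups, idle, quota_left):
    """Move Pending PodGroups to InQueue in creation order (FCFS).

    idle: Resources currently idle in the cluster.
    quota_left: dict mapping namespace -> remaining quota as Resources.
    Returns the names of the groups that were moved to InQueue.
    """
    quota_left = dict(quota_left)  # work on a copy; the caller's map is untouched
    enqueued = []
    for pg in sorted(pod_groups, key=lambda p: p.created):
        if pg.phase != "Pending":
            continue
        ns_quota = quota_left.get(pg.namespace, Resources())
        # Both conditions from the design must hold: idle resources and quota.
        if pg.min_resources.fits_in(idle) and pg.min_resources.fits_in(ns_quota):
            pg.phase = "InQueue"
            # Assumption: minResources is reserved once a group is enqueued.
            idle = idle.minus(pg.min_resources)
            quota_left[pg.namespace] = ns_quota.minus(pg.min_resources)
            enqueued.append(pg.name)
    return enqueued


groups = [
    PodGroup("job-b", "team-1", created=2, min_resources=Resources(3000, 4096)),
    PodGroup("job-a", "team-1", created=1, min_resources=Resources(2000, 2048)),
    PodGroup("job-c", "team-2", created=3, min_resources=Resources(1000, 1024)),
]
idle = Resources(4000, 8192)
quota = {"team-1": Resources(8000, 16384), "team-2": Resources(500, 1024)}
result = enqueue(groups, idle, quota)
```

With these sample values only `job-a` is enqueued: `job-b` no longer fits into the idle CPU left after `job-a` is reserved, and `job-c` exceeds the remaining quota of its namespace. Both stay `Pending` until a later pass.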