# Delay Pod Creation

@k82cn; Jan 7, 2019

## Table of Contents

   * [Delay Pod Creation](#delay-pod-creation)
      * [Table of Contents](#table-of-contents)
      * [Motivation](#motivation)
      * [Function Detail](#function-detail)
         * [State](#state)
         * [Action](#action)
         * [Admission Webhook](#admission-webhook)
      * [Feature interaction](#feature-interaction)
         * [Queue](#queue)
         * [Quota](#quota)
         * [Operator/Controller](#operatorcontroller)
      * [Others](#others)
         * [Compatibility](#compatibility)
      * [Roadmap](#roadmap)
      * [Reference](#reference)

Created by [gh-md-toc](https://github.com/ekalinin/github-markdown-toc)

## Motivation

In a batch system, some jobs are always pending because of limited resources and throughput.
Unlike other Kubernetes workload types, e.g. Deployment and DaemonSet, batch workloads benefit from delayed
pod creation: it reduces apiserver pressure and speeds up scheduling (e.g. fewer pending pods to consider).
This document introduces several enhancements to delay pod creation.

## Function Detail

### State

A new state, named `InQueue`, is introduced to denote the phase in which a job is ready to be allocated resources.
With `InQueue` added, the state transition map is updated as follows.

| From          | To             | Reason  |
|---------------|----------------|---------|
| Pending       | InQueue        | When resources are ready to be allocated to the job |
| InQueue       | Pending        | When there are no longer enough resources |
| InQueue       | Running        | When all `spec.minMember` pods are running |

`InQueue` is a new state between `Pending` and `Running`; it lets operators/controllers start to
create pods. If the job meets an error, e.g. it is unschedulable, it rolls back to `Pending` instead of
remaining in `InQueue`, to avoid a retry loop.
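
The transition table can be read as a small lookup. The following is a minimal sketch, not the scheduler's
implementation: the type `PodGroupPhase` and the function `CanTransition` are hypothetical names, and only the
three rows of the table are encoded (transitions defined elsewhere in the PodGroup status design are omitted).

```go
package main

// PodGroupPhase is a hypothetical type for the phases named in this document.
type PodGroupPhase string

const (
	Pending PodGroupPhase = "Pending"
	InQueue PodGroupPhase = "InQueue"
	Running PodGroupPhase = "Running"
)

// allowed encodes only the three rows of the transition table above.
var allowed = map[PodGroupPhase][]PodGroupPhase{
	Pending: {InQueue},
	InQueue: {Pending, Running},
}

// CanTransition reports whether moving from one phase to another
// matches a row of the transition table.
func CanTransition(from, to PodGroupPhase) bool {
	for _, next := range allowed[from] {
		if next == to {
			return true
		}
	}
	return false
}
```

With this sketch, `CanTransition(InQueue, Pending)` is true (the rollback row), while `CanTransition(Pending, Running)`
is false because the table has no such row.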

### Action

Currently, `kube-batch` supports several actions, e.g. `allocate` and `preempt`, but all of them operate
on pending pods. To support the `InQueue` state, a new action, named `enqueue`, is introduced.

By default, the `enqueue` action handles `PodGroup`s with an FCFS policy: it goes through all `PodGroup`s
(ordered by creation timestamp) and updates a `PodGroup`'s phase to `InQueue` if:

* there are enough idle resources for the `PodGroup`'s `spec.minResources`
* there is enough quota for the `PodGroup`'s `spec.minResources`
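
The two conditions above can be sketched as a single pass over the pending `PodGroup`s. This is an illustrative
sketch under stated assumptions, not the `kube-batch` code: `Resources`, `PodGroup` and `Enqueue` are hypothetical
names, a resource missing from a map counts as zero available, a namespace without a quota entry is treated as
unlimited, and admitted resources are deducted so later `PodGroup`s in the same pass see the reduced amounts.

```go
package main

import "sort"

// Resources maps a resource name (e.g. "cpu" in millicores) to a quantity.
// Hypothetical simplification of a Kubernetes ResourceList.
type Resources map[string]int64

// fits reports whether every requested quantity is available.
func (avail Resources) fits(req Resources) bool {
	for name, q := range req {
		if avail[name] < q {
			return false
		}
	}
	return true
}

// sub deducts req from avail in place.
func (avail Resources) sub(req Resources) {
	for name, q := range req {
		avail[name] -= q
	}
}

// PodGroup carries only the fields this sketch needs.
type PodGroup struct {
	Name         string
	Namespace    string
	Created      int64 // creation timestamp, Unix seconds
	MinResources Resources
	Phase        string
}

// Enqueue sorts groups by creation time (in place) and moves each pending
// PodGroup to InQueue when both the idle resources and the namespace's
// remaining quota cover its MinResources.
func Enqueue(groups []*PodGroup, idle Resources, quota map[string]Resources) {
	sort.SliceStable(groups, func(i, j int) bool {
		return groups[i].Created < groups[j].Created
	})
	for _, pg := range groups {
		if pg.Phase != "Pending" {
			continue
		}
		// Assumption: no quota entry for the namespace means no limit.
		q, limited := quota[pg.Namespace]
		if !idle.fits(pg.MinResources) || (limited && !q.fits(pg.MinResources)) {
			continue // stays Pending; later PodGroups are still considered
		}
		idle.sub(pg.MinResources)
		if limited {
			q.sub(pg.MinResources)
		}
		pg.Phase = "InQueue"
	}
}
```

Note that a `PodGroup` that does not fit is skipped rather than blocking the ones created after it; whether the
real action should block at the head of the queue is a policy choice this sketch does not settle.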

Because `kube-batch` handles a `PodGroup` by its `spec.minResources`, the operator/controller may create more `Pod`s
than `spec.minResources` covers; in that case, the `preempt` action will be enhanced to evict overused `PodGroup`s to
release resources.

### Admission Webhook

To guarantee transactional handling of `spec.minResources`, a new `MutatingAdmissionWebhook`, named `PodGroupMinResources`,
is introduced. `PodGroupMinResources` ensures that:

* the sum of all PodGroups' `spec.minResources` in a namespace does not exceed `Quota`
* resources reserved by `spec.minResources` cannot be used by others
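
The first check above amounts to a per-resource sum compared against the namespace quota. A minimal sketch,
assuming hypothetical names (`Resources`, `AdmitPodGroup`) and assuming that a resource not listed in the quota is
unconstrained; it does not model the reservation guarantee of the second bullet, nor the webhook plumbing itself.

```go
package main

import "fmt"

// Resources maps a resource name to a quantity (hypothetical simplification).
type Resources map[string]int64

// AdmitPodGroup returns nil when the incoming PodGroup's minResources,
// added to those of the PodGroups already in the namespace, stay within
// the namespace quota for every resource the quota lists.
func AdmitPodGroup(existing []Resources, incoming, quota Resources) error {
	total := Resources{}
	for name, q := range incoming {
		total[name] += q
	}
	for _, r := range existing {
		for name, q := range r {
			total[name] += q
		}
	}
	for name, sum := range total {
		// Assumption: resources absent from the quota are not limited.
		if limit, ok := quota[name]; ok && sum > limit {
			return fmt.Errorf("minResources for %q would total %d, above quota %d", name, sum, limit)
		}
	}
	return nil
}
```

A total exactly equal to the quota is admitted, matching "does not exceed".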

Generally, it is better to set the total `Quota` higher than the resources available in the cluster, because some
pods may be unschedulable due to the scheduler's algorithms, e.g. predicates.

## Feature interaction

### Queue

Resources are shared between `Queue`s according to an algorithm (`proportion` by default). If resources cannot be
fully used because of fragmentation, the `backfill` action helps. If a `Queue` uses more resources than it
deserves, the `reclaim` action rebalances them. Currently, a Pod cannot be evicted if the eviction would
break `spec.minMember`; this will be enhanced to support job-level eviction.
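
The two conditions in this paragraph can be written as simple predicates. This is a sketch only: `Resources`,
`NeedsReclaim` and `CanEvictPod` are hypothetical names, and how the deserved share is computed (e.g. by the
`proportion` algorithm) is outside its scope.

```go
package main

// Resources maps a resource name to a quantity (hypothetical simplification).
type Resources map[string]int64

// NeedsReclaim reports whether a queue's allocation exceeds its deserved
// share for at least one resource.
func NeedsReclaim(allocated, deserved Resources) bool {
	for name, used := range allocated {
		if used > deserved[name] {
			return true
		}
	}
	return false
}

// CanEvictPod reports whether a job with `running` pods stays at or above
// spec.minMember after losing one pod.
func CanEvictPod(running, minMember int) bool {
	return running-1 >= minMember
}
```

For example, a job with 3 running pods and `spec.minMember` of 3 cannot lose a pod, so `CanEvictPod(3, 3)` is false.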

### Quota

To delay pod creation, both `kube-batch` and `PodGroupMinResources` watch `ResourceQuota` to decide which
`PodGroup` should be enqueued first. The decision may become invalid because of a race condition, e.g. when other
controllers create Pods. In that case, `PodGroupMinResources` will reject `PodGroup` creation and keep the `InQueue`
state until `kube-batch` transitions it back to `Pending`. To avoid the race condition, it is better to let `kube-batch`
manage the number of `Pod`s and resources (e.g. CPU, memory) instead of `Quota`.

### Operator/Controller

The Operator/Controller should follow the above "protocol" to work together with the scheduler. A new component,
named `PodGroupController`, will be introduced later to enforce this protocol if necessary.
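
From the controller's side, the protocol reduces to gating pod creation on the `PodGroup` phase. A minimal sketch
with hypothetical names (`PodGroupPhase`, `ShouldCreatePods`); allowing creation in `Running` as well as `InQueue`
is an assumption made here so that a running job can still add pods, not something this document specifies.

```go
package main

// PodGroupPhase is a hypothetical type for the phases named in this document.
type PodGroupPhase string

const (
	Pending PodGroupPhase = "Pending"
	InQueue PodGroupPhase = "InQueue"
	Running PodGroupPhase = "Running"
)

// ShouldCreatePods reports whether an operator/controller following the
// protocol may create pods for a PodGroup in the given phase.
func ShouldCreatePods(phase PodGroupPhase) bool {
	switch phase {
	case InQueue, Running: // Running is an assumption of this sketch
		return true
	default:
		return false // Pending: wait for the enqueue action
	}
}
```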

## Others

### Compatibility

This feature introduces a new state and a new action. When the new `enqueue` action is
disabled in the configuration, the scheduler keeps the same behaviour as before.

## Roadmap

* `InQueue` phase and `enqueue` action (v0.5+)
* Admission Controller (v0.6+)

## Reference

* [Coscheduling](https://github.com/kubernetes/enhancements/pull/639)
* [Delay Pod creation](https://github.com/kubernetes-sigs/kube-batch/issues/539)
* [PodGroup Status](https://github.com/kubernetes-sigs/kube-batch/blob/master/doc/design/podgroup-status.md)
* [Support 'spec.TotalResources' in PodGroup](https://github.com/kubernetes-sigs/kube-batch/issues/401)
* [Dynamic Admission Control](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#write-an-admission-webhook-server)
* [Add support for podGroup number limits for one queue](https://github.com/kubernetes-sigs/kube-batch/issues/452)
   115  * [Dynamic Admission Control](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#write-an-admission-webhook-server)
   116  * [Add support for podGroup number limits for one queue](https://github.com/kubernetes-sigs/kube-batch/issues/452)