volcano.sh/volcano@v1.9.0/docs/design/rescheduling.md

# Rescheduling

@[Thor-wl](https://github.com/Thor-wl); Dec 25th, 2021

## Motivation
As [Issue1777](https://github.com/volcano-sh/volcano/issues/1777) describes, **rescheduling** is important for the
following reasons:
* Resource utilization becomes unbalanced because of unreasonable scheduling strategies and dynamic changes in jobs' lifecycles.
* Node status changes, such as nodes being added or removed and pod/node taint or affinity changes.

To rebalance cluster resource utilization among nodes, we want to achieve these goals:
* Reschedule pods based on real resource utilization instead of requested resources.
* Support custom-configured rescheduling strategies.

## Design
### Workflow
1. Select the resources that are potential eviction candidates by passing them through the filter chain, for example the
queue filter and the label filter.
2. Pass the candidates through the chain of rescheduling strategies to determine which resources are to be evicted.
3. Evict the Pods attached to these resources.
4. Repeat the process above periodically.

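The four steps above can be sketched roughly as follows. This is only an illustration, not Volcano's actual code: the `Pod`, `Filter`, and `Strategy` types and the `RunOnce` function are hypothetical names, and the sketch assumes that each strategy simply narrows or reorders the candidate set it receives.

```go
package main

// Pod is a hypothetical, simplified stand-in for a schedulable pod.
type Pod struct {
	Name  string
	Queue string
}

// Filter narrows the set of potential evictees (step 1).
type Filter func([]Pod) []Pod

// Strategy selects or orders the pods to evict from the candidates (step 2).
type Strategy func([]Pod) []Pod

// RunOnce executes steps 1-3 of the workflow once and returns the pods
// that were evicted. It stops at the first eviction error.
func RunOnce(pods []Pod, filters []Filter, strategies []Strategy, evict func(Pod) error) ([]Pod, error) {
	candidates := pods
	for _, f := range filters { // step 1: resource filter chain
		candidates = f(candidates)
	}
	victims := candidates
	for _, s := range strategies { // step 2: strategies applied in order
		victims = s(victims)
	}
	evicted := make([]Pod, 0, len(victims))
	for _, p := range victims { // step 3: evict the selected pods
		if err := evict(p); err != nil {
			return evicted, err
		}
		evicted = append(evicted, p)
	}
	return evicted, nil
}
```

Step 4 would wrap `RunOnce` in a periodic loop, for example one driven by a `time.Ticker` whose period corresponds to the configured interval.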
### Resource Filter
* Queue Filter

  `Queue Filter` selects the resources in the specified queues. The rescheduling process then works only on the result set.

* Label Filter

  `Label Filter` selects the pods with the specified labels. The rescheduling process then works only on the result set.

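A minimal sketch of how the two filters might combine is shown below. The `Pod` type and the function names are hypothetical. The sketch assumes that an empty selector matches everything and that a pod must carry every selected label to pass the label filter; the design does not spell out either rule.

```go
package main

// Pod is a hypothetical, simplified pod record used only for this sketch.
type Pod struct {
	Name   string
	Queue  string
	Labels map[string]string
}

// inQueues reports whether the pod belongs to one of the selected queues.
// An empty selector matches every queue (assumption).
func inQueues(p Pod, queues []string) bool {
	if len(queues) == 0 {
		return true
	}
	for _, q := range queues {
		if p.Queue == q {
			return true
		}
	}
	return false
}

// hasLabels reports whether the pod carries every selected key/value pair.
// An empty selector matches every pod (assumption).
func hasLabels(p Pod, selector map[string]string) bool {
	for k, want := range selector {
		if got, ok := p.Labels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

// FilterCandidates keeps the pods that pass both the queue filter and the
// label filter; only these pods are considered for rescheduling.
func FilterCandidates(pods []Pod, queues []string, selector map[string]string) []Pod {
	var out []Pod
	for _, p := range pods {
		if inQueues(p, queues) && hasLabels(p, selector) {
			out = append(out, p)
		}
	}
	return out
}
```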
### Rescheduling Strategy
* OfflineOnly

  `OfflineOnly`, abbreviated as `OLO`, picks out only offline workloads, which carry the annotation
  `preemptable: true`, and reschedules only them.

* LowPriorityFirst

  `LowPriorityFirst`, abbreviated as `LPF`, sorts workloads by priority and then reschedules pods in ascending
  order.

* ShortLifeTimeFirst

  `ShortLifeTimeFirst`, abbreviated as `SLTF`, sorts workloads by running time. Pods with the shortest lifetime
  are rescheduled first. This strategy lets long-running workloads proceed without interruption.

* BigObjectFirst

  `BigObjectFirst`, abbreviated as `BOF`, selects the workloads that request the most of the **dominant resource** and
  reschedules them first. It helps improve system throughput and avoids starving small workloads.

* MoreReplicasFirst

  `MoreReplicasFirst`, abbreviated as `MRF`, sorts workloads by number of replicas. Workloads with the most replicas
  are rescheduled first. This strategy is friendly to `gang scheduling` because it takes the `minAvailable` of
  Volcano jobs into account.

* Others

  Implement the [Policy and Strategies](https://github.com/kubernetes-sigs/descheduler#policy-and-strategies) listed
  for [Descheduler](https://github.com/kubernetes-sigs/descheduler).

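As an illustration of how some of these strategies could be expressed, the sketch below implements `OfflineOnly`, `LowPriorityFirst`, and `ShortLifeTimeFirst` as a filter and two stable sorts. The `Workload` type and its fields are hypothetical simplifications; in particular, `Preemptable` stands in for the `preemptable: true` annotation and `RunningSeconds` for the workload's running time.

```go
package main

import "sort"

// Workload is a hypothetical, simplified view of a running workload.
type Workload struct {
	Name           string
	Priority       int32
	RunningSeconds int64 // time the workload has been running
	Preemptable    bool  // stands in for the `preemptable: true` annotation
}

// OfflineOnly keeps only the workloads marked as preemptable.
func OfflineOnly(ws []Workload) []Workload {
	var out []Workload
	for _, w := range ws {
		if w.Preemptable {
			out = append(out, w)
		}
	}
	return out
}

// LowPriorityFirst returns a copy sorted by ascending priority, so the
// lowest-priority workloads are rescheduled first.
func LowPriorityFirst(ws []Workload) []Workload {
	out := append([]Workload(nil), ws...)
	sort.SliceStable(out, func(i, j int) bool { return out[i].Priority < out[j].Priority })
	return out
}

// ShortLifeTimeFirst returns a copy sorted by ascending running time, so
// the most recently started workloads are rescheduled first.
func ShortLifeTimeFirst(ws []Workload) []Workload {
	out := append([]Workload(nil), ws...)
	sort.SliceStable(out, func(i, j int) bool { return out[i].RunningSeconds < out[j].RunningSeconds })
	return out
}
```

Using stable sorts means that, when strategies are chained, ties under a later strategy keep the order established by an earlier one.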
### Metrics
All decisions made by the rescheduling strategies take metrics from `Prometheus` into account. Namely, Volcano will
use the real node resource utilization and pod distribution instead of requested resources. Initially, the usage of `CPU` and
`Memory` will be collected. Other resources such as `GPU` can be added later.

## Implementation (Beta)
```yaml
## Configuration Option
actions: "enqueue, allocate, backfill, shuffle"  ## add 'shuffle' at the end of the actions
tiers:
  - plugins:
      - name: priority
      - name: gang
      - name: conformance
      - name: rescheduling       ## rescheduling plugin
        arguments:
          interval: 5m           ## optional, how often the strategies are invoked. 5 minutes by default.
          metricsPeriod: 5m      ## optional, the period of metrics used by this plugin. 5 minutes by default.
          strategies:            ## required, strategies are applied in order
            - name: offlineOnly
            - name: lowPriorityFirst
            - name: lowNodeUtilization
              params:
                thresholds:
                  "cpu": 20
                  "memory": 20
                  "pods": 20
                targetThresholds:
                  "cpu": 50
                  "memory": 50
                  "pods": 50
          queueSelector:         ## optional, select workloads in specified queues as potential evictees. All queues by default.
            - default
            - test-queue
          labelSelector:         ## optional, select workloads with specified labels as potential evictees. All labels by default.
            business: offline
            team: test
  - plugins:
      - name: overcommit
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
```

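To make the `thresholds` and `targetThresholds` parameters more concrete, the sketch below classifies a node the way the Descheduler's LowNodeUtilization strategy describes: a node is underutilized when its usage is below `thresholds` for all resources, and overutilized when its usage is above `targetThresholds` for any resource. The `Thresholds` and `NodeClass` types and the `Classify` function are hypothetical names, and whether Volcano's plugin follows exactly these rules is an assumption.

```go
package main

// Thresholds maps a resource name ("cpu", "memory", "pods") to a percentage.
type Thresholds map[string]float64

// NodeClass is the utilization class assigned to a node.
type NodeClass int

const (
	Underutilized NodeClass = iota
	Balanced
	Overutilized
)

// Classify compares a node's usage percentages against the configured
// thresholds. Pods would be evicted from Overutilized nodes in the hope
// that they land on Underutilized ones.
func Classify(usage, thresholds, targetThresholds Thresholds) NodeClass {
	under := true
	for res, limit := range thresholds {
		if usage[res] >= limit { // must be below the threshold for ALL resources
			under = false
			break
		}
	}
	if under {
		return Underutilized
	}
	for res, limit := range targetThresholds {
		if usage[res] > limit { // above the target for ANY resource
			return Overutilized
		}
	}
	return Balanced
}
```

With the values from the configuration above, a node at 10% CPU, 15% memory, and 5% pods would be classified as underutilized, while a node at 60% memory would be classified as overutilized regardless of its other resources.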
Implementation Profile:
* Load and parse the user configuration for rescheduling.
* Update the cache with metrics collected by `Prometheus`.
* Create a new action named "shuffle" that implements the workflow above.
* Create a new plugin named "rescheduling" that implements the strategies above.

According to the plan, we will implement the following functions in v1.6:
* Support configuration resolution.
* Implement the 'shuffle' action to support the rescheduling process.
* Implement the general 'rescheduling' plugin.
* Implement the [LowNodeUtilization](https://github.com/kubernetes-sigs/descheduler#lownodeutilization) strategy.
* Implement the rescheduling process based on metrics provided by `Prometheus`.

Functions to be implemented later:
* Other strategies listed above.
* Resource Filter

## TODO
* Make sure a rescheduled pod is not scheduled back to its original node or to other unfit nodes.

## Reference
* [kubernetes-sigs/descheduler](https://github.com/kubernetes-sigs/descheduler)
* [prometheus/prometheus](https://github.com/prometheus/prometheus)