# Rescheduling

@[Thor-wl](https://github.com/Thor-wl); Dec 25th, 2021

## Motivation
As [Issue1777](https://github.com/volcano-sh/volcano/issues/1777) describes, **rescheduling** is important for the
following reasons:
* Resource utilization becomes unbalanced because of unreasonable scheduling strategies and dynamic changes in jobs' lifecycles.
* Node status changes, such as adding or removing nodes and changes to pod/node taints and affinities.

To rebalance cluster resource utilization among nodes, we want to achieve these goals:
* Reschedule pods based on real resource utilization instead of requested resources.
* Support custom-configured rescheduling strategies.

## Design
### Workflow
1. Filter the resources that are potential eviction candidates through the filter chain, for example, the queue filter
   and the label filter.
2. Run the chain of rescheduling strategies to select the resources that are to be evicted.
3. Evict the pods attached to these resources.
4. Execute the process above periodically.

### Resource Filter
* Queue Filter

  `Queue Filter` selects resources in the specified queues. The rescheduling process then works only on the result set.

* Label Filter

  `Label Filter` selects pods with the specified labels. The rescheduling process then works only on the result set.

### Rescheduling Strategy
* OfflineOnly

  `OfflineOnly`, abbreviated as `OLO`, picks out only offline workloads, which carry the annotation
  `preemptable: true`, and reschedules only them.

* LowPriorityFirst

  `LowPriorityFirst`, abbreviated as `LPF`, sorts workloads by priority and then reschedules pods in ascending
  order.

* ShortLifeTimeFirst

  `ShortLifeTimeFirst`, abbreviated as `SLTF`, sorts workloads by running time. Pods with the shortest lifetime
  are rescheduled first. This strategy ensures that long-running workloads proceed without interruption.

* BigObjectFirst

  `BigObjectFirst`, abbreviated as `BOF`, selects the workloads that request the most of the **dominant resource** and
  reschedules them first. It helps improve system throughput and avoids starvation of small workloads.

* MoreReplicasFirst

  `MoreReplicasFirst`, abbreviated as `MRF`, sorts workloads by number of replicas. Workloads with the most replicas
  are rescheduled first. This strategy is friendly to `gang scheduling` because it considers the `minAvailable` of
  Volcano jobs.

* Others

  Implement the [Policy and Strategies](https://github.com/kubernetes-sigs/descheduler#policy-and-strategies) listed
  for [Descheduler](https://github.com/kubernetes-sigs/descheduler).

### Metrics
All decisions made by the rescheduling strategies take metrics from `Prometheus` into account. That is, Volcano
uses the real node resource utilization and pod distribution instead of requested resources. Initially, usage of `CPU` and
`Memory` is collected. Support for other resources such as `GPU` can be added later.

## Implementation (Beta)
```yaml
## Configuration Option
actions: "enqueue, allocate, backfill, shuffle"  ## add 'shuffle' at the end of the actions
tiers:
  - plugins:
      - name: priority
      - name: gang
      - name: conformance
      - name: rescheduling       ## rescheduling plugin
        arguments:
          interval: 5m           ## optional, the interval at which the strategies are called periodically. 5 minutes by default.
          metricsPeriod: 5m      ## optional, the period of metrics used by this plugin. 5 minutes by default.
          strategies:            ## required, strategies are applied in order
            - name: offlineOnly
            - name: lowPriorityFirst
            - name: lowNodeUtilization
              params:
                thresholds:
                  "cpu": 20
                  "memory": 20
                  "pods": 20
                targetThresholds:
                  "cpu": 50
                  "memory": 50
                  "pods": 50
          queueSelector:         ## optional, select workloads in specified queues as potential evictees. All queues by default.
            - default
            - test-queue
          labelSelector:         ## optional, select workloads with specified labels as potential evictees. All labels by default.
            business: offline
            team: test
  - plugins:
      - name: overcommit
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
```

Implementation Profile:
* Load and parse the user configuration for rescheduling.
* Update the cache with metrics collected by `Prometheus`.
* Create a new action named "shuffle" that implements the workflow above.
* Create a new plugin named "rescheduling" that implements the strategies above.

According to the plan, the following functions will be implemented in v1.6:
* Support configuration resolution.
* Implement 'shuffle' to support the rescheduling process.
* Implement the general framework of the 'rescheduling' plugin.
* Implement the [LowNodeUtilization](https://github.com/kubernetes-sigs/descheduler#lownodeutilization) strategy.
* Implement the rescheduling process based on metrics provided by `Prometheus`.

Functions to be implemented later:
* Other strategies listed above.
* Resource Filter

## TODO
* Make sure rescheduled pods are not scheduled back to the original node or to other unfit nodes.

## Reference
* [kubernetes-sigs/descheduler](https://github.com/kubernetes-sigs/descheduler)
* [prometheus/prometheus](https://github.com/prometheus/prometheus)
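
To illustrate steps 1 and 2 of the workflow in the Design section, here is a minimal Python sketch of a filter chain followed by a strategy chain. It is not Volcano's implementation: the `Pod` record, the function names, and the choice to treat the label selector as "all pairs must match" are assumptions made for this example only.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    # Hypothetical, simplified pod record used only for this sketch.
    name: str
    queue: str
    priority: int
    labels: dict = field(default_factory=dict)
    preemptable: bool = False


def queue_filter(pods, queues):
    """Keep pods in the selected queues; an empty selection means all queues."""
    return [p for p in pods if not queues or p.queue in queues]


def label_filter(pods, selector):
    """Keep pods whose labels contain every key/value pair in the selector.

    Assumption: pairs are combined with AND; an empty selector matches all pods.
    """
    return [p for p in pods
            if all(p.labels.get(k) == v for k, v in selector.items())]


def offline_only(pods):
    """OfflineOnly: keep only workloads marked as preemptable."""
    return [p for p in pods if p.preemptable]


def low_priority_first(pods):
    """LowPriorityFirst: order candidates by ascending priority."""
    return sorted(pods, key=lambda p: p.priority)


def select_victims(pods, queues, selector, strategies):
    """Apply the resource filters, then each strategy in the configured order."""
    candidates = label_filter(queue_filter(pods, queues), selector)
    for strategy in strategies:
        candidates = strategy(candidates)
    return candidates
```

With the selectors from the sample configuration (`default` and `test-queue`; `business: offline`, `team: test`) and the strategies `offline_only` followed by `low_priority_first`, the returned list would be the eviction order for step 3. The metrics-driven `lowNodeUtilization` strategy is left out because it depends on node usage data that this sketch does not model.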