# Usage based scheduling

@william-wang Feb 16 2022

## Motivation
Currently a pod is scheduled based on its resource requests and the node's allocatable resources rather than on the actual node usage. This leads to unbalanced resource usage across compute nodes: a pod may be scheduled to a node with higher usage and a lower allocation rate. This is not what users expect. Users expect the usage of each node to be balanced.

## Scope
### In scope
* Support node usage based scheduling.
* Filter out nodes whose usage is higher than the user-defined usage threshold.
* Prioritize nodes by usage and schedule pods to nodes with low usage.

### Out of Scope
* Resource oversubscription is not considered in this project.
* Node GPU resource usage is out of scope.

## Design

### Scheduler Cache
A separate goroutine is created in the scheduler cache to talk to the metrics source (such as Prometheus or Elasticsearch), which collects and aggregates node usage metrics. The node usage data in the cache is consumed by the usage based scheduling plugin and by other plugins such as the rescheduling plugin. The struct is as below.
```
// NodeUsage holds the usage metrics most recently pulled for one node.
type NodeUsage struct {
    MetricsTime time.Time          // time at which the metrics were collected
    cpuUsageAvg map[string]float64 // average CPU usage, keyed by period such as "5m"
    memUsageAvg map[string]float64 // average memory usage, keyed by period such as "5m"
}

type NodeInfo struct {
    …
    ResourceUsage NodeUsage
}
```

### Usage based scheduling plugin

* PredicateFn(): Filter out nodes whose usage is higher than the user-defined usage threshold.
* NodeOrder(): Prioritize nodes by their real-time usage.
* Preemptable(): A pod on a node with lower usage is able to preempt a pod on a node with higher usage.

### Scheduler Configuration
```
actions: "enqueue, allocate, backfill"
tiers:
- plugins:
  - name: priority
  - name: gang
  - name: conformance
  - name: usage  # usage based scheduling plugin
    enablePredicate: false  # If false, new pods can still be scheduled to a node whose load has reached the threshold. If true or left blank, such nodes stop accepting new pods.
    arguments:
      usage.weight: 5
      cpu.weight: 1
      memory.weight: 1
      thresholds:
        cpu: 80    # When the actual CPU load of a node reaches 80%, no new pods are scheduled to it.
        mem: 70    # When the actual memory load of a node reaches 70%, no new pods are scheduled to it.
- plugins:
  - name: overcommit
  - name: drf
  - name: predicates
  - name: proportion
  - name: nodeorder
  - name: binpack
metrics:                               # metrics server related configuration
  type: prometheus                     # Optional, the metrics source type, prometheus by default; supports "prometheus", "prometheus_adaptor" and "elasticsearch"
  address: http://192.168.0.10:9090    # Mandatory, the metrics source address
  interval: 30s                        # Optional, the interval at which the scheduler pulls metrics from the metrics source, 30s by default
  tls:                                 # Optional, the TLS configuration
    insecureSkipVerify: "false"        # Optional, skip the certificate verification, false by default
  elasticsearch:                       # Optional, the Elasticsearch configuration
    index: "custom-index-name"         # Optional, the Elasticsearch index name, "metricbeat-*" by default
    username: ""                       # Optional, the Elasticsearch username
    password: ""                       # Optional, the Elasticsearch password
    hostnameFieldName: "host.hostname" # Optional, the Elasticsearch hostname field name, "host.hostname" by default
```

### How to predicate node
The plugin allows users to configure thresholds for the average CPU and memory usage over the last 5 minutes.
Any node whose usage is higher than the value of `CpuUsageAvg.5m` or `MemUsageAvg.5m` is filtered out. If no threshold is configured, the node proceeds to the prioritization stage.
The 5m average usage is a typical value; more thresholds can be added in the future if needed. The key format is `CpuUsageAvg.<period>`, such as `CpuUsageAvg.1h`.

### How to prioritize node
Several factors need to be considered when evaluating which node is the best one to allocate a pod to. The first factor is the node's average usage over a period of time, such as 5m. The node with the lowest usage gets the highest score for this factor.

The second factor is the node usage fluctuation curve in a period of time.
Suppose there are two nodes with similar usage: the usage of one fluctuates over a wide range, while the usage of the other, like `node1` in the table below, fluctuates over a narrow range. `node1` is more likely to get a higher score than `node2`. This helps avoid the risk that a node gets overloaded in peak hours.

The third factor identified is the resource dimension. Take the table below as an example. If the pending pod is compute-intensive, it is more suitable to schedule it to `node2`, which has the higher memory share and the lower CPU usage. DRF might be suitable for this case: calculate the CPU, memory and GPU share for the pod and for each node, then make the best match.

Finally, there should be a model that balances the multiple factors by weight and calculates the final score for each node. Only the CPU usage factor will be considered in the alpha version.

| factors | node1 | node2 |
| ---- | ---- | --- |
| usage | cpu 80% | cpu 78% |
| usage fluctuation curve | 5 | 40 |
| resource dimension | cpu 80%, mem 20% | cpu 20%, mem 80% |
| ... | ... | ... |

### Configuration and usage of different monitoring systems
The usage monitoring data for Volcano can be obtained from "Prometheus", "Custom Metrics API" and "Elasticsearch", where the type corresponding to "Custom Metrics API" is "prometheus_adaptor".

**It is recommended to use the Custom Metrics API mode, with the monitoring metrics coming from Prometheus Adapter.**

#### Custom Metrics API
Ensure that Prometheus Adapter is properly installed in the cluster and that the custom metrics API is available.
Configure the custom metric rules. The rules to be added are as follows.
For details, see [Metrics Discovery and Presentation Configuration](https://github.com/kubernetes-sigs/prometheus-adapter/blob/master/docs/config.md#metrics-discovery-and-presentation-configuration).
```
rules:
    - seriesQuery: '{__name__=~"node_cpu_seconds_total"}'
      resources:
        overrides:
          instance:
            resource: node
      name:
        matches: "node_cpu_seconds_total"
        as: "node_cpu_usage_avg"
      metricsQuery: avg_over_time((1 - avg (irate(<<.Series>>{mode="idle"}[5m])) by (instance))[10m:30s])
    - seriesQuery: '{__name__=~"node_memory_MemTotal_bytes"}'
      resources:
        overrides:
          instance:
            resource: node
      name:
        matches: "node_memory_MemTotal_bytes"
        as: "node_memory_usage_avg"
      metricsQuery: avg_over_time(((1-node_memory_MemAvailable_bytes/<<.Series>>))[10m:30s])
```
Scheduler Configuration:
```
actions: "enqueue, allocate, backfill"
tiers:
- plugins:
  - name: priority
  - name: gang
  - name: conformance
  - name: usage  # usage based scheduling plugin
    enablePredicate: false  # If false, new pods can still be scheduled to a node whose load has reached the threshold. If true or left blank, such nodes stop accepting new pods.
    arguments:
      usage.weight: 5
      cpu.weight: 1
      memory.weight: 1
      thresholds:
        cpu: 80    # When the actual CPU load of a node reaches 80%, no new pods are scheduled to it.
        mem: 70    # When the actual memory load of a node reaches 70%, no new pods are scheduled to it.
- plugins:
  - name: overcommit
  - name: drf
  - name: predicates
  - name: proportion
  - name: nodeorder
  - name: binpack
metrics:                               # metrics server related configuration
  type: prometheus_adaptor             # Optional, the metrics source type, prometheus by default; supports "prometheus", "prometheus_adaptor" and "elasticsearch"
  interval: 30s                        # Optional, the interval at which the scheduler pulls metrics from the metrics source, 30s by default
```

#### Prometheus
Scheduler Configuration:
```
actions: "enqueue, allocate, backfill"
tiers:
- plugins:
  - name: priority
  - name: gang
  - name: conformance
  - name: usage  # usage based scheduling plugin
    enablePredicate: false  # If false, new pods can still be scheduled to a node whose load has reached the threshold. If true or left blank, such nodes stop accepting new pods.
    arguments:
      usage.weight: 5
      cpu.weight: 1
      memory.weight: 1
      thresholds:
        cpu: 80    # When the actual CPU load of a node reaches 80%, no new pods are scheduled to it.
        mem: 70    # When the actual memory load of a node reaches 70%, no new pods are scheduled to it.
- plugins:
  - name: overcommit
  - name: drf
  - name: predicates
  - name: proportion
  - name: nodeorder
  - name: binpack
metrics:                               # metrics server related configuration
  type: prometheus                     # Optional, the metrics source type, prometheus by default; supports "prometheus", "prometheus_adaptor" and "elasticsearch"
  address: http://192.168.0.10:9090    # Mandatory, the metrics source address
  interval: 30s                        # Optional, the interval at which the scheduler pulls metrics from the metrics source, 30s by default
```

#### Elasticsearch
Scheduler Configuration:
```
actions: "enqueue, allocate, backfill"
tiers:
- plugins:
  - name: priority
  - name: gang
  - name: conformance
  - name: usage  # usage based scheduling plugin
    enablePredicate: false  # If false, new pods can still be scheduled to a node whose load has reached the threshold. If true or left blank, such nodes stop accepting new pods.
    arguments:
      usage.weight: 5
      cpu.weight: 1
      memory.weight: 1
      thresholds:
        cpu: 80    # When the actual CPU load of a node reaches 80%, no new pods are scheduled to it.
        mem: 70    # When the actual memory load of a node reaches 70%, no new pods are scheduled to it.
- plugins:
  - name: overcommit
  - name: drf
  - name: predicates
  - name: proportion
  - name: nodeorder
  - name: binpack
metrics:                               # metrics server related configuration
  type: elasticsearch                  # Optional, the metrics source type, prometheus by default; supports "prometheus", "prometheus_adaptor" and "elasticsearch"
  address: http://192.168.0.10:9090    # Mandatory, the metrics source address
  interval: 30s                        # Optional, the interval at which the scheduler pulls metrics from the metrics source, 30s by default
  tls:                                 # Optional, the TLS configuration
    insecureSkipVerify: "false"        # Optional, skip the certificate verification, false by default
  elasticsearch:                       # Optional, the Elasticsearch configuration
    index: "custom-index-name"         # Optional, the Elasticsearch index name, "metricbeat-*" by default
    username: ""                       # Optional, the Elasticsearch username
    password: ""                       # Optional, the Elasticsearch password
    hostnameFieldName: "host.hostname" # Optional, the Elasticsearch hostname field name, "host.hostname" by default
```
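
Whichever metrics source is configured, the plugin applies the same rules described under "How to predicate node" and "How to prioritize node". The Go sketch below illustrates how the `thresholds`, `cpu.weight`, `memory.weight` and `usage.weight` arguments might combine. It is a minimal illustration, not the plugin's actual code: the `Thresholds` type, the helpers `overThreshold` and `score`, the exported field names, the assumption that usage values are percentages, and the "weighted free capacity" formula are all hypothetical.

```go
package main

import "fmt"

// NodeUsage follows the struct in the design above. Usage is assumed to be a
// percentage (0-100) keyed by period such as "5m"; field names are illustrative.
type NodeUsage struct {
	CPUUsageAvg map[string]float64
	MemUsageAvg map[string]float64
}

// Thresholds is a hypothetical holder for the `thresholds` arguments.
// A value of 0 is treated as "not configured".
type Thresholds struct {
	CPU float64
	Mem float64
}

const period = "5m"

// overThreshold reports whether a node would be filtered out because its
// average usage is strictly higher than a configured threshold.
func overThreshold(u NodeUsage, t Thresholds) bool {
	if t.CPU > 0 && u.CPUUsageAvg[period] > t.CPU {
		return true
	}
	if t.Mem > 0 && u.MemUsageAvg[period] > t.Mem {
		return true
	}
	return false
}

// score gives lower-usage nodes a higher score: the weighted average of free
// CPU and free memory, scaled by usageWeight. The formula is an assumption.
func score(u NodeUsage, cpuWeight, memWeight, usageWeight float64) float64 {
	total := cpuWeight + memWeight
	if total == 0 {
		return 0
	}
	cpuFree := 100 - u.CPUUsageAvg[period]
	memFree := 100 - u.MemUsageAvg[period]
	return usageWeight * (cpuWeight*cpuFree + memWeight*memFree) / total
}

func main() {
	nodes := map[string]NodeUsage{
		"node-a": {CPUUsageAvg: map[string]float64{period: 30}, MemUsageAvg: map[string]float64{period: 50}},
		"node-b": {CPUUsageAvg: map[string]float64{period: 85}, MemUsageAvg: map[string]float64{period: 40}},
	}
	t := Thresholds{CPU: 80, Mem: 70}
	for name, u := range nodes {
		if overThreshold(u, t) {
			fmt.Printf("%s: filtered out\n", name)
			continue
		}
		fmt.Printf("%s: score %.0f\n", name, score(u, 1, 1, 5))
	}
}
```

With the weights from the configurations above, a node at 30% CPU and 50% memory outranks any node with higher usage, and a node at 85% CPU is dropped before scoring when `enablePredicate` is in effect.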