# Usage based scheduling
@william-wang Feb 16 2022

## Motivation
Currently, pods are scheduled based on their resource requests and each node's allocatable resources rather than on actual node usage. This leads to unbalanced resource usage across compute nodes: a pod may be scheduled to a node with high usage but a low allocation rate. This is not what users expect. Users expect the usage of each node to be balanced.

## Scope
### In scope
* Support node usage based scheduling.
* Filter out nodes whose usage is higher than the user-defined usage threshold.
* Prioritize nodes by usage and schedule pods to nodes with low usage.

### Out of scope
* Resource oversubscription is not considered in this project.
* Node GPU resource usage is out of scope.

## Design

### Scheduler Cache
A separate goroutine is created in the scheduler cache to talk to the metrics source (such as Prometheus or Elasticsearch), which collects and aggregates node usage metrics. The node usage data in the cache is consumed by the usage based scheduling plugin and by other plugins such as the rescheduling plugin. The structs are shown below.
```
type NodeUsage struct {
    MetricsTime time.Time          // timestamp of the metrics sample
    cpuUsageAvg map[string]float64 // average CPU usage, keyed by period, e.g. "5m"
    memUsageAvg map[string]float64 // average memory usage, keyed by period, e.g. "5m"
}

type NodeInfo struct {
    …
    ResourceUsage NodeUsage
}
```

### Usage based scheduling plugin

* PredicateFn(): Filter out nodes whose usage is higher than the user-defined usage threshold.
* NodeOrder(): Prioritize nodes by their real-time usage.
* Preemptable(): A pod on a node with lower usage can preempt a pod on a node with higher usage.

### Scheduler Configuration
```
actions: "enqueue, allocate, backfill"
tiers:
  - plugins:
      - name: priority
      - name: gang
      - name: conformance
      - name: usage  # usage based scheduling plugin
        enablePredicate: false  # If false, nodes whose load has reached a threshold can still receive new pods. If true or left blank, new pods are not scheduled to such nodes.
        arguments:
          usage.weight: 5
          cpu.weight: 1
          memory.weight: 1
          thresholds:
            cpu: 80    # When the actual CPU load of a node reaches 80%, no new pods are scheduled to it.
            mem: 70    # When the actual memory load of a node reaches 70%, no new pods are scheduled to it.
  - plugins:
      - name: overcommit
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
metrics:                               # metrics source related configuration
  type: prometheus                     # Optional, the metrics source type, "prometheus" by default; supports "prometheus", "prometheus_adaptor" and "elasticsearch"
  address: http://192.168.0.10:9090    # Mandatory, the metrics source address
  interval: 30s                        # Optional, the interval at which the scheduler pulls metrics from the metrics source, 30s by default
  tls:                                 # Optional, the TLS configuration
    insecureSkipVerify: "false"        # Optional, skip certificate verification, false by default
  elasticsearch:                       # Optional, the Elasticsearch configuration
    index: "custom-index-name"         # Optional, the Elasticsearch index name, "metricbeat-*" by default
    username: ""                       # Optional, the Elasticsearch username
    password: ""                       # Optional, the Elasticsearch password
    hostnameFieldName: "host.hostname" # Optional, the Elasticsearch hostname field name, "host.hostname" by default
```

### How to predicate node
The plugin lets users configure thresholds for the average CPU and memory usage over the last 5 minutes.
Any node whose usage is higher than the threshold for `CpuUsageAvg.5m` or `MemUsageAvg.5m` is filtered out. If no threshold is configured, no node is filtered and every node proceeds to the prioritization stage.
The 5m average usage is a typical value; more thresholds can be added in the future if needed. The key format is `CpuUsageAvg.<period>`, for example `CpuUsageAvg.1h`.
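
The filtering rule above can be sketched as follows. This is an illustrative sketch, not the plugin's actual code: the `Thresholds` and `nodeUsage` types and both functions are hypothetical, usage is assumed to be a percentage, a threshold of zero is assumed to mean "not configured", and the comparison is assumed to be strict ("higher than").

```go
package main

import (
	"fmt"
	"sort"
)

// Thresholds holds the user-defined limits in percent.
// Assumption: zero means the threshold is not configured.
type Thresholds struct{ CPU, Mem float64 }

// nodeUsage holds the 5m average usage of a node in percent (assumed units).
type nodeUsage struct{ CPUAvg, MemAvg float64 }

// overloaded reports whether a node exceeds any configured threshold.
func overloaded(u nodeUsage, t Thresholds) bool {
	if t.CPU > 0 && u.CPUAvg > t.CPU {
		return true
	}
	if t.Mem > 0 && u.MemAvg > t.Mem {
		return true
	}
	return false
}

// filterNodes returns the sorted names of the nodes that pass the predicate.
func filterNodes(usage map[string]nodeUsage, t Thresholds) []string {
	var passed []string
	for name, u := range usage {
		if !overloaded(u, t) {
			passed = append(passed, name)
		}
	}
	sort.Strings(passed)
	return passed
}

func main() {
	usage := map[string]nodeUsage{
		"node-a": {CPUAvg: 85, MemAvg: 40}, // CPU above 80: filtered out
		"node-b": {CPUAvg: 60, MemAvg: 75}, // memory above 70: filtered out
		"node-c": {CPUAvg: 50, MemAvg: 50}, // below both thresholds: kept
	}
	fmt.Println(filterNodes(usage, Thresholds{CPU: 80, Mem: 70}))
}
```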

### How to prioritize node
Several factors need to be considered when evaluating which node is the best candidate for a pod. The first factor is the node's average usage over a period of time, such as 5m. The node with the lowest usage gets the highest score for this factor.

The second factor is the node's usage fluctuation over a period of time.
Suppose there are two nodes with similar usage: the usage of one fluctuates over a narrow range, like `node1` in the table below, and the usage of the other fluctuates over a wide range, like `node2`. `node1` is more likely to get a higher score than `node2`. This helps avoid the risk of a node becoming overloaded during peak hours.

The third factor is the resource dimension. Take the table below as an example: if the pending pod is compute-intensive, it is more suitable to schedule it to `node2`, whose usage is dominated by memory and which therefore has more CPU headroom. DRF might be suitable for this case: calculate the cpu, mem and gpu shares for the pod and each node, then make the best match.

Finally, there should be a model that balances multiple factors by weight and calculates the final score for each node. Only the cpu usage factor will be considered in the alpha version.

| factors                   | node1            | node2            |
| ----                      | ----             | ----             |
| usage                     | cpu 80%          | cpu 78%          |
| usage fluctuation curve   | 5                | 40               |
| resource dimension        | cpu 80%, mem 20% | cpu 20%, mem 80% |
| ...                       | ...              | ...              |
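
One possible way to combine the usage factor with the configured weights is sketched below. This is only an assumed model for illustration, not necessarily the formula the plugin uses: the `Weights` type, the `usageScore` function and the `maxNodeScore` constant are hypothetical, and the fields of `Weights` correspond to the `usage.weight`, `cpu.weight` and `memory.weight` arguments of the scheduler configuration. The score is the weighted average of the free CPU and memory fractions, scaled by a maximum node score and by the usage weight, so a node with lower usage scores higher.

```go
package main

import "fmt"

// maxNodeScore is an assumed upper bound for a single factor's score.
const maxNodeScore = 100

// Weights corresponds to usage.weight, cpu.weight and memory.weight.
type Weights struct{ Usage, CPU, Memory float64 }

// usageScore returns a higher score for a node with lower usage.
// cpuPct and memPct are average usage values in percent (0 to 100).
func usageScore(cpuPct, memPct float64, w Weights) float64 {
	total := w.CPU + w.Memory
	if total == 0 {
		return 0 // no dimension is weighted, so usage does not influence ordering
	}
	// Weighted sum of the free percentage in each dimension.
	weighted := (100-cpuPct)*w.CPU + (100-memPct)*w.Memory
	return weighted / (100 * total) * maxNodeScore * w.Usage
}

func main() {
	// Weighting CPU more heavily than memory favours the node with more CPU headroom.
	w := Weights{Usage: 5, CPU: 3, Memory: 1}
	fmt.Println("node1:", usageScore(80, 20, w)) // cpu 80%, mem 20%
	fmt.Println("node2:", usageScore(20, 80, w)) // cpu 20%, mem 80%
}
```

With equal CPU and memory weights the two nodes in the "resource dimension" row would score the same; raising `cpu.weight` makes `node2`, which has more free CPU, the preferred node.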

### Configuration and usage of different monitoring systems
Volcano can obtain usage monitoring data from "Prometheus", "Custom Metrics API" and "Elasticsearch". The metrics type that corresponds to "Custom Metrics API" is "prometheus_adaptor".

**It is recommended to use the Custom Metrics API mode, in which the monitoring metrics come from Prometheus Adapter.**

#### Custom Metrics API
Ensure that Prometheus Adapter is properly installed in the cluster and that the custom metrics API is available.
Configure the custom metric rules. The rules to be added are as follows. For details, see [Metrics Discovery and Presentation Configuration](https://github.com/kubernetes-sigs/prometheus-adapter/blob/master/docs/config.md#metrics-discovery-and-presentation-configuration).
```
rules:
    - seriesQuery: '{__name__=~"node_cpu_seconds_total"}'
      resources:
        overrides:
          instance:
            resource: node
      name:
        matches: "node_cpu_seconds_total"
        as: "node_cpu_usage_avg"
      metricsQuery: avg_over_time((1 - avg (irate(<<.Series>>{mode="idle"}[5m])) by (instance))[10m:30s])
    - seriesQuery: '{__name__=~"node_memory_MemTotal_bytes"}'
      resources:
        overrides:
          instance:
            resource: node
      name:
        matches: "node_memory_MemTotal_bytes"
        as: "node_memory_usage_avg"
      metricsQuery: avg_over_time(((1-node_memory_MemAvailable_bytes/<<.Series>>))[10m:30s])
```
Scheduler Configuration:
```
actions: "enqueue, allocate, backfill"
tiers:
  - plugins:
      - name: priority
      - name: gang
      - name: conformance
      - name: usage  # usage based scheduling plugin
        enablePredicate: false  # If false, nodes whose load has reached a threshold can still receive new pods. If true or left blank, new pods are not scheduled to such nodes.
        arguments:
          usage.weight: 5
          cpu.weight: 1
          memory.weight: 1
          thresholds:
            cpu: 80    # When the actual CPU load of a node reaches 80%, no new pods are scheduled to it.
            mem: 70    # When the actual memory load of a node reaches 70%, no new pods are scheduled to it.
  - plugins:
      - name: overcommit
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
metrics:                               # metrics source related configuration
  type: prometheus_adaptor             # Optional, the metrics source type, "prometheus" by default; supports "prometheus", "prometheus_adaptor" and "elasticsearch"
  interval: 30s                        # Optional, the interval at which the scheduler pulls metrics from the metrics source, 30s by default
```

#### Prometheus
Scheduler Configuration:
```
actions: "enqueue, allocate, backfill"
tiers:
  - plugins:
      - name: priority
      - name: gang
      - name: conformance
      - name: usage  # usage based scheduling plugin
        enablePredicate: false  # If false, nodes whose load has reached a threshold can still receive new pods. If true or left blank, new pods are not scheduled to such nodes.
        arguments:
          usage.weight: 5
          cpu.weight: 1
          memory.weight: 1
          thresholds:
            cpu: 80    # When the actual CPU load of a node reaches 80%, no new pods are scheduled to it.
            mem: 70    # When the actual memory load of a node reaches 70%, no new pods are scheduled to it.
  - plugins:
      - name: overcommit
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
metrics:                               # metrics source related configuration
  type: prometheus                     # Optional, the metrics source type, "prometheus" by default; supports "prometheus", "prometheus_adaptor" and "elasticsearch"
  address: http://192.168.0.10:9090    # Mandatory, the metrics source address
  interval: 30s                        # Optional, the interval at which the scheduler pulls metrics from the metrics source, 30s by default
```

#### Elasticsearch
Scheduler Configuration:
```
actions: "enqueue, allocate, backfill"
tiers:
  - plugins:
      - name: priority
      - name: gang
      - name: conformance
      - name: usage  # usage based scheduling plugin
        enablePredicate: false  # If false, nodes whose load has reached a threshold can still receive new pods. If true or left blank, new pods are not scheduled to such nodes.
        arguments:
          usage.weight: 5
          cpu.weight: 1
          memory.weight: 1
          thresholds:
            cpu: 80    # When the actual CPU load of a node reaches 80%, no new pods are scheduled to it.
            mem: 70    # When the actual memory load of a node reaches 70%, no new pods are scheduled to it.
  - plugins:
      - name: overcommit
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
metrics:                               # metrics source related configuration
  type: elasticsearch                  # Optional, the metrics source type, "prometheus" by default; supports "prometheus", "prometheus_adaptor" and "elasticsearch"
  address: http://192.168.0.10:9090    # Mandatory, the metrics source address
  interval: 30s                        # Optional, the interval at which the scheduler pulls metrics from the metrics source, 30s by default
  tls:                                 # Optional, the TLS configuration
    insecureSkipVerify: "false"        # Optional, skip certificate verification, false by default
  elasticsearch:                       # Optional, the Elasticsearch configuration
    index: "custom-index-name"         # Optional, the Elasticsearch index name, "metricbeat-*" by default
    username: ""                       # Optional, the Elasticsearch username
    password: ""                       # Optional, the Elasticsearch password
    hostnameFieldName: "host.hostname" # Optional, the Elasticsearch hostname field name, "host.hostname" by default
```