github.com/gocrane/crane@v0.11.0/docs/tutorials/qos-interference-detection-and-active-avoidance.md

## QoS Ensurance Architecture
The QoS ensurance architecture is shown below. It contains three modules:

1. State collector: collects metrics periodically.
2. Anomaly analyzer: analyzes the collected metrics to detect anomalies triggered on the node.
3. Action executor: executes avoidance actions, including disabling scheduling, throttling, and eviction.

![crane-qos-ensurance](../images/crane-qos-ensurance.png)

The main process:

1. The state collector synchronizes policies from kube-apiserver.
2. If the policies have changed, the state collector updates its collectors.
3. The state collector collects metrics periodically.
4. The state collector transmits the metrics to the anomaly analyzer.
5. The anomaly analyzer iterates over all rules to check whether the avoidance threshold or the restore threshold has been reached.
6. The anomaly analyzer merges the analysis results and notifies the action executor of the avoidance actions.
7. The action executor executes actions based on the analysis results.

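The steps above can be read as one pass of a collect, analyze, execute loop. The following is a minimal Python sketch of that flow, not Crane's actual implementation: the `Rule` class and the `collect`, `analyze`, and `execute` functions are hypothetical names, the `>=` comparison is an assumption, and the consecutive-count logic behind the avoidance and restore thresholds is left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    """Hypothetical, simplified view of one NodeQOS rule."""
    name: str
    metric_name: str   # corresponds to metricRule.name
    value: float       # corresponds to metricRule.value
    action_name: str   # corresponds to actionName


def collect(collectors: Dict[str, Callable[[], float]]) -> Dict[str, float]:
    """State collector: read every registered metric once."""
    return {name: read() for name, read in collectors.items()}


def analyze(rules: List[Rule], metrics: Dict[str, float]) -> List[str]:
    """Anomaly analyzer: walk all rules and merge the actions to trigger."""
    actions: List[str] = []
    for rule in rules:
        reached = metrics.get(rule.metric_name, 0.0) >= rule.value  # assumed comparison
        if reached and rule.action_name not in actions:
            actions.append(rule.action_name)  # merge: each action appears once
    return actions


def execute(actions: List[str], executors: Dict[str, Callable[[], str]]) -> List[str]:
    """Action executor: run each merged action and return its result."""
    return [executors[name]() for name in actions]
```

Merging in `analyze` means that two rules pointing at the same action cause that action to run only once per pass.
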
## Interference Detection and Active Avoidance

### Related CR
AvoidanceAction defines the operations to perform after interference is detected, such as disabling scheduling, throttling, and eviction, along with their related parameters.

NodeQOS defines the metrics collection method and its parameters, the watermark parameters, and the avoidance operation to apply when a metric is abnormal. These settings are associated with specific nodes through a series of selectors.

PodQOS defines the AvoidanceActions that may be executed on a specified pod. It is usually paired with NodeQOS to limit the scope of an action along both the node and pod dimensions. PodQOS supports label selectors as well as filtering by QOSClass (for example "BestEffort" or "Guaranteed"), Priority, and Namespace. These selectors are combined with a logical "AND".

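To illustrate the "AND" semantics of the PodQOS selectors, here is a small Python sketch under stated assumptions. The `Pod` and `PodQOSSelector` classes are hypothetical stand-ins rather than Crane types, only an "In"-style match is modeled, and Priority is omitted because it would follow the same pattern.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Pod:
    """Hypothetical, minimal pod description."""
    labels: Dict[str, str]
    qos_class: str
    namespace: str


@dataclass
class PodQOSSelector:
    """Hypothetical selector; None means "do not filter on this dimension"."""
    match_labels: Dict[str, str] = field(default_factory=dict)
    qos_classes: Optional[List[str]] = None
    namespaces: Optional[List[str]] = None


def matches(selector: PodQOSSelector, pod: Pod) -> bool:
    """A pod is selected only if every configured dimension matches (AND)."""
    label_ok = all(pod.labels.get(k) == v for k, v in selector.match_labels.items())
    qos_ok = selector.qos_classes is None or pod.qos_class in selector.qos_classes
    ns_ok = selector.namespaces is None or pod.namespace in selector.namespaces
    return label_ok and qos_ok and ns_ok
```

With this reading, adding a selector can only narrow the set of pods an action may touch, never widen it.
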
### Disable Scheduling

The following AvoidanceAction and NodeQOS can be defined so that, when node CPU usage reaches the threshold, the disable scheduling action is executed for the node.

The sample YAML is shown below:

```yaml
apiVersion: ensurance.crane.io/v1alpha1
kind: AvoidanceAction
metadata:
  labels:
    app: system
  name: disablescheduling
spec:
  description: disable scheduling new pods to the node
  coolDownSeconds: 300  # The minimum wait time for the node to go from scheduling-disabled status back to normal status
```

```yaml
apiVersion: ensurance.crane.io/v1alpha1
kind: NodeQOS
metadata:
  name: "watermark1"
spec:
  nodeQualityProbe:
    timeoutSeconds: 10
    nodeLocalGet:
      localCacheTTLSeconds: 60
  rules:
  - name: "cpu-usage"
    avoidanceThreshold: 2 #(1)
    restoreThreshold: 2 #(2)
    actionName: "disablescheduling" #(3)
    strategy: "None" #(4)
    metricRule:
      name: "cpu_total_usage" #(5)
      value: 4000 #(6)
```

1. The rule is considered triggered when the threshold has been reached this many times in a row.
2. The rule is considered restored when the threshold has not been reached this many times in a row.
3. Name of the associated AvoidanceAction.
4. Strategy for the action. Set it to "Preview" to evaluate the rule without actually performing the action.
5. Name of the metric.
6. Threshold value of the metric.

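The interplay of `avoidanceThreshold` and `restoreThreshold` (annotations 1 and 2) can be sketched as a pair of consecutive counters. This is an illustrative Python sketch, not Crane's code: the `RuleState` class is hypothetical, and both the `>=` comparison and the reset-on-opposite-sample behavior are assumptions.

```python
class RuleState:
    """Hypothetical tracker for one rule's triggered/restored state."""

    def __init__(self, metric_threshold: float, avoidance_threshold: int, restore_threshold: int):
        self.metric_threshold = metric_threshold
        self.avoidance_threshold = avoidance_threshold
        self.restore_threshold = restore_threshold
        self.triggered = False
        self._over = 0    # consecutive samples at or above the metric threshold
        self._under = 0   # consecutive samples below the metric threshold

    def observe(self, metric_value: float) -> bool:
        """Feed one sample and return whether the rule is currently triggered."""
        if metric_value >= self.metric_threshold:
            self._over += 1
            self._under = 0
        else:
            self._under += 1
            self._over = 0
        if not self.triggered and self._over >= self.avoidance_threshold:
            self.triggered = True
        elif self.triggered and self._under >= self.restore_threshold:
            self.triggered = False
        return self.triggered
```

Requiring several consecutive samples in each direction keeps a single spike or dip from flipping the node between avoidance and normal state.
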
```yaml
apiVersion: ensurance.crane.io/v1alpha1
kind: PodQOS
metadata:
  name: all-elastic-pods
spec:
  allowedActions: #(1)
    - disablescheduling
  labelSelector: #(2)
    matchLabels:
      preemptible_job: "true"
```

1. The action allowed to be executed on the pods associated with this PodQOS is disablescheduling.
2. Pods with the label preemptible_job: "true" are associated via the label selector.

Watch the video below to learn more about the disable scheduling action.

<script id="asciicast-480735" src="https://asciinema.org/a/480735.js" async></script>

### Throttle

The following AvoidanceAction and NodeQOS can be defined so that, when node CPU usage reaches the threshold, the throttle action is executed for the node.

The sample YAML is shown below:

```yaml
apiVersion: ensurance.crane.io/v1alpha1
kind: AvoidanceAction
metadata:
  name: throttle
  labels:
    app: system
spec:
  coolDownSeconds: 300
  throttle:
    cpuThrottle:
      minCPURatio: 10 #(1)
      stepCPURatio: 10 #(2)
  description: "throttle low priority pods"
```

1. The minimum ratio of the CPU quota. If throttling would push a pod's quota below this ratio, the quota is set to this ratio instead.
2. The step for the throttle action. Each time avoidance is triggered, the CPU quota is reduced by this percentage; each time the rule is restored, the CPU quota is increased by this percentage.

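The effect of `minCPURatio` and `stepCPURatio` can be worked through with a little arithmetic. The sketch below is an assumption-laden illustration, not Crane's implementation: the text does not state what baseline the percentages apply to, so both are taken relative to the pod's original CPU quota, and the function names are hypothetical.

```python
def throttle_step(current: int, original: int, min_ratio: int = 10, step_ratio: int = 10) -> int:
    """Lower the quota by one step, but never below the minimum ratio."""
    floor = original * min_ratio // 100   # assumed baseline: the original quota
    step = original * step_ratio // 100
    return max(floor, current - step)


def restore_step(current: int, original: int, step_ratio: int = 10) -> int:
    """Raise the quota by one step, but never above the original quota."""
    step = original * step_ratio // 100
    return min(original, current + step)
```

Under these assumptions, a pod with an original quota of 100000 drops to 90000 after the first avoidance trigger, keeps dropping by 10000 per trigger, and stops at 10000.
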
```yaml
apiVersion: ensurance.crane.io/v1alpha1
kind: NodeQOS
metadata:
  name: "watermark2"
spec:
  nodeQualityProbe:
    timeoutSeconds: 10
    nodeLocalGet:
      localCacheTTLSeconds: 60
  rules:
    - name: "cpu-usage"
      avoidanceThreshold: 2
      restoreThreshold: 2
      actionName: "throttle"
      strategy: "None"
      metricRule:
        name: "cpu_total_usage"
        value: 6000
```

```yaml title="PodQOS"
apiVersion: ensurance.crane.io/v1alpha1
kind: PodQOS
metadata:
  name: all-be-pods
spec:
  allowedActions:
    - throttle
  scopeSelector:
    matchExpressions:
      - operator: In
        scopeName: QOSClass
        values:
          - BestEffort
```

### Eviction

The following YAML shows another case: low priority pods on the node are evicted when node CPU usage reaches the threshold.

```yaml
apiVersion: ensurance.crane.io/v1alpha1
kind: AvoidanceAction
metadata:
  name: eviction
  labels:
    app: system
spec:
  coolDownSeconds: 300
  eviction:
    terminationGracePeriodSeconds: 30 #(1)
  description: "evict low priority pods"
```

1. Duration in seconds the pod needs to terminate gracefully.

```yaml
apiVersion: ensurance.crane.io/v1alpha1
kind: NodeQOS
metadata:
  name: "watermark3"
  labels:
    app: "system"
spec:
  nodeQualityProbe:
    timeoutSeconds: 10
    nodeLocalGet:
      localCacheTTLSeconds: 60
  rules:
  - name: "cpu-usage"
    avoidanceThreshold: 2
    restoreThreshold: 2
    actionName: "eviction"
    strategy: "Preview" #(1)
    metricRule:
      name: "cpu_total_usage"
      value: 6000
```

1. Strategy for the action. With "Preview", the action is not actually performed.

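One way to picture the "Preview" strategy is as a dry-run switch in front of the action. The following Python sketch is purely illustrative: `run_action` and its log format are hypothetical and do not reflect how Crane reports previewed actions.

```python
from typing import Callable, List


def run_action(name: str, strategy: str, perform: Callable[[], None], log: List[str]) -> bool:
    """Run an avoidance action unless the rule's strategy is "Preview".

    Returns True if the action was actually performed.
    """
    if strategy == "Preview":
        log.append(f"preview: would execute {name}")  # record only, no side effect
        return False
    perform()
    log.append(f"executed {name}")
    return True
```

A dry run of this kind makes it possible to check that a threshold fires when expected before letting it evict real workloads.
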
```yaml title="PodQOS"
apiVersion: ensurance.crane.io/v1alpha1
kind: PodQOS
metadata:
  name: all-elastic-pods
spec:
  allowedActions:
    - eviction
  labelSelector:
    matchLabels:
      preemptible_job: "true"
```

### Supported Metrics

Name | Description
-----|------------
cpu_total_usage | Node CPU usage
cpu_total_utilization | Node CPU utilization, as a percentage
memory_total_usage | Node memory usage
memory_total_utilization | Node memory utilization, as a percentage

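When choosing between an absolute usage metric and a utilization metric, it can help to convert one into the other. The table does not state units, so the sketch below assumes `cpu_total_usage` is expressed in millicores (1000 per core); that assumption and the `utilization_percent` helper are illustrative only.

```python
def utilization_percent(usage_millicores: float, node_cores: int) -> float:
    """Convert an absolute CPU usage into a percentage of node capacity.

    Assumes usage is in millicores, i.e. 1000 units per core.
    """
    return usage_millicores / (node_cores * 1000) * 100
```

Under that assumption, a threshold of 4000 corresponds to 50% on an 8-core node but 25% on a 16-core node, which is why a utilization metric may suit clusters with mixed node sizes.
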
For details, refer to the examples under examples/ensurance.

### Use with Dynamic Resources
To keep active avoidance operations from affecting high-priority services, for example by wrongly evicting an important service, it is recommended to use PodQOS to select the workloads that use dynamic resources. That way, only workloads running on idle resources are affected when actions are executed, which preserves the stability of the core business on the node.

For more on dynamic resources, see qos-dynamic-resource-oversold-and-limit.md.