## QoS Ensurance Architecture
The QoS ensurance architecture is shown below. It contains three modules:

1. State collector: collects metrics periodically.
2. Anomaly analyzer: analyzes the collected metrics to detect anomalies triggered on the node.
3. Action executor: executes avoidance actions, including disabling scheduling, throttling, and eviction.

The main process:

1. The state collector synchronizes policies from kube-apiserver.
2. If the policies have changed, the state collector updates the collectors.
3. The state collector collects metrics periodically.
4. The state collector transmits the metrics to the anomaly analyzer.
5. The anomaly analyzer iterates over all rules to determine whether the avoidance threshold or the restore threshold has been reached.
6. The anomaly analyzer merges the analysis results and notifies the action executor of the avoidance actions.
7. The action executor executes actions based on the analysis results.

## Interference Detection and Active Avoidance

### Related CR
AvoidanceAction mainly defines the operations to perform after interference is detected, such as disabling scheduling, throttling, and eviction, together with their related parameters.

NodeQOS mainly defines the metrics collection method and parameters, the watermark parameters, and the avoidance operation to apply when metrics are abnormal.
These settings are associated with the specified nodes through a series of selectors.

PodQOS defines the AvoidanceActions that may be executed on the specified pods, and is usually paired with NodeQOS to limit the scope of actions along both the node and pod dimensions.
PodQOS supports a label selector and can also filter pods by QOSClass ("BestEffort", "Guaranteed", etc.),
Priority, and Namespace. These selectors are combined with "AND" semantics.

### Disable Scheduling

With the following AvoidanceAction and NodeQOS defined, the disable scheduling action is executed for the node when node CPU usage reaches the threshold.

The sample YAML looks like this:

```yaml
apiVersion: ensurance.crane.io/v1alpha1
kind: AvoidanceAction
metadata:
  labels:
    app: system
  name: disablescheduling
spec:
  description: disable schedule new pods to the node
  coolDownSeconds: 300 # The minimum wait time of the node from scheduling disable status to normal status
```

```yaml
apiVersion: ensurance.crane.io/v1alpha1
kind: NodeQOS
metadata:
  name: "watermark1"
spec:
  nodeQualityProbe:
    timeoutSeconds: 10
    nodeLocalGet:
      localCacheTTLSeconds: 60
  rules:
  - name: "cpu-usage"
    avoidanceThreshold: 2 #(1)
    restoreThreshold: 2 #(2)
    actionName: "disablescheduling" #(3)
    strategy: "None" #(4)
    metricRule:
      name: "cpu_total_usage" #(5)
      value: 4000 #(6)
```

1. The rule is considered triggered when the threshold has been reached this many consecutive times.
2. The rule is considered restored when the threshold has not been reached this many consecutive times.
3. Name of the associated AvoidanceAction.
4. Strategy for the action. Set it to "Preview" so that the action is not actually performed.
5. Name of the metric.
6. Threshold of the metric.

```yaml
apiVersion: ensurance.crane.io/v1alpha1
kind: PodQOS
metadata:
  name: all-elastic-pods
spec:
  allowedActions: #(1)
    - disablescheduling
  labelSelector: #(2)
    matchLabels:
      preemptible_job: "true"
```

1. The action allowed to be executed on the pods associated with this PodQOS is disablescheduling.
2. Associate pods labeled preemptible_job: "true" via the label selector.

Please check the video to learn more about the disable scheduling action.

<script id="asciicast-480735" src="https://asciinema.org/a/480735.js" async></script>

### Throttle

With the following AvoidanceAction and NodeQOS defined, the throttle action is executed for the node when node CPU usage reaches the threshold.

The sample YAML looks like this:

```yaml
apiVersion: ensurance.crane.io/v1alpha1
kind: AvoidanceAction
metadata:
  name: throttle
  labels:
    app: system
spec:
  coolDownSeconds: 300
  throttle:
    cpuThrottle:
      minCPURatio: 10 #(1)
      stepCPURatio: 10 #(2)
  description: "throttle low priority pods"
```

1. The minimum ratio of the CPU quota. If a pod would be throttled below this ratio, its quota is set to this ratio instead.
2. The step of the throttle action. Each triggered avoidance reduces the CPU quota by this percentage, and each restore increases it by this percentage.

```yaml
apiVersion: ensurance.crane.io/v1alpha1
kind: NodeQOS
metadata:
  name: "watermark2"
spec:
  nodeQualityProbe:
    timeoutSeconds: 10
    nodeLocalGet:
      localCacheTTLSeconds: 60
  rules:
  - name: "cpu-usage"
    avoidanceThreshold: 2
    restoreThreshold: 2
    actionName: "throttle"
    strategy: "None"
    metricRule:
      name: "cpu_total_usage"
      value: 6000
```

```yaml title="PodQOS"
apiVersion: ensurance.crane.io/v1alpha1
kind: PodQOS
metadata:
  name: all-be-pods
spec:
  allowedActions:
    - throttle
  scopeSelector:
    matchExpressions:
      - operator: In
        scopeName: QOSClass
        values:
          - BestEffort
```

### Eviction

The following YAML shows another case: low-priority pods on the node are evicted when node CPU usage reaches the threshold.

```yaml
apiVersion: ensurance.crane.io/v1alpha1
kind: AvoidanceAction
metadata:
  name: eviction
  labels:
    app: system
spec:
  coolDownSeconds: 300
  eviction:
    terminationGracePeriodSeconds: 30 #(1)
  description: "evict low priority pods"
```

1. Duration in seconds the pod needs to terminate gracefully.

```yaml
apiVersion: ensurance.crane.io/v1alpha1
kind: NodeQOS
metadata:
  name: "watermark3"
  labels:
    app: "system"
spec:
  nodeQualityProbe:
    timeoutSeconds: 10
    nodeLocalGet:
      localCacheTTLSeconds: 60
  rules:
  - name: "cpu-usage"
    avoidanceThreshold: 2
    restoreThreshold: 2
    actionName: "eviction"
    strategy: "Preview" #(1)
    metricRule:
      name: "cpu_total_usage"
      value: 6000
```

1. Strategy for the action. "Preview" means the action is not actually performed.

```yaml title="PodQOS"
apiVersion: ensurance.crane.io/v1alpha1
kind: PodQOS
metadata:
  name: all-elastic-pods
spec:
  allowedActions:
    - eviction
  labelSelector:
    matchLabels:
      preemptible_job: "true"
```

### Supported Metrics

Name     | Description
---------|-------------
cpu_total_usage | node CPU usage
cpu_total_utilization | node CPU utilization percent
memory_total_usage | node memory usage
memory_total_utilization | node memory utilization percent

For details, please refer to the examples under examples/ensurance.

### Used with dynamic resources
To keep active avoidance operations from affecting high-priority services, for example by wrongly evicting important services,
it is recommended to use PodQOS to associate workloads that use dynamic resources, so that only workloads running on idle resources are affected when actions are executed,
ensuring the stability of the core business on the node.
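
The recommendation above could be expressed with a PodQOS along the following lines. This is only a sketch: it reuses the fields that appear in the earlier PodQOS examples, while the resource name `dynamic-resource-pods` and the assumption that workloads using dynamic resources carry the `preemptible_job: "true"` label are illustrative and not taken from Crane itself. The QOSClass filter is optional and is included only to show two selectors being combined.

```yaml
# Hypothetical example: the name and the label are assumptions for illustration.
apiVersion: ensurance.crane.io/v1alpha1
kind: PodQOS
metadata:
  name: dynamic-resource-pods   # hypothetical name
spec:
  allowedActions:               # actions that may be applied to the matched pods
    - throttle
    - eviction
  labelSelector:
    matchLabels:
      preemptible_job: "true"   # assumed label on workloads that use dynamic resources
  scopeSelector:                # optional extra filter on QOSClass
    matchExpressions:
      - operator: In
        scopeName: QOSClass
        values:
          - BestEffort
```

Because the selectors are combined with "AND", only pods that carry the label and are also BestEffort would be eligible for throttling or eviction under this sketch. Pods that match no PodQOS allowing an action are left untouched by that action.
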

For details on dynamic resources, see qos-dynamic-resource-oversold-and-limit.md.