github.com/gocrane/crane@v0.11.0/docs/proposals/20220402-policy-based-abnomal-detection.md

github.com/gocrane/crane@v0.11.0/docs/proposals/20220402-policy-based-abnomal-detection.md (about)

     1  # Provide a policy-based abnormal detection mechanism in crane-agent
     2  
     3  ## Table of Contents
     4  
     5  ## Summary
     6  Crane-agent is responsible for detecting abnormality on nodes and interference between running pods.
     7  
     8  Currently, such detection mechanism is fixed and quite simple.
     9  Crane-agent compares the values of some pre-defined metrics, such as node's `cpu_total_usage` and `cpu_total_utilization`, 
    10  with some thresholds periodically. If the metric value is higher the threshold for some times, say the `cpu_total_utilization`
    11  on a node is found higher than 80% in 3 consecutive detections, crane-agent thinks the node entering into an abnormal status,
    12  and will perform some further actions, such as suppressing or evicting pods with low priorities.
    13  
    14  This proposal suggests a flexible and extensible way to detect abnormality. The criteria of abnormality can be customized by 
    15  users in form of policies, and the detection process is executed in a policy decision-making way, which is offloaded to a
    16  general-purpose policy engine.
    17  
    18  ## Motivation
    19  The criteria of abnormality or interference are not that always as simple as something like a metric value is higher than a threshold. 
    20  Different users may have different QoS requirements on different applications in different environments. The rule of
    21  abnormality detection varies, and it is impossible to implement all of them in code in advance.
    22  
    23  ### Goals
    24  1. Provides an abnormality detection mechanism which can consume external metrics.
    25  2. Provides an abnormality detection mechanism in which the logic determining how to check the abnormality can be customized.
    26  3. Metrics and detection policies can be added, updated and deleted on the fly without changing the code.
    27  
    28  ### Non-Goals
    29  1. How to handle the abnormality or interference. This proposal only focuses on detection, and the subsequent action is
    30  out of scope.
    31  
    32  
    33  ## Proposal
    34  
    35  ### User Stories
    36  
    37  #### Story 1
    38  A user has a critical online application which is latency sensitive running in the cluster, and he wants to use both 
    39  the 99th percentile response time and the error code rate as the application QoS indicators. If either of these 2 indicators 
    40  deteriorates, the application is thought of being in abnormal status.
    41  
    42  
    43  #### Story 2
    44  The SRE team finds that if the node CPU utilization is more than 60%, the QoS of some latency sensitive applications 
    45  running on it are likely to decline. So they want to keep the node CPU utilization lower than 60%. 
    46  If the utilization is higher than this threshold, the BE applications should be suppressed
    47  accordingly.
    48  
    49  
    50  #### Story 3
    51  The traffic of online applications is very low at night, and the offline jobs are run during this time.
    52  Comparing with online applications, offline jobs always require more CPU resource quantities but less resource qualities. 
    53  In this case, the SRE team wants to set different node CPU load thresholds in the daytime and at night.
    54  
    55  
    56  
    57  ### Functional Requirements
    58  
    59  ### Implementation Details
    60  #### API
    61  #####NodeQOSEnsurancePolicy
    62  ```yaml
    63  apiVersion: ensurance.crane.io/v1alpha1
    64  kind: NodeQOSEnsurancePolicy
    65  metadata:
    66    name: "xxx"
    67  spec:
    68    nodeQualityProbe:
    69      prometheus:
    70        targets: ['localhost:9090']
    71        queryInterval: 60s
    72        metrics:
    73          - name: node_cpu_utilization
    74            query: 1 - avg(irate(node_cpu_seconds_total{mode="idle", instance="$nodeName"}[5m]))
    75        variables:
    76          - name: nodeName
    77            valueFrom:
    78              fieldRef:
    79                fieldPath: spec.nodeName
    80    objectiveEnsurances:
    81      - name: "ext_cpu_total_distribute"
    82        avoidanceThreshold: 2
    83        restoreThreshold: 2
    84        actionName: "disablescheduling"
    85        policy: |
    86          default abnormal = false
    87          hour := time.clock([time.now_ns(), "Local"])[0]
    88          abnormal {
    89            input.node_cpu_utilization > 0.6
    90            hour >= 7, hour < 21
    91          }
    92          abnormal {
    93            input.node_cpu_utilization > 0.8
    94            hour >= 21
    95          }
    96          abnormal {
    97            input.node_cpu_utilization > 0.8
    98            hour < 7
    99          }
   100  ```
   101  #####PodQOSEnsurancePolicy
   102  ```yaml
   103  apiVersion: ensurance.crane.io/v1alpha1
   104  kind: PodQOSEnsurancePolicy
   105  metadata:
   106    name: "xxx"
   107  spec:
   108    selector:
   109      matchLabels:
   110        app: test
   111    qualityProbe:
   112      prometheus:
   113        targets: ['localhost:9090']
   114        queryInterval: 60s
   115        metrics: 
   116          - name: test_app_p90_latency
   117            query: histogram_quantile(0.9, rate(http_request_duration_seconds_bucket{pod=~"$podName", node="$nodeName"}[1m]))
   118        variables:
   119          - name: podName
   120            valueFrom:
   121              fieldRef:
   122                fieldPath: metadata.name
   123          - name: nodeName
   124            valueFrom:
   125              fieldRef:
   126                fieldPath: spec.nodeName
   127    objectiveEnsurances:
   128      - name: "ext_cpu_total_distribute"
   129        avoidanceThreshold: 2
   130        restoreThreshold: 2
   131        actionName: "disablescheduling"
   132        policy: |
   133          abnormal if test_app_p90_latency[_].value > 0.1
   134  ```
   135  In addition to `Prometheus`, other protocols, such as `Graphite`, `InfluxDB` can also
   136  be added in the future.
   137  
   138  #### Metrics
   139  #### Built-in metrics
   140  Currently, crane-agent collects a bunch of metrics(defined in `pkg/ensurance/collector/types/types.go`, e.g.
   141  `cpu_total_usage`).
   142  These metrics are collected by `nodelocal` and `cadvisor` collectors, both of which collects metrics every 10 seconds.
   143  
   144  We call these metrics as built-in metrics. Users can use built-in metrics in the policy without explicit setting, and
   145  crane-agent will pass their values to every request to policy engine.
   146  
   147  ##### External metrics (New)
   148  Crane-agent can also get `external` metrics by querying against prometheus servers. A new `prometheus` quality probe 
   149  will be added to CRDs `PodQOSEnsurancePolicy` and `NodeQOSEnsurancePolicy` as shown in above 2 example yamls.
   150  
   151  In [PodQOSEnsurancePolicy](#PodQOSEnsurancePolicy), `.spec.nodeQualityProbe.prometheus.metrics.query` is a promQL, which
   152  may includes some node variables (prefixed with `$`). In this case, crane-agent will use its node name
   153  to replace the variable `$nodeName`.
   154  
   155  
   156  In [PodQOSEnsurancePolicy](#PodQOSEnsurancePolicy), `.spec.qualityProbe.prometheus.metrics.query` is a promQL, which
   157  may includes some pod related variables (`$nodeName`, `$podName` in this example). Crane-agent will firstly 
   158  get all pods that match the `.spec.selector.matchLabels` on its node. Say two pods are selected, and 
   159  their names are `pod-1` and `pod-2`, and the node name is `node-1`. The replaced promQL will be
   160  ```
   161  histogram_quantile(0.9, rate(http_request_duration_seconds_bucket{pod=~"pod-1|pod-2", node="node-1"}[1m]))
   162  ```
   163  And 2 query results are expected to get returned, like:
   164  ```
   165  test_app_p90_latency{pod="pod-1", ...} 0.01
   166  test_app_p90_latency{pod="pod-2", ...} 0.01
   167  ```
   168  
   169  Simply speaking, variables in promQL help crane-agent only query metrics of its own node and the pods that running on its own node.
   170  
   171  
   172  
   173  #### Embedded metrics TSDB
   174  In order to decouple the components that collect metrics and those which consume the metrics, and to
   175  make these components' logic simple, an embedded metrics TSDB will be imported into crane-agent.
   176  
   177  Prometheus-tsdb and vmstorage are two good candidates, both of which are easy to insert values and are 
   178  compatible with promQL query grammar. 
   179  
   180  Both analyzer and executor fetch metrics from the TSDB without considering where the metrics come from.
   181  
   182  ![](images/tsdb.png)
   183  
   184  #### Policy
   185  The Open Policy Agent (OPA) is an open source, general-purpose policy engine that unifies policy enforcement.
   186  Crane-agent will use it to evaluate if nodes or pods are abnormal. 
   187  
   188  The criteria for detecting abnormality is not pre-defined or hardcoded, instead, it is customized by users
   189  at runtime.
   190  
   191  A `policy` filed will be added to `ObjectiveEnsurance`, which is 
   192  a [rego](https://www.openpolicyagent.org/docs/latest/policy-language/) rule whose result is a boolean
   193  element.
   194  
   195  crane-agent will feed both the latest built-in and external metrics as input into the OPA policy engine, and OPA
   196  make decisions based on input and policies.
   197  
   198  ![](images/opa.png)
   199  
   200  A sample input is as follows:
   201  ```json
   202  {
   203    "crane": {
   204      "cpu_total_usage": 4680,
   205      ... orhter built-in mertrcs
   206    },
   207    "test_app_p90_latency": [
   208      {
   209        "labels": {
   210          "pod": "pod-1",
   211          "node": "node-1"
   212        },
   213        "value": 0.1
   214      },
   215      {
   216        "labels": {
   217          "pod": "pod-2",
   218          "node": "node-1"
   219        },
   220        "value": 0.09
   221      }
   222    ],
   223    ...
   224  }
   225  ```
   226