---
aliases:
- /alerting/
title: Alerting and Recording Rules
weight: 700
---

# Rules and the Ruler

Grafana Loki includes a component called the ruler. The ruler is responsible for continually evaluating a set of configurable queries and performing an action based on the result.

This example configuration sources rules from a local disk.

[Ruler storage](#ruler-storage) provides further details.

```yaml
ruler:
  storage:
    type: local
    local:
      directory: /tmp/rules
  rule_path: /tmp/scratch
  alertmanager_url: http://localhost
  ring:
    kvstore:
      store: inmemory
  enable_api: true
```

We support two kinds of rules: [alerting](#alerting-rules) rules and [recording](#recording-rules) rules.

## Alerting Rules

We support [Prometheus-compatible](https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/) alerting rules. From Prometheus' documentation:

> Alerting rules allow you to define alert conditions based on Prometheus expression language expressions and to send notifications about firing alerts to an external service.

Loki alerting rules are exactly the same, except they use LogQL for their expressions.

### Example

A complete example of a rules file:

```yaml
groups:
  - name: should_fire
    rules:
      - alert: HighPercentageError
        expr: |
          sum(rate({app="foo", env="production"} |= "error" [5m])) by (job)
            /
          sum(rate({app="foo", env="production"}[5m])) by (job)
            > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: High error percentage
  - name: credentials_leak
    rules:
      - alert: http-credentials-leaked
        annotations:
          message: "{{ $labels.job }} is leaking http basic auth credentials."
        expr: 'sum by (cluster, job, pod) (count_over_time({namespace="prod"} |~ "http(s?)://(\\w+):(\\w+)@" [5m]) > 0)'
        for: 10m
        labels:
          severity: critical
```

## Recording Rules

We support [Prometheus-compatible](https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/#recording-rules) recording rules. From Prometheus' documentation:

> Recording rules allow you to precompute frequently needed or computationally expensive expressions and save their result as a new set of time series.

> Querying the precomputed result will then often be much faster than executing the original expression every time it is needed. This is especially useful for dashboards, which need to query the same expression repeatedly every time they refresh.

Loki allows you to run [metric queries](../logql/metric_queries) over your logs, which means
that you can derive a numeric aggregation from your logs, like calculating the number of requests over time from your NGINX access log.

### Example

```yaml
name: NginxRules
interval: 1m
rules:
  - record: nginx:requests:rate1m
    expr: |
      sum(
        rate({container="nginx"}[1m])
      )
    labels:
      cluster: "us-central1"
```

This query (`expr`) will be executed every minute (`interval`), and its result will be stored under the metric name we have defined (`record`). The metric `nginx:requests:rate1m` can now be sent to Prometheus, where it will be stored just like any other metric.
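
Once the recorded series has reached a Prometheus-compatible store (see [Remote-Write](#remote-write) below), it can be used like any native metric in dashboards and alerts. As a minimal sketch only — the group name, threshold, and severity below are illustrative and not part of the Loki example above — a Prometheus alerting rule built on the recorded series might look like:

```yaml
# A sketch of a Prometheus (not Loki) alerting rule that consumes the
# recorded series produced by the Loki recording rule above.
groups:
  - name: nginx_recorded_metrics
    rules:
      - alert: NginxHighRequestRate
        # `nginx:requests:rate1m` is the series created by the `record` field above;
        # the threshold of 100 requests per second is purely illustrative.
        expr: sum by (cluster) (nginx:requests:rate1m) > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "NGINX request rate is unusually high in {{ $labels.cluster }}"
```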

### Remote-Write

With recording rules, you can run these metric queries continually on an interval and have the resulting metrics written to a Prometheus-compatible remote-write endpoint. In other words, they produce Prometheus metrics from log entries.

At the time of writing, these are the compatible backends:

- [Prometheus](https://prometheus.io/docs/prometheus/latest/disabled_features/#remote-write-receiver) (`>=v2.25.0`):
  Prometheus is generally a pull-based system, but since `v2.25.0` it has also allowed metrics to be written directly to it.
- [Grafana Mimir](https://grafana.com/docs/mimir/latest/operators-guide/reference-http-api/#remote-write)
- [Thanos (`Receiver`)](https://thanos.io/tip/components/receive.md/)

Here is an example remote-write configuration for sending to a local Prometheus instance:

```yaml
ruler:
  ... other settings ...

  remote_write:
    enabled: true
    client:
      url: http://localhost:9090/api/v1/write
```

Further configuration options can be found under [ruler](../configuration#ruler).

### Operations

Please refer to the [Recording Rules](../operations/recording-rules/) page.

## Use cases

The Ruler's Prometheus compatibility further accentuates the marriage between metrics and logs. For those looking to get started with metrics and alerts based on logs, or wondering why this might be useful, here are a few use cases we think fit very well.

### Black box monitoring

We don't always control the source code of applications we run. Load balancers and a myriad of other components, both open source and closed third-party, support our applications but don't expose the metrics we want; some don't expose any metrics at all. Loki's alerting and recording rules can produce metrics and alert on the state of the system, bringing these components into our observability stack by using their logs. This is an incredibly powerful way to introduce advanced observability into legacy architectures.

### Event alerting

Sometimes you want to know whether _any_ instance of something has occurred. Alerting based on logs can be a great way to handle this, such as finding examples of leaked authentication credentials:

```yaml
- name: credentials_leak
  rules:
    - alert: http-credentials-leaked
      annotations:
        message: "{{ $labels.job }} is leaking http basic auth credentials."
      expr: 'sum by (cluster, job, pod) (count_over_time({namespace="prod"} |~ "http(s?)://(\\w+):(\\w+)@" [5m]) > 0)'
      for: 10m
      labels:
        severity: critical
```

### Alerting on high-cardinality sources

Another great use case is alerting on high-cardinality sources. These are sources that are difficult or expensive to record as metrics because the potential label set is huge. A great example of this is per-tenant alerting in multi-tenanted systems like Loki. It's a common balancing act between the desire to have per-tenant metrics and the cardinality explosion that ensues (adding a single _tenant_ label to an existing Prometheus metric would increase its cardinality by the number of tenants).

Creating these alerts in LogQL is attractive because these metrics can be extracted at _query time_, meaning we don't suffer the cardinality explosion in our metrics store.
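
A rough sketch of what such a per-tenant rule could look like is shown below. The stream selector, the `tenant` label, and the thresholds are all illustrative, and the sketch assumes the tenant appears as a logfmt field in the log line:

```yaml
groups:
  - name: per_tenant_errors
    rules:
      - alert: TenantHighErrorRate
        annotations:
          message: "Tenant {{ $labels.tenant }} has an elevated error rate."
        # `tenant` and `level` are extracted at query time via `logfmt`, so they
        # never have to exist as labels in a metrics store.
        expr: |
          sum by (tenant) (
            rate({app="gateway"} | logfmt | level="error" [5m])
          ) > 10
        for: 15m
        labels:
          severity: warning
```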

> **Note:** As an example, we can use LogQL v2 to help Loki monitor _itself_, alerting us when specific tenants have queries that take longer than 10s to complete! To do so, we'd use the following query: `sum by (org_id) (rate({job="loki-prod/query-frontend"} |= "metrics.go" | logfmt | duration > 10s [1m]))`

## Interacting with the Ruler

Because the rule files are identical to Prometheus rule files, we can interact with the Loki Ruler via [`cortextool`](https://github.com/grafana/cortex-tools#rules). The CLI is in early development, but it works with both Loki and Cortex. Pass the `--backend=loki` option when using it with Loki.

> **Note:** Not all commands in cortextool currently support Loki.

> **Note:** cortextool was intended to run against multi-tenant Loki, so commands need the `--id=` flag set to the Loki instance ID, or the environment variable `CORTEX_TENANT_ID` set to the same value. If Loki is running in single-tenant mode, the required ID is `fake` (yes, we know this might seem alarming, but it's totally fine; no, it can't be changed).

An example workflow is included below:

```sh
# lint the rules.yaml file, ensuring it's valid and reformatting it if necessary
cortextool rules lint --backend=loki ./output/rules.yaml

# diff rules against the currently managed ruleset in Loki
cortextool rules diff --rule-dirs=./output --backend=loki

# ensure the remote ruleset matches your local ruleset, creating/updating/deleting remote rules which differ from your local specification
cortextool rules sync --rule-dirs=./output --backend=loki

# print the remote ruleset
cortextool rules print --backend=loki
```

There is also a [github action](https://github.com/grafana/cortex-rules-action) available for `cortextool`, so you can add it into your CI/CD pipelines!

For instance, you can sync rules on master builds via:

```yaml
name: sync-cortex-rules-and-alerts
on:
  push:
    branches:
      - master
env:
  CORTEX_ADDRESS: '<fill me in>'
  CORTEX_TENANT_ID: '<fill me in>'
  CORTEX_API_KEY: ${{ secrets.API_KEY }}
  RULES_DIR: 'output/'
jobs:
  sync-loki-alerts:
    runs-on: ubuntu-18.04
    steps:
      - name: Lint Rules
        uses: grafana/cortex-rules-action@v0.4.0
        env:
          ACTION: 'lint'
        with:
          args: --backend=loki
      - name: Diff rules
        # the `id` lets the Sync step below reference this step's output
        id: diff-rules
        uses: grafana/cortex-rules-action@v0.4.0
        env:
          ACTION: 'diff'
        with:
          args: --backend=loki
      - name: Sync rules
        if: ${{ !contains(steps.diff-rules.outputs.detailed, 'no changes detected') }}
        uses: grafana/cortex-rules-action@v0.4.0
        env:
          ACTION: 'sync'
        with:
          args: --backend=loki
      - name: Print rules
        uses: grafana/cortex-rules-action@v0.4.0
        env:
          ACTION: 'print'
```
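
Outside of CI, the same environment variables can be exported locally so the address and tenant ID don't need to be repeated on every invocation. A minimal sketch, assuming a single-tenant Loki whose ruler API is reachable on the default HTTP port (3100):

```sh
# assumed address; point this at wherever your Loki ruler API is reachable
export CORTEX_ADDRESS='http://localhost:3100'
# single-tenant Loki always uses the tenant ID `fake`
export CORTEX_TENANT_ID='fake'

cortextool rules print --backend=loki
```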

## Scheduling and best practices

One option to scale the Ruler is by scaling it horizontally. However, with multiple Ruler instances running, they need to coordinate to determine which instance will evaluate which rule. Similar to the ingesters, the Rulers establish a hash ring to divide up the responsibility of evaluating rules.

The possible configurations are listed fully in the [configuration documentation](../configuration/), but in order to shard rules across multiple Rulers, the rules API must be enabled via flag (`-ruler.enable-api`) or config file parameter. Secondly, the Ruler requires its own ring to be configured. From there the Rulers will shard and handle the division of rules automatically. Unlike ingesters, Rulers do not hand over responsibility: all rules are re-sharded randomly every time a Ruler is added to or removed from the ring.

A full sharding-enabled Ruler example is:

```yaml
ruler:
  alertmanager_url: <alertmanager_endpoint>
  enable_alertmanager_v2: true
  enable_api: true
  enable_sharding: true
  ring:
    kvstore:
      consul:
        host: consul.loki-dev.svc.cluster.local:8500
      store: consul
  rule_path: /tmp/rules
  storage:
    gcs:
      bucket_name: <loki-rules-bucket>
```

## Ruler storage

The Ruler supports five kinds of storage: azure, gcs, s3, swift, and local. Most kinds of storage work with the sharded Ruler configuration in an obvious way: configure all Rulers to use the same backend.

The local implementation reads the rule files from the local filesystem. This is a read-only backend that does not support the creation and deletion of rules through the [Ruler API](../api/#ruler). Despite reading from the local filesystem, this method can still be used in a sharded Ruler configuration if the operator takes care to load the same rules onto every Ruler. For instance, this could be accomplished by mounting a [Kubernetes ConfigMap](https://kubernetes.io/docs/concepts/configuration/configmap/) onto every Ruler pod.

A typical local configuration might look something like:
```
-ruler.storage.type=local
-ruler.storage.local.directory=/tmp/loki/rules
```

With the above configuration, the Ruler would expect the following layout:
```
/tmp/loki/rules/<tenant id>/rules1.yaml
                           /rules2.yaml
```
YAML files are expected to be [Prometheus compatible](#alerting-rules) but include LogQL expressions, as described at the beginning of this document.

## Future improvements

There are a few things coming to increase the robustness of this service. In no particular order:

- A WAL for recording rules.
- Backend metric store adapters for generated alert rule data.

## Misc Details: Metrics backends vs in-memory

Currently the Loki Ruler is decoupled from a backing Prometheus store. Generally, the results of rule evaluations, as well as the history of an alert's state, are stored as time series. Loki does not store or retrieve these, which allows it to run independently of, for example, Prometheus. As a workaround, Loki keeps a small in-memory store whose purpose is to lazy-load past evaluations when Rulers are rescheduled or resharded. In the future, Loki will support optional metrics backends, allowing these metrics to be stored for auditing and performance benefits.