---
aliases:
- /alerting/
title: Alerting and Recording Rules
weight: 700
---

# Rules and the Ruler

Grafana Loki includes a component called the ruler. The ruler is responsible for continually evaluating a set of configurable queries and performing an action based on the result.

This example configuration sources rules from a local disk.

[Ruler storage](#ruler-storage) provides further details.

```yaml
ruler:
  storage:
    type: local
    local:
      directory: /tmp/rules
  rule_path: /tmp/scratch
  alertmanager_url: http://localhost
  ring:
    kvstore:
      store: inmemory
  enable_api: true
```

We support two kinds of rules: [alerting](#alerting-rules) rules and [recording](#recording-rules) rules.

## Alerting Rules

We support [Prometheus-compatible](https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/) alerting rules. From Prometheus' documentation:

> Alerting rules allow you to define alert conditions based on Prometheus expression language expressions and to send notifications about firing alerts to an external service.

Loki alerting rules are exactly the same, except they use LogQL for their expressions.

### Example

A complete example of a rules file:

```yaml
groups:
  - name: should_fire
    rules:
      - alert: HighPercentageError
        expr: |
          sum(rate({app="foo", env="production"} |= "error" [5m])) by (job)
            /
          sum(rate({app="foo", env="production"}[5m])) by (job)
            > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: High error percentage
  - name: credentials_leak
    rules:
      - alert: http-credentials-leaked
        annotations:
          message: "{{ $labels.job }} is leaking http basic auth credentials."
        expr: 'sum by (cluster, job, pod) (count_over_time({namespace="prod"} |~ "http(s?)://(\\w+):(\\w+)@" [5m]) > 0)'
        for: 10m
        labels:
          severity: critical
```

## Recording Rules

We support [Prometheus-compatible](https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/#recording-rules) recording rules. From Prometheus' documentation:

> Recording rules allow you to precompute frequently needed or computationally expensive expressions and save their result as a new set of time series.

> Querying the precomputed result will then often be much faster than executing the original expression every time it is needed. This is especially useful for dashboards, which need to query the same expression repeatedly every time they refresh.

Loki allows you to run [metric queries](../logql/metric_queries) over your logs, which means
that you can derive a numeric aggregation from your logs, like calculating the number of requests over time from your NGINX access log.

### Example

```yaml
name: NginxRules
interval: 1m
rules:
  - record: nginx:requests:rate1m
    expr: |
      sum(
        rate({container="nginx"}[1m])
      )
    labels:
      cluster: "us-central1"
```

This query (`expr`) will be executed every minute (`interval`), and its result will be stored under the metric name we have defined (`record`). The metric `nginx:requests:rate1m` can now be sent to Prometheus, where it will be stored just like any other metric.
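
Once the recorded series has reached a Prometheus-compatible store (see [Remote-Write](#remote-write) below), it can be used like any native metric in dashboards and alerts. As a minimal sketch only — the group name, threshold, and severity below are illustrative and not part of the Loki example above — a Prometheus alerting rule built on the recorded series might look like:

```yaml
# A sketch of a Prometheus (not Loki) alerting rule that consumes the
# recorded series produced by the Loki recording rule above.
groups:
  - name: nginx_recorded_metrics
    rules:
      - alert: NginxHighRequestRate
        # `nginx:requests:rate1m` is the series created by the `record` field above;
        # the threshold of 100 requests per second is purely illustrative.
        expr: sum by (cluster) (nginx:requests:rate1m) > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "NGINX request rate is unusually high in {{ $labels.cluster }}"
```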

### Remote-Write

With recording rules, you can run these metric queries continually on an interval and have the resulting metrics written to a Prometheus-compatible remote-write endpoint. In other words, they produce Prometheus metrics from log entries.

At the time of writing, these are the compatible backends:

- [Prometheus](https://prometheus.io/docs/prometheus/latest/disabled_features/#remote-write-receiver) (`>=v2.25.0`):
  Prometheus is generally a pull-based system, but since `v2.25.0` it has also allowed metrics to be written directly to it.
- [Grafana Mimir](https://grafana.com/docs/mimir/latest/operators-guide/reference-http-api/#remote-write)
- [Thanos (`Receiver`)](https://thanos.io/tip/components/receive.md/)

Here is an example remote-write configuration for sending to a local Prometheus instance:

```yaml
ruler:
  ... other settings ...

  remote_write:
    enabled: true
    client:
      url: http://localhost:9090/api/v1/write
```

Further configuration options can be found under [ruler](../configuration#ruler).

### Operations

Please refer to the [Recording Rules](../operations/recording-rules/) page.

## Use cases

The Ruler's Prometheus compatibility further accentuates the marriage between metrics and logs. For those looking to get started with metrics and alerts based on logs, or wondering why this might be useful, here are a few use cases we think fit very well.

### Black box monitoring

We don't always control the source code of applications we run. Load balancers and a myriad of other components, both open source and closed third-party, support our applications but don't expose the metrics we want; some don't expose any metrics at all. Loki's alerting and recording rules can produce metrics and alert on the state of the system, bringing these components into our observability stack by using their logs. This is an incredibly powerful way to introduce advanced observability into legacy architectures.

### Event alerting

Sometimes you want to know whether _any_ instance of something has occurred. Alerting based on logs can be a great way to handle this, such as finding examples of leaked authentication credentials:

```yaml
- name: credentials_leak
  rules:
    - alert: http-credentials-leaked
      annotations:
        message: "{{ $labels.job }} is leaking http basic auth credentials."
      expr: 'sum by (cluster, job, pod) (count_over_time({namespace="prod"} |~ "http(s?)://(\\w+):(\\w+)@" [5m]) > 0)'
      for: 10m
      labels:
        severity: critical
```

### Alerting on high-cardinality sources

Another great use case is alerting on high-cardinality sources. These are sources that are difficult or expensive to record as metrics because the potential label set is huge. A great example of this is per-tenant alerting in multi-tenanted systems like Loki. It's a common balancing act between the desire to have per-tenant metrics and the cardinality explosion that ensues (adding a single _tenant_ label to an existing Prometheus metric would increase its cardinality by the number of tenants).

Creating these alerts in LogQL is attractive because these metrics can be extracted at _query time_, meaning we don't suffer the cardinality explosion in our metrics store.
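
A rough sketch of what such a per-tenant rule could look like is shown below. The stream selector, the `tenant` label, and the thresholds are all illustrative, and the sketch assumes the tenant appears as a logfmt field in the log line:

```yaml
groups:
  - name: per_tenant_errors
    rules:
      - alert: TenantHighErrorRate
        annotations:
          message: "Tenant {{ $labels.tenant }} has an elevated error rate."
        # `tenant` and `level` are extracted at query time via `logfmt`, so they
        # never have to exist as labels in a metrics store.
        expr: |
          sum by (tenant) (
            rate({app="gateway"} | logfmt | level="error" [5m])
          ) > 10
        for: 15m
        labels:
          severity: warning
```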

> **Note:** As an example, we can use LogQL v2 to help Loki monitor _itself_, alerting us when specific tenants have queries that take longer than 10s to complete! To do so, we'd use the following query: `sum by (org_id) (rate({job="loki-prod/query-frontend"} |= "metrics.go" | logfmt | duration > 10s [1m]))`

## Interacting with the Ruler

Because the rule files are identical to Prometheus rule files, we can interact with the Loki Ruler via [`cortextool`](https://github.com/grafana/cortex-tools#rules). The CLI is in early development, but it works with both Loki and Cortex. Pass the `--backend=loki` option when using it with Loki.

> **Note:** Not all commands in cortextool currently support Loki.

> **Note:** cortextool was intended to run against multi-tenant Loki, so commands need the `--id=` flag set to the Loki instance ID, or the environment variable `CORTEX_TENANT_ID` set to the same value. If Loki is running in single-tenant mode, the required ID is `fake` (yes, we know this might seem alarming, but it's totally fine; no, it can't be changed).

An example workflow is included below:

```sh
# lint the rules.yaml file, ensuring it's valid and reformatting it if necessary
cortextool rules lint --backend=loki ./output/rules.yaml

# diff rules against the currently managed ruleset in Loki
cortextool rules diff --rule-dirs=./output --backend=loki

# ensure the remote ruleset matches your local ruleset, creating/updating/deleting remote rules which differ from your local specification
cortextool rules sync --rule-dirs=./output --backend=loki

# print the remote ruleset
cortextool rules print --backend=loki
```

There is also a [github action](https://github.com/grafana/cortex-rules-action) available for `cortextool`, so you can add it into your CI/CD pipelines!

For instance, you can sync rules on master builds via:

```yaml
name: sync-cortex-rules-and-alerts
on:
  push:
    branches:
      - master
env:
  CORTEX_ADDRESS: '<fill me in>'
  CORTEX_TENANT_ID: '<fill me in>'
  CORTEX_API_KEY: ${{ secrets.API_KEY }}
  RULES_DIR: 'output/'
jobs:
  sync-loki-alerts:
    runs-on: ubuntu-18.04
    steps:
      - name: Lint Rules
        uses: grafana/cortex-rules-action@v0.4.0
        env:
          ACTION: 'lint'
        with:
          args: --backend=loki
      - name: Diff rules
        # the `id` lets the Sync step below reference this step's output
        id: diff-rules
        uses: grafana/cortex-rules-action@v0.4.0
        env:
          ACTION: 'diff'
        with:
          args: --backend=loki
      - name: Sync rules
        if: ${{ !contains(steps.diff-rules.outputs.detailed, 'no changes detected') }}
        uses: grafana/cortex-rules-action@v0.4.0
        env:
          ACTION: 'sync'
        with:
          args: --backend=loki
      - name: Print rules
        uses: grafana/cortex-rules-action@v0.4.0
        env:
          ACTION: 'print'
```
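
Outside of CI, the same environment variables can be exported locally so the address and tenant ID don't need to be repeated on every invocation. A minimal sketch, assuming a single-tenant Loki whose ruler API is reachable on the default HTTP port (3100):

```sh
# assumed address; point this at wherever your Loki ruler API is reachable
export CORTEX_ADDRESS='http://localhost:3100'
# single-tenant Loki always uses the tenant ID `fake`
export CORTEX_TENANT_ID='fake'

cortextool rules print --backend=loki
```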

## Scheduling and best practices

One option to scale the Ruler is by scaling it horizontally. However, with multiple Ruler instances running, they need to coordinate to determine which instance will evaluate which rule. Similar to the ingesters, the Rulers establish a hash ring to divide up the responsibility of evaluating rules.

The possible configurations are listed fully in the [configuration documentation](../configuration/), but in order to shard rules across multiple Rulers, the rules API must be enabled via flag (`-ruler.enable-api`) or config file parameter. Secondly, the Ruler requires its own ring to be configured. From there the Rulers will shard and handle the division of rules automatically. Unlike ingesters, Rulers do not hand over responsibility: all rules are re-sharded randomly every time a Ruler is added to or removed from the ring.

A full sharding-enabled Ruler example is:

```yaml
ruler:
  alertmanager_url: <alertmanager_endpoint>
  enable_alertmanager_v2: true
  enable_api: true
  enable_sharding: true
  ring:
    kvstore:
      consul:
        host: consul.loki-dev.svc.cluster.local:8500
      store: consul
  rule_path: /tmp/rules
  storage:
    gcs:
      bucket_name: <loki-rules-bucket>
```

## Ruler storage

The Ruler supports five kinds of storage: azure, gcs, s3, swift, and local. Most kinds of storage work with the sharded Ruler configuration in an obvious way: configure all Rulers to use the same backend.

The local implementation reads the rule files from the local filesystem. This is a read-only backend that does not support the creation and deletion of rules through the [Ruler API](../api/#ruler). Despite reading from the local filesystem, this method can still be used in a sharded Ruler configuration if the operator takes care to load the same rules onto every Ruler. For instance, this could be accomplished by mounting a [Kubernetes ConfigMap](https://kubernetes.io/docs/concepts/configuration/configmap/) onto every Ruler pod.

A typical local configuration might look something like:
```
-ruler.storage.type=local
-ruler.storage.local.directory=/tmp/loki/rules
```

With the above configuration, the Ruler would expect the following layout:
```
/tmp/loki/rules/<tenant id>/rules1.yaml
                           /rules2.yaml
```
YAML files are expected to be [Prometheus compatible](#alerting-rules) but include LogQL expressions, as described at the beginning of this document.

## Future improvements

There are a few things coming to increase the robustness of this service. In no particular order:

- A WAL for recording rules.
- Backend metric store adapters for generated alert rule data.

## Misc Details: Metrics backends vs in-memory

Currently the Loki Ruler is decoupled from a backing Prometheus store. Generally, the results of rule evaluations, as well as the history of an alert's state, are stored as time series. Loki does not store or retrieve these, which allows it to run independently of, for example, Prometheus. As a workaround, Loki keeps a small in-memory store whose purpose is to lazy-load past evaluations when Rulers are rescheduled or resharded. In the future, Loki will support optional metrics backends, allowing these metrics to be stored for auditing and performance benefits.