---
title: Best practices
weight: 400
---
# Grafana Loki label best practices

Grafana Loki is under active development, and we are constantly working to improve performance. But here are some of the current best practices for labels that will give you the best experience with Loki.

## Static labels are good

Things like host, application, and environment are great labels. They will be fixed for a given system/app and have bounded values. Use static labels to make it easier to query your logs in a logical sense (e.g. show me all the logs for a given application and specific environment, or show me all the logs for all the apps on a specific host).

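A couple of hypothetical stream selectors illustrate the idea; the label names and values here are only examples, not a recommended schema:

```
{app="checkout-service", environment="production"}
{host="web-01"}
```
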
## Use dynamic labels sparingly

Too many label value combinations lead to too many streams. The penalties for that in Loki are a large index and small chunks in the store, which in turn can actually reduce performance.

To avoid those issues, don't add a label for something until you know you need it! Use filter expressions (`|= "text"`, `|~ "regex"`, …) and brute force those logs. It works -- and it's fast.

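For example, both of the following brute-force a single stream with filter expressions; the label and the strings being matched are illustrative:

```
{app="checkout-service"} |= "traceID=2e5f0e38"
{app="checkout-service"} |~ "error|timeout"
```
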
From early on, we have set a label dynamically using Promtail pipelines for `level`. This seemed intuitive for us as we often wanted to only show logs for `level="error"`; however, we are re-evaluating this now, as writing a query like `{app="loki"} |= "level=error"` is proving to be just as fast for many of our applications as `{app="loki",level="error"}`.

This may seem surprising, but if applications have medium to low volume, that label causes one application's logs to be split into up to five streams, which means 5x chunks being stored. And loading chunks has an overhead associated with it. Imagine now if that query were `{app="loki",level!="debug"}`. That would have to load **way** more chunks than `{app="loki"} != "level=debug"`.

Above, we mentioned not to add labels until you _need_ them, so when would you _need_ labels? A little farther down is a section on `chunk_target_size`. If you set this to 1MB (which is reasonable), Loki will try to cut chunks at 1MB compressed size, which is about 5MB of uncompressed logs (it might be as much as 10MB depending on compression). If your logs have sufficient volume to write 5MB in less time than `max_chunk_age`, or **many** chunks in that timeframe, you might want to consider splitting them into separate streams with a dynamic label.

What you want to avoid is splitting a log file into streams, which results in chunks getting flushed because the stream is idle or hits the max age before being full. As of [Loki 1.4.0](https://grafana.com/blog/2020/04/01/loki-v1.4.0-released-with-query-statistics-and-up-to-300x-regex-optimization/), there is a metric which can help you understand why chunks are flushed: `sum by (reason) (rate(loki_ingester_chunks_flushed_total{cluster="dev"}[1m]))`.

It’s not critical that every chunk be full when flushed, but it will improve many aspects of operation. As such, our current guidance here is to avoid dynamic labels as much as possible and instead favor filter expressions. For example, don’t add a `level` dynamic label; just use `|= "level=debug"` instead.

## Label values must always be bounded

If you are dynamically setting labels, never use a label which can have unbounded or infinite values. This will always result in big problems for Loki.

Try to keep values bounded to as small a set as possible. We don't have perfect guidance as to what Loki can handle, but think single digits, or maybe tens of values for a dynamic label. This is less critical for static labels. For example, if you have 1,000 hosts in your environment it's going to be just fine to have a host label with 1,000 values.

## Be aware of dynamic labels applied by clients

Loki has several client options: [Promtail](https://github.com/grafana/loki/tree/master/docs/sources/clients/promtail) (which also supports systemd journal ingestion and TCP-based syslog ingestion), [Fluentd](https://github.com/grafana/loki/tree/main/clients/cmd/fluentd), [Fluent Bit](https://github.com/grafana/loki/tree/main/clients/cmd/fluent-bit), a [Docker plugin](https://grafana.com/blog/2019/07/15/lokis-path-to-ga-docker-logging-driver-plugin-support-for-systemd/), and more!

Each of these comes with ways to configure what labels are applied to create log streams. But be aware of what dynamic labels might be applied.
Use the Loki series API to get an idea of what your log streams look like and see if there might be ways to reduce streams and cardinality.
Series information can be queried through the [Series API](https://grafana.com/docs/loki/latest/api/#series), or you can use [logcli](https://grafana.com/docs/loki/latest/getting-started/logcli/).

In Loki 1.6.0 and newer, the `logcli series` command added the `--analyze-labels` flag, which is specifically for debugging high cardinality labels.

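An invocation along these lines (the matcher here is only an illustrative assumption; use one that selects the streams you care about) produces a summary like the output shown below:

```
logcli series --analyze-labels '{job="syslog"}'
```
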
```
Total Streams:  25017
Unique Labels:  8

Label Name  Unique Values  Found In Streams
requestId   24653          24979
logStream   1194           25016
logGroup    140            25016
accountId   13             25016
logger      1              25017
source      1              25016
transport   1              25017
format      1              25017
```

In this example you can see that the `requestId` label had 24653 different values out of the 24979 streams it was found in; this is bad!

This is a perfect example of something which should not be a label: `requestId` should be removed as a label, and filter expressions should be used instead to query logs for a specific `requestId`. For example, if `requestId` is found in the log line as a key=value pair, you could write a query like this: `{logGroup="group1"} |= "requestId=32422355"`

## Configure caching

Loki can cache data at many levels, which can drastically improve performance. Details of this will be in a future post.

## Time ordering of logs

Loki [accepts out-of-order writes](../configuration/#accept-out-of-order-writes) _by default_.
This section identifies best practices when Loki is _not_ configured to accept out-of-order writes.

One issue many people have with Loki is their client receiving errors for out of order log entries. This happens because of this hard and fast rule within Loki:

- For any single log stream, logs must always be sent in increasing time order. If a log is received with a timestamp older than the most recent log received for that stream, that log will be dropped.

There are a few things to dissect from that statement. The first is that this restriction is per stream. Let’s look at an example:

```
{job="syslog"} 00:00:00 i'm a syslog!
{job="syslog"} 00:00:01 i'm a syslog!
```

If Loki received these two lines which are for the same stream, everything would be fine. But what about this case:

```
{job="syslog"} 00:00:00 i'm a syslog!
{job="syslog"} 00:00:02 i'm a syslog!
{job="syslog"} 00:00:01 i'm a syslog!  <- Rejected out of order!
```

What can we do about this? What if this was because the sources of these logs were different systems? We can solve this with an additional label which is unique per system:

```
{job="syslog", instance="host1"} 00:00:00 i'm a syslog!
{job="syslog", instance="host1"} 00:00:02 i'm a syslog!
{job="syslog", instance="host2"} 00:00:01 i'm a syslog!  <- Accepted, this is a new stream!
{job="syslog", instance="host1"} 00:00:03 i'm a syslog!  <- Accepted, still in order for stream 1
{job="syslog", instance="host2"} 00:00:02 i'm a syslog!  <- Accepted, still in order for stream 2
```

But what if the application itself generated logs that were out of order? Well, I'm afraid this is a problem. If you are extracting the timestamp from the log line with something like [the Promtail pipeline stage](https://grafana.com/docs/loki/latest/clients/promtail/stages/timestamp/), you could instead _not_ do this and let Promtail assign a timestamp to the log lines. Or you can hopefully fix it in the application itself.

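For illustration only, a Promtail scrape config along these lines (the job name, log path, regex, and timestamp format are assumptions about what your pipeline might look like) extracts the timestamp from the line; removing the `timestamp` stage lets Promtail assign the read time instead:

```
scrape_configs:
  - job_name: app
    static_configs:
      - targets: [localhost]
        labels:
          job: app
          __path__: /var/log/app/*.log
    pipeline_stages:
      # Pull a leading timestamp out of the log line into a named capture group.
      - regex:
          expression: '^(?P<ts>\S+) '
      # Use that value as the entry's timestamp; delete this stage to fall back
      # to the time Promtail read the line.
      - timestamp:
          source: ts
          format: RFC3339
```
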
It's also worth noting that the batching nature of the Loki push API can lead to some instances of out of order errors being received which are really false positives. (For example, if a batch partially succeeds and is retried, the entries that were already accepted will return out of order errors, while anything new will be accepted.)

## Use `chunk_target_size`

Using `chunk_target_size` instructs Loki to try to fill all chunks to a target _compressed_ size of 1.5MB. These larger chunks are more efficient for Loki to process.

Other configuration variables affect how full a chunk can get. Loki has a default `max_chunk_age` of 1h and `chunk_idle_period` of 30m to limit the amount of memory used as well as the exposure of lost logs if the process crashes.

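As a rough sketch, these settings live under the `ingester` block of the Loki configuration; the values below simply mirror the numbers mentioned above (1.5MB is 1572864 bytes):

```
ingester:
  chunk_target_size: 1572864   # target compressed chunk size in bytes (~1.5MB)
  max_chunk_age: 1h            # flush a chunk once it reaches this age, full or not
  chunk_idle_period: 30m       # flush a chunk whose stream has been idle this long
```
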
Depending on the compression used (we have been using snappy, which has lower compressibility but faster performance), you need 5-10x or 7.5-15MB of raw log data to fill a 1.5MB chunk. Remembering that a chunk is per stream, the more streams you break up your log files into, the more chunks sit in memory, and the higher the likelihood they get flushed by hitting one of those timeouts mentioned above before they are filled.

Lots of small, unfilled chunks negatively affect Loki. We are always working to improve this and may consider a compactor to help in some situations. But, in general, the guidance should stay about the same: try your best to fill chunks.

If you have an application that can log fast enough to fill these chunks quickly (much less than `max_chunk_age`), then it becomes more reasonable to use dynamic labels to break that up into separate streams.

## Use `-print-config-stderr` or `-log-config-reverse-order`

Loki and Promtail have flags which will dump the entire config object to stderr or the log file when they start.

`-print-config-stderr` works well when invoking Loki from the command line, as you can get a quick output of the entire Loki configuration.

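For example (the config file path here is just a placeholder for wherever your configuration lives):

```
loki -config.file=/etc/loki/config.yaml -print-config-stderr
```
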
`-log-config-reverse-order` is the flag we run Loki with in all our environments. The configuration entries are reversed, so that the order of the configuration reads correctly top to bottom when viewed in Grafana's Explore.