---
title: Loki Canary
weight: 60
---
# Loki Canary

Loki Canary is a standalone app that audits the log-capturing performance of
a Grafana Loki cluster.

Loki Canary generates artificial log lines and sends them to the Loki cluster.
It then communicates with the cluster to capture metrics about those
log lines, which together describe the performance of the Loki cluster.
The information is available as Prometheus time series metrics.

Loki Canary writes a log to a file and stores the timestamp in an internal
array. The contents look something like this:

```nohighlight
1557935669096040040 ppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp
```

The relevant part of the log entry is the timestamp; the `p`s are just filler
bytes to make the size of the log configurable.

An agent (like Promtail) should be configured to read the log file and ship it
to Loki.

Meanwhile, Loki Canary opens a WebSocket connection to Loki and tails
the logs it creates. When a log is received on the WebSocket, the timestamp
in the log message is compared to the internal array.

If the received log is:

- The next in the array to be received, it is removed from the array and the
  (current time - log timestamp) is recorded in the `response_latency`
  histogram. This is the expected behavior for well-behaved logs.
- Not the next in the array to be received, it is removed from the array, the
  response time is recorded in the `response_latency` histogram, and the
  `out_of_order_entries` counter is incremented.
- Not in the array at all, it is checked against a separate list of received
  logs to either increment the `duplicate_entries` counter or the
  `unexpected_entries` counter.
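
The three cases above can be sketched in a few lines of Python. This is only an illustration of the decision logic, not the canary's actual implementation: the function name `classify`, its parameters, and the `"in_order"`-style return values are hypothetical, and it assumes that an entry found in the list of already received logs counts as a duplicate while anything else counts as unexpected.

```python
from collections import Counter


def classify(sent, seen, received_ts, now, counters, latencies):
    """Classify one timestamp received over the WebSocket (illustrative only).

    sent:      timestamps written but not yet confirmed, oldest first
    seen:      set of timestamps that were already confirmed
    counters:  Counter keyed by the metric names used in the text
    latencies: stand-in for the response_latency histogram
    """
    if sent and sent[0] == received_ts:
        # Next expected entry: record latency only.
        sent.pop(0)
        seen.add(received_ts)
        latencies.append(now - received_ts)
        return "in_order"
    if received_ts in sent:
        # Known entry, but it arrived ahead of older ones.
        sent.remove(received_ts)
        seen.add(received_ts)
        latencies.append(now - received_ts)
        counters["out_of_order_entries"] += 1
        return "out_of_order"
    if received_ts in seen:
        # Assumption: already confirmed once, so this is a duplicate.
        counters["duplicate_entries"] += 1
        return "duplicate"
    # Assumption: never written by this canary instance.
    counters["unexpected_entries"] += 1
    return "unexpected"


sent = [1, 2, 3, 4]
seen = set()
counters = Counter()
latencies = []
results = [
    classify(sent, seen, ts, now=10, counters=counters, latencies=latencies)
    for ts in [1, 3, 3, 99, 2]
]
print(results)
print(dict(counters))
```

In this run, entry `4` is still in `sent` at the end, which is the kind of entry the background timer would later treat as missing.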

In the background, Loki Canary also runs a timer which iterates through all of
the entries in the internal array. If any of the entries are older than the
duration specified by the `-wait` flag (defaulting to 60s), they are removed
from the array and the `websocket_missing_entries` counter is incremented. An
additional query is then made directly to Loki for any missing entries to
determine if they are truly missing or only missing from the WebSocket. If
missing entries are not found in the direct query, the `missing_entries` counter
is incremented.

### Additional Queries

#### Spot Check

Starting with version 1.6.0, the canary spot checks certain results over time
to make sure they are present in Loki. This is helpful for testing the transition
of in-memory logs in the ingester to the store, to make sure nothing is lost.

`-spot-check-interval` and `-spot-check-max` are used to tune this feature.
At every `-spot-check-interval`, the canary pulls a log entry from the stream
and saves it in a separate list, where it is kept until it is older than
`-spot-check-max`.

Every `-spot-check-query-rate`, Loki is queried for each entry in this list and
`loki_canary_spot_check_entries_total` is incremented. If a result
is missing, `loki_canary_spot_check_missing_entries_total` is incremented.

The defaults of `15m` for `spot-check-interval` and `4h` for `spot-check-max`
mean that after 4 hours of running, the canary will have a list of 16 entries
that it queries every minute (the default `spot-check-query-rate` is `1m`).
Be aware of the query load this can put on Loki if you have a lot of canaries.

__NOTE:__ if you are using `out-of-order-percentage` to test ingestion of out-of-order
log lines, be sure not to set the two out-of-order time range flags too far in the past.
The defaults are already enough to test this functionality properly, and setting them
too far in the past can cause issues with the spot check test.

When using `out-of-order-percentage`, you also need to use pipeline stages
in your Promtail configuration to set the timestamps correctly as the logs are pushed
to Loki. The `client/promtail/pipelines` docs have examples of how to do this.

#### Metric Test

Loki Canary runs a `count_over_time` metric query to
verify that the rate of logs being stored in Loki corresponds to the rate at which they are
created by Loki Canary.

`-metric-test-interval` and `-metric-test-range` are used to tune this feature.
By default, every `1h` the canary runs a `count_over_time` instant query against Loki
for a range of `24h`.

If the canary has not run for `-metric-test-range` (`24h`), the query range is adjusted
to the amount of time the canary has been running, so that the rate can be calculated
since the canary was started.

The canary calculates the expected count of logs for the range
(also adjusting this based on canary runtime) and compares the expected result with
the actual result returned from Loki. The _difference_ is stored as the value of
the gauge `loki_canary_metric_test_deviation`.

Some deviation is expected: deriving an expected count from the query rate and
comparing it to actual query data is imperfect
and will lead to a deviation of a few log entries.

A deviation of more than 3-4 log entries is not expected.

### Control

Loki Canary exposes two endpoints that allow dynamic suspending and resuming of the
canary process. This can be useful if you'd like to quickly disable or re-enable the
canary. To stop or start the canary, issue an HTTP GET request against the `/suspend` or
`/resume` endpoint.
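
As a rough sketch, the control endpoints could be called from Python as shown below. The helper names `control_url` and `send_control` are hypothetical, and the example assumes the endpoints are served on the same port as the metrics (`3500` by default) and that the canary is reachable under the host name `loki-canary`; adjust both to your deployment.

```python
import urllib.request

VALID_ACTIONS = ("suspend", "resume")


def control_url(host, port, action):
    """Build the URL for a control endpoint (host and port are assumptions)."""
    if action not in VALID_ACTIONS:
        raise ValueError(f"action must be one of {VALID_ACTIONS}, got {action!r}")
    return f"http://{host}:{port}/{action}"


def send_control(host, port, action, timeout=5.0):
    """Issue the HTTP GET request and return the response status code."""
    with urllib.request.urlopen(control_url(host, port, action), timeout=timeout) as resp:
        return resp.status


# Only builds the URL; call send_control(...) against a running canary.
print(control_url("loki-canary", 3500, "suspend"))
```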

## Installation

### Binary

Loki Canary is provided as a pre-compiled binary as part of the
[Loki Releases](https://github.com/grafana/loki/releases) on GitHub.

### Docker

Loki Canary is also provided as a Docker container image:

```bash
# change tag to the most recent release
$ docker pull grafana/loki-canary:2.0.0
```

### Kubernetes

To run on Kubernetes, you can do something simple like:

`kubectl run loki-canary
--image=grafana/loki-canary:latest --restart=Never --image-pull-policy=IfNotPresent
--labels=name=loki-canary -- -addr=loki:3100`

Or you can do something more complex, like deploying it as a DaemonSet. There is a
Tanka setup for this in the `production` folder, which you can import using
`jsonnet-bundler`:

```shell
jb install github.com/grafana/loki-canary/production/ksonnet/loki-canary
```

Then in your Tanka environment's `main.jsonnet` you'll want something like
this:

```jsonnet
local loki_canary = import 'loki-canary/loki-canary.libsonnet';

loki_canary {
  loki_canary_args+:: {
    addr: "loki:3100",
    port: 80,
    labelname: "instance",
    interval: "100ms",
    size: 1024,
    wait: "3m",
  },
  _config+:: {
    namespace: "default",
  }
}
```

#### Examples

Standalone Pod Implementation of loki-canary

```yaml
---
apiVersion: v1
kind: Pod
metadata:
  labels:
    app: loki-canary
    name: loki-canary
  name: loki-canary
spec:
  containers:
  - args:
    - -addr=loki:3100
    image: grafana/loki-canary:latest
    imagePullPolicy: IfNotPresent
    name: loki-canary
    resources: {}
---
apiVersion: v1
kind: Service
metadata:
  name: loki-canary
  labels:
    app: loki-canary
spec:
  type: ClusterIP
  selector:
    app: loki-canary
  ports:
  - name: metrics
    protocol: TCP
    port: 3500
    targetPort: 3500
```

DaemonSet Implementation of loki-canary

```yaml
---
kind: DaemonSet
apiVersion: apps/v1
metadata:
  labels:
    app: loki-canary
    name: loki-canary
  name: loki-canary
spec:
  # apps/v1 requires a selector that matches the pod template labels
  selector:
    matchLabels:
      app: loki-canary
  template:
    metadata:
      name: loki-canary
      labels:
        app: loki-canary
    spec:
      containers:
      - args:
        - -addr=loki:3100
        image: grafana/loki-canary:latest
        imagePullPolicy: IfNotPresent
        name: loki-canary
        resources: {}
---
apiVersion: v1
kind: Service
metadata:
  name: loki-canary
  labels:
    app: loki-canary
spec:
  type: ClusterIP
  selector:
    app: loki-canary
  ports:
  - name: metrics
    protocol: TCP
    port: 3500
    targetPort: 3500
```

### From Source

If the other options are not sufficient for your use case, you can compile
`loki-canary` yourself:

```bash
# clone the source tree and change into it
$ git clone https://github.com/grafana/loki
$ cd loki

# build the binary
$ make loki-canary

# (optionally build the container image)
$ make loki-canary-image
```

## Configuration

The address of Loki must be passed in with the `-addr` flag, and if your Loki
server uses TLS, `-tls=true` must also be provided. Note that using TLS will
cause the WebSocket connection to use `wss://` instead of `ws://`.

The `-labelname` and `-labelvalue` flags should also be provided, as these are
used by Loki Canary to filter the log stream to only process logs for the
current instance of the canary. Ensure that the values provided to the flags are
unique to each instance of Loki Canary. Grafana Labs' Tanka config
accomplishes this by passing in the pod name as the label value.
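
To make the effect of these flags concrete, the following sketch builds a WebSocket tail URL from them. It is an approximation rather than the canary's own code: the function name `tail_url` is hypothetical, the path `/loki/api/v1/tail` is Loki's HTTP tail endpoint, and the exact selector format the canary generates is assumed here to combine the stream and label flags into one LogQL stream selector.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def tail_url(addr, tls, labelname, labelvalue,
             streamname="stream", streamvalue="stdout"):
    """Build a tail URL for one canary instance (illustrative only)."""
    # -tls=true switches the WebSocket scheme from ws:// to wss://
    scheme = "wss" if tls else "ws"
    # Assumed selector shape: {<streamname>="<streamvalue>",<labelname>="<labelvalue>"}
    selector = f'{{{streamname}="{streamvalue}",{labelname}="{labelvalue}"}}'
    return f"{scheme}://{addr}/loki/api/v1/tail?" + urlencode({"query": selector})


url = tail_url("loki:3100", tls=True, labelname="name", labelvalue="loki-canary")
parts = urlsplit(url)
selector = parse_qs(parts.query)["query"][0]
print(parts.scheme, parts.netloc, parts.path)
print(selector)
```

Because the label value is part of the selector, two canaries sharing the same `-labelvalue` would each tail the other's log lines, which is why the values need to be unique per instance.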

If Loki Canary reports a high number of `unexpected_entries`, Loki Canary may
not be waiting long enough and the value for the `-wait` flag should be
increased to a larger value than 60s.

__Be aware__ of the relationship between `pruneinterval` and `interval`.
For example, with an interval of 10ms (100 logs per second) and a prune interval
of 60s, you will write 6000 logs per minute. If those logs were not received
over the WebSocket, the canary will attempt to query Loki directly to see if
they are completely lost. __However__, the query result is limited to 1000
entries, so you will not be able to retrieve all the logs even if they did make it
to Loki.

__Likewise__, if you lower the `pruneinterval`, you risk effectively causing a denial of
service, as all your canaries query for missing logs at every `pruneinterval`.

All options:

```
  -addr string
        The Loki server URL:Port, e.g. loki:3100
  -buckets int
        Number of buckets in the response_latency histogram (default 10)
  -interval duration
        Duration between log entries (default 1s)
  -labelname string
        The label name for this instance of Loki Canary to use in the log selector
        (default "name")
  -labelvalue string
        The unique label value for this instance of Loki Canary to use in the log selector
        (default "loki-canary")
  -metric-test-interval duration
        The interval the metric test query should be run (default 1h0m0s)
  -metric-test-range duration
        The range value [24h] used in the metric test instant-query. This value is truncated
        to the running time of the canary until this value is reached (default 24h0m0s)
  -out-of-order-max duration
        Maximum amount of time (in seconds) in the past an out of order entry may have as a
        timestamp. (default 60s)
  -out-of-order-min duration
        Minimum amount of time (in seconds) in the past an out of order entry may have as a
        timestamp. (default 30s)
  -out-of-order-percentage int
        Percentage (0-100) of log entries that should be sent out of order
  -pass string
        Loki password
  -port int
        Port which Loki Canary should expose metrics (default 3500)
  -pruneinterval duration
        Frequency to check sent versus received logs, and also the frequency at which queries
        for missing logs will be dispatched to Loki, and the frequency spot check queries are run
        (default 1m0s)
  -query-timeout duration
        How long to wait for a query response from Loki (default 10s)
  -size int
        Size in bytes of each log line (default 100)
  -spot-check-interval duration
        Interval that a single result will be kept from sent entries and spot-checked against
        Loki. For example, with the 15 minute default, one entry every 15 minutes will be saved,
        and then queried again every 15 minutes until the time defined by spot-check-max is
        reached (default 15m0s)
  -spot-check-max duration
        How far back to check a spot check an entry before dropping it (default 4h0m0s)
  -spot-check-query-rate duration
        Interval that Loki Canary will query Loki for the current list of all spot check entries
        (default 1m0s)
  -streamname string
        The stream name for this instance of Loki Canary to use in the log selector
        (default "stream")
  -streamvalue string
        The unique stream value for this instance of Loki Canary to use in the log selector
        (default "stdout")
  -tenant-id string
        Tenant ID to be set in X-Scope-OrgID header.
  -tls
        Does the Loki connection use TLS?
  -user string
        Loki user name
  -version
        Print this build's version information
  -wait duration
        Duration to wait for log entries before reporting them as lost (default 1m0s)
```
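
When choosing values for `-interval` and `-pruneinterval`, the arithmetic from the warning above can be checked with a small helper. This is a back-of-the-envelope sketch: the function names are hypothetical, and the limit of 1000 results is the figure quoted in this document rather than a value read from Loki.

```python
QUERY_RESULT_LIMIT = 1000  # figure quoted in the text above


def entries_per_prune_interval(interval_ms, pruneinterval_ms):
    """Number of log lines written between two prune runs."""
    return pruneinterval_ms // interval_ms


def fits_in_one_query(interval_ms, pruneinterval_ms, limit=QUERY_RESULT_LIMIT):
    """True if a single direct query could return every line of one prune window."""
    return entries_per_prune_interval(interval_ms, pruneinterval_ms) <= limit


# Example from the text: 10ms interval, 60s prune interval.
written = entries_per_prune_interval(10, 60_000)
print(written, fits_in_one_query(10, 60_000))

# Defaults from the options list: 1s interval, 1m prune interval.
default_written = entries_per_prune_interval(1_000, 60_000)
print(default_written, fits_in_one_query(1_000, 60_000))
```

With the defaults, 60 lines are written per prune window, well below the limit; with a 10ms interval, 6000 lines are written and a single direct query cannot return them all.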