---
title: Loki Canary
weight: 60
---
# Loki Canary

![canary](../canary.png)

Loki Canary is a standalone app that audits the log-capturing performance of
a Grafana Loki cluster.

Loki Canary generates artificial log lines, which are sent to the Loki cluster.
It then communicates with the cluster to capture metrics about those lines,
building a picture of how well the cluster is performing.
These metrics are exposed as Prometheus time series.

![block_diagram](../loki-canary-block.png)

Loki Canary writes a log line to a file and stores its timestamp in an internal
array. The contents look something like this:

```nohighlight
1557935669096040040 ppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp
```

The relevant part of the log entry is the timestamp; the `p`s are just filler
bytes that make the size of the log line configurable.

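As a rough illustration of that line format, the sketch below builds a line from a nanosecond timestamp followed by `p` filler. The function name `make_canary_line` is hypothetical, and whether the configured size counts the trailing newline is an assumption here, not something taken from the canary's source.

```python
import time
from typing import Optional


def make_canary_line(size: int, now_ns: Optional[int] = None) -> str:
    """Build a canary-style line: nanosecond timestamp, a space, then 'p' filler."""
    ts = str(time.time_ns() if now_ns is None else now_ns)
    # Assumption: `size` is the length of the line without a trailing newline.
    filler = "p" * max(0, size - len(ts) - 1)
    return f"{ts} {filler}"


line = make_canary_line(size=100, now_ns=1557935669096040040)
timestamp, filler = line.split(" ")
```
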
An agent (like Promtail) should be configured to read the log file and ship it
to Loki.

Meanwhile, Loki Canary opens a WebSocket connection to Loki and tails
the logs it creates. When a log is received on the WebSocket, the timestamp
in the log message is compared to the internal array.

If the received log is:

- The next in the array to be received, it is removed from the array and the
  (current time - log timestamp) is recorded in the `response_latency`
  histogram. This is the expected behavior for well-behaved logs.
- Not the next in the array to be received, it is removed from the array, the
  response time is recorded in the `response_latency` histogram, and the
  `out_of_order_entries` counter is incremented.
- Not in the array at all, it is checked against a separate list of received
  logs to increment either the `duplicate_entries` counter or the
  `unexpected_entries` counter.

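The three cases above can be sketched roughly as follows. This is a simplified model, not the canary's actual code: `handle_received`, the plain list used for the internal array, and the set used for the list of received logs are all assumptions made for illustration.

```python
from collections import Counter


def handle_received(ts, now, pending, seen, metrics, latencies):
    """Classify one timestamp read back from the WebSocket tail."""
    if ts in pending:
        if pending[0] != ts:
            # Known entry, but not the one expected next.
            metrics["out_of_order_entries"] += 1
        pending.remove(ts)
        latencies.append(now - ts)  # would feed the response_latency histogram
        seen.add(ts)
    elif ts in seen:
        # Already received once before.
        metrics["duplicate_entries"] += 1
    else:
        # Never written by this canary instance.
        metrics["unexpected_entries"] += 1


pending = [100, 200, 300]  # timestamps written but not yet received
seen, metrics, latencies = set(), Counter(), []

for ts in (100, 300, 300, 999):
    handle_received(ts, now=350, pending=pending, seen=seen,
                    metrics=metrics, latencies=latencies)
```
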
In the background, Loki Canary also runs a timer which iterates through all of
the entries in the internal array. If any of the entries are older than the
duration specified by the `-wait` flag (defaulting to 60s), they are removed
from the array and the `websocket_missing_entries` counter is incremented. An
additional query is then made directly to Loki for any missing entries to
determine if they are truly missing or only missing from the WebSocket. If
missing entries are not found in the direct query, the `missing_entries` counter
is incremented.

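A minimal sketch of that background check, under the same simplifications: `prune` and the `query_loki` callback are hypothetical names, and the callback stands in for the direct query to Loki by returning the set of timestamps it found.

```python
from collections import Counter


def prune(pending, now, wait, query_loki, metrics):
    """Drop entries older than `wait` and confirm them with a direct query."""
    expired = [ts for ts in pending if now - ts > wait]
    if not expired:
        return []
    found = query_loki(expired)  # hypothetical: timestamps Loki actually has
    for ts in expired:
        pending.remove(ts)
        metrics["websocket_missing_entries"] += 1
        if ts not in found:
            # Missing from both the WebSocket and the direct query.
            metrics["missing_entries"] += 1
    return expired


prune_metrics = Counter()
prune_pending = [10, 20, 95]
# Pretend the direct query finds entry 10 but not entry 20.
expired = prune(prune_pending, now=100, wait=60,
                query_loki=lambda wanted: {10}, metrics=prune_metrics)
```
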
### Additional Queries

#### Spot Check

Starting with version 1.6.0, the canary spot checks certain results over time
to make sure they are present in Loki. This is helpful for testing the transition
of in-memory logs in the ingester to the store, to make sure nothing is lost.

`-spot-check-interval` and `-spot-check-max` are used to tune this feature.
At every `-spot-check-interval`, the canary pulls a log entry from the stream
and saves it in a separate list, keeping each entry for up to `-spot-check-max`.

Every `-spot-check-query-rate`, Loki is queried for each entry in this list and
`loki_canary_spot_check_entries_total` is incremented. If a result
is missing, `loki_canary_spot_check_missing_entries_total` is incremented.

The defaults of `15m` for `-spot-check-interval` and `4h` for `-spot-check-max`
mean that after 4 hours of running, the canary will have a list of 16 entries
that it queries every minute (the default `-spot-check-query-rate` is `1m`),
so be aware of the query load this can put on Loki if you have a lot of canaries.

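To get a feel for that query load, the arithmetic can be written out as below. It assumes one query per list entry per round, as described above; the helper name `spot_check_queries_per_hour` is made up for this example. With the defaults it yields 16 entries per canary and 9600 queries per hour for 10 canaries.

```python
from datetime import timedelta

# Defaults quoted in the text above.
spot_check_interval = timedelta(minutes=15)
spot_check_max = timedelta(hours=4)
spot_check_query_rate = timedelta(minutes=1)

# One entry is saved per interval and kept until it is spot_check_max old.
entries_per_canary = spot_check_max // spot_check_interval


def spot_check_queries_per_hour(canaries: int) -> int:
    """Steady-state spot-check queries per hour across `canaries` instances."""
    rounds_per_hour = timedelta(hours=1) // spot_check_query_rate
    return canaries * entries_per_canary * rounds_per_hour


print(entries_per_canary)
print(spot_check_queries_per_hour(10))
```
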
__NOTE:__ If you are using `-out-of-order-percentage` to test ingestion of out-of-order
log lines, be sure not to set the two out-of-order time range flags
(`-out-of-order-min` and `-out-of-order-max`) too far in the past.
The defaults are already enough to test this functionality properly, and setting them
too far in the past can cause issues with the spot check test.

When using `-out-of-order-percentage`, you also need to use pipeline stages
in your Promtail configuration to set the timestamps correctly as the logs are pushed
to Loki. The `client/promtail/pipelines` docs have examples of how to do this.

#### Metric Test

Loki Canary runs a `count_over_time` metric query to
verify that the rate of logs being stored in Loki corresponds to the rate at which
they are being created by Loki Canary.

`-metric-test-interval` and `-metric-test-range` are used to tune this feature.
By default, every `1h` the canary runs a `count_over_time` instant query against Loki
for a range of `24h`.

If the canary has not run for `-metric-test-range` (`24h`), the query range is adjusted
to the amount of time the canary has been running, so that the rate is calculated
only over the time since the canary was started.

The canary calculates what the expected count of logs would be for the range
(also adjusting this based on canary runtime) and compares the expected result with
the actual result returned from Loki. The _difference_ is stored as the value in
the gauge `loki_canary_metric_test_deviation`.

Some deviation is expected: deriving an expected count from the configured rate
and comparing it to actual query data is imperfect, and leads to a deviation of
a few log entries.

A deviation of more than 3-4 log entries is not expected.

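The comparison described above can be approximated like this. It is only a sketch: the function names are hypothetical, the expected count is taken to be the (runtime-capped) query range divided by the log interval, and the sign convention of expected minus actual is an assumption.

```python
def expected_count(runtime_s: float, interval_s: float,
                   test_range_s: float = 24 * 3600) -> float:
    """Expected number of log lines in the query range."""
    # The range is capped at how long the canary has been running.
    query_range_s = min(runtime_s, test_range_s)
    return query_range_s / interval_s


def deviation(actual: float, runtime_s: float, interval_s: float,
              test_range_s: float = 24 * 3600) -> float:
    """Assumed convention: expected count minus the count Loki returned."""
    return expected_count(runtime_s, interval_s, test_range_s) - actual


# Canary running for one hour, one line per second, Loki reports 3598 lines.
one_hour_deviation = deviation(actual=3598, runtime_s=3600, interval_s=1.0)
```
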
### Control

Loki Canary exposes two endpoints that allow dynamic suspending and resuming of the
canary process. This can be useful if you'd like to quickly disable or re-enable the
canary. To stop or start the canary, issue an HTTP GET request against the `/suspend` or
`/resume` endpoint.

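A small helper for issuing those requests might look like the sketch below. The function names are hypothetical, and the base URL is an assumption: it presumes the control endpoints are served on the same port as the metrics (the `-port` flag, 3500 by default), which you should verify for your deployment.

```python
from urllib.request import urlopen


def control_url(base_url: str, action: str) -> str:
    """Build the URL for the /suspend or /resume endpoint."""
    if action not in ("suspend", "resume"):
        raise ValueError(f"unknown action: {action!r}")
    return f"{base_url.rstrip('/')}/{action}"


def send_control(base_url: str, action: str, timeout: float = 5.0) -> int:
    """Issue the HTTP GET request and return the response status code."""
    with urlopen(control_url(base_url, action), timeout=timeout) as resp:
        return resp.status
```

For example, `send_control("http://loki-canary:3500", "suspend")` would pause a canary reachable at that (assumed) address, and the same call with `"resume"` would start it again.
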
## Installation

### Binary

Loki Canary is provided as a pre-compiled binary as part of the
[Loki Releases](https://github.com/grafana/loki/releases) on GitHub.

### Docker

Loki Canary is also provided as a Docker container image:

```bash
# change tag to the most recent release
$ docker pull grafana/loki-canary:2.0.0
```

### Kubernetes

To run on Kubernetes, you can do something simple like:

`kubectl run loki-canary --generator=run-pod/v1
--image=grafana/loki-canary:latest --restart=Never --image-pull-policy=IfNotPresent
--labels=name=loki-canary -- -addr=loki:3100`

Or you can do something more complex, like deploying it as a DaemonSet. There is a
Tanka setup for this in the `production` folder, which you can import using
`jsonnet-bundler`:

```shell
jb install github.com/grafana/loki/production/ksonnet/loki-canary
```

Then in your Tanka environment's `main.jsonnet` you'll want something like
this:

```jsonnet
local loki_canary = import 'loki-canary/loki-canary.libsonnet';

loki_canary {
  loki_canary_args+:: {
    addr: "loki:3100",
    port: 80,
    labelname: "instance",
    interval: "100ms",
    size: 1024,
    wait: "3m",
  },
  _config+:: {
    namespace: "default",
  }
}
```

#### Examples

Standalone Pod implementation of loki-canary:

```yaml
---
apiVersion: v1
kind: Pod
metadata:
  labels:
    app: loki-canary
    name: loki-canary
  name: loki-canary
spec:
  containers:
  - args:
    - -addr=loki:3100
    image: grafana/loki-canary:latest
    imagePullPolicy: IfNotPresent
    name: loki-canary
    resources: {}
---
apiVersion: v1
kind: Service
metadata:
  name: loki-canary
  labels:
    app: loki-canary
spec:
  type: ClusterIP
  selector:
    app: loki-canary
  ports:
  - name: metrics
    protocol: TCP
    port: 3500
    targetPort: 3500
```

DaemonSet implementation of loki-canary:

```yaml
---
kind: DaemonSet
apiVersion: apps/v1
metadata:
  labels:
    app: loki-canary
    name: loki-canary
  name: loki-canary
spec:
  # apps/v1 requires a selector that matches the pod template labels
  selector:
    matchLabels:
      app: loki-canary
  template:
    metadata:
      name: loki-canary
      labels:
        app: loki-canary
    spec:
      containers:
      - args:
        - -addr=loki:3100
        image: grafana/loki-canary:latest
        imagePullPolicy: IfNotPresent
        name: loki-canary
        resources: {}
---
apiVersion: v1
kind: Service
metadata:
  name: loki-canary
  labels:
    app: loki-canary
spec:
  type: ClusterIP
  selector:
    app: loki-canary
  ports:
  - name: metrics
    protocol: TCP
    port: 3500
    targetPort: 3500
```

### From Source

If the other options are not sufficient for your use case, you can compile
`loki-canary` yourself:

```bash
# clone the source tree and enter it
$ git clone https://github.com/grafana/loki
$ cd loki

# build the binary
$ make loki-canary

# (optionally build the container image)
$ make loki-canary-image
```

## Configuration

The address of Loki must be passed in with the `-addr` flag, and if your Loki
server uses TLS, `-tls=true` must also be provided. Note that using TLS will
cause the WebSocket connection to use `wss://` instead of `ws://`.

The `-labelname` and `-labelvalue` flags should also be provided, as these are
used by Loki Canary to filter the log stream to only process logs for the
current instance of the canary. Ensure that the values provided to the flags are
unique to each instance of Loki Canary. Grafana Labs' Tanka config
accomplishes this by passing in the pod name as the label value.

If Loki Canary reports a high number of `unexpected_entries`, Loki Canary may
not be waiting long enough and the value for the `-wait` flag should be
increased to a larger value than 60s.

__Be aware__ of the relationship between `-pruneinterval` and `-interval`.
For example, with an interval of 10ms (100 logs per second) and a prune interval
of 60s, you will write 6000 logs per minute. If those logs were not received
over the WebSocket, the canary will attempt to query Loki directly to see if
they are completely lost. __However__, the query result is limited to 1000
entries, so you will not be able to retrieve all the logs even if they did make it
to Loki.

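The numbers in that example can be checked with a few lines of arithmetic. The helper name `logs_per_prune` and the constant `QUERY_LIMIT` are made up for this sketch; the 1000-entry limit is the one stated above.

```python
QUERY_LIMIT = 1000  # maximum entries returned by the direct query, per the text


def logs_per_prune(interval_ms: int, prune_interval_s: int) -> int:
    """Number of log lines written during one prune interval."""
    return prune_interval_s * 1000 // interval_ms


written = logs_per_prune(interval_ms=10, prune_interval_s=60)
# Lines that could never be confirmed by a single direct query.
unverifiable = max(0, written - QUERY_LIMIT)
```
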
__Likewise__, if you lower the `-pruneinterval`, you risk causing a denial of
service as all your canaries attempt to query for missing logs at
whatever interval `-pruneinterval` is set to.

All options:

```
  -addr string
        The Loki server URL:Port, e.g. loki:3100
  -buckets int
        Number of buckets in the response_latency histogram (default 10)
  -interval duration
        Duration between log entries (default 1s)
  -labelname string
        The label name for this instance of Loki Canary to use in the log selector
        (default "name")
  -labelvalue string
        The unique label value for this instance of Loki Canary to use in the log selector
        (default "loki-canary")
  -metric-test-interval duration
        The interval the metric test query should be run (default 1h0m0s)
  -metric-test-range duration
        The range value [24h] used in the metric test instant-query. This value is truncated
        to the running time of the canary until this value is reached (default 24h0m0s)
  -out-of-order-max duration
        Maximum amount of time (in seconds) in the past an out of order entry may have as a
        timestamp. (default 60s)
  -out-of-order-min duration
        Minimum amount of time (in seconds) in the past an out of order entry may have as a
        timestamp. (default 30s)
  -out-of-order-percentage int
        Percentage (0-100) of log entries that should be sent out of order
  -pass string
        Loki password
  -port int
        Port which Loki Canary should expose metrics (default 3500)
  -pruneinterval duration
        Frequency to check sent versus received logs, and also the frequency at which queries
        for missing logs will be dispatched to Loki, and the frequency spot check queries are run
        (default 1m0s)
  -query-timeout duration
        How long to wait for a query response from Loki (default 10s)
  -size int
        Size in bytes of each log line (default 100)
  -spot-check-interval duration
        Interval that a single result will be kept from sent entries and spot-checked against
        Loki. For example, with the 15 minute default, one entry every 15 minutes will be saved,
        and then queried again every 15 minutes until the time defined by spot-check-max is
        reached (default 15m0s)
  -spot-check-max duration
        How far back to check a spot check an entry before dropping it (default 4h0m0s)
  -spot-check-query-rate duration
        Interval that Loki Canary will query Loki for the current list of all spot check entries
        (default 1m0s)
  -streamname string
        The stream name for this instance of Loki Canary to use in the log selector
        (default "stream")
  -streamvalue string
        The unique stream value for this instance of Loki Canary to use in the log selector
        (default "stdout")
  -tenant-id string
        Tenant ID to be set in X-Scope-OrgID header.
  -tls
        Does the Loki connection use TLS?
  -user string
        Loki user name
  -version
        Print this build's version information
  -wait duration
        Duration to wait for log entries before reporting them as lost (default 1m0s)
```