---
layout: "docs"
page_title: "Monitoring Consul with Telegraf"
sidebar_current: "docs-guides-monitoring-telegraf"
description: |-
  Best practice approaches for monitoring a production Consul cluster with Telegraf
---

# Monitoring Consul with Telegraf

Consul makes available a range of metrics in various formats in order to measure the health and stability of a cluster, and to diagnose or predict potential issues.

There are a number of monitoring tools and options, but for the purposes of this guide we are going to use the [telegraf_plugin][] in conjunction with the statsd protocol supported by Consul.

You can read the full breakdown of metrics with Consul in the [telemetry documentation](/docs/agent/telemetry.html).

## Installing Telegraf

Installing Telegraf is straightforward on most Linux distributions. We recommend following the [official Telegraf installation documentation][telegraf-install].

## Configuring Telegraf

Besides acting as a statsd agent, Telegraf can collect additional metrics about the host that the Consul agent is running on. Telegraf itself ships with a wide range of [input plugins][telegraf-input-plugins] to collect data from many sources for this purpose.

We're going to enable some of the most common ones to monitor CPU, memory, disk I/O, networking, and process status, as these are useful for debugging Consul cluster issues.

The `telegraf.conf` file starts with global options:

```ini
[agent]
interval = "10s"
flush_interval = "10s"
omit_hostname = false
```

We set the default collection interval to 10 seconds and ask Telegraf to include a `host` tag in each metric.

As mentioned above, Telegraf also allows you to set additional tags on the metrics that pass through it.
In this case, we are adding tags for the server role and datacenter. We can then use these tags in Grafana to filter queries (for example, to create a dashboard showing only servers with the `consul-server` role, or only servers in the `us-east-1` datacenter).

```ini
[global_tags]
role = "consul-server"
datacenter = "us-east-1"
```

Next, we set up a statsd listener on UDP port 8125, with instructions to calculate percentile metrics and to parse DogStatsD-compatible tags, when they're sent:

```ini
[[inputs.statsd]]
protocol = "udp"
service_address = ":8125"
delete_gauges = true
delete_counters = true
delete_sets = true
delete_timings = true
percentiles = [90]
metric_separator = "_"
parse_data_dog_tags = true
allowed_pending_messages = 10000
percentile_limit = 1000
```

The full reference to all the available statsd-related options in Telegraf is [here][telegraf-statsd-input].

Now we can configure inputs for things like CPU, memory, network I/O, and disk I/O. Most of them don't require any configuration, but make sure the `interfaces` list in `inputs.net` matches the interface names you see in `ifconfig`.
```ini
[[inputs.cpu]]
percpu = true
totalcpu = true
collect_cpu_time = false

[[inputs.disk]]
# mount_points = ["/"]
# ignore_fs = ["tmpfs", "devtmpfs"]

[[inputs.diskio]]
# devices = ["sda", "sdb"]
# skip_serial_number = false

[[inputs.kernel]]
# no configuration

[[inputs.linux_sysctl_fs]]
# no configuration

[[inputs.mem]]
# no configuration

[[inputs.net]]
interfaces = ["enp0s*"]

[[inputs.netstat]]
# no configuration

[[inputs.processes]]
# no configuration

[[inputs.swap]]
# no configuration

[[inputs.system]]
# no configuration
```

Another useful plugin is the [procstat][telegraf-procstat-input] plugin, which reports metrics for processes you select:

```ini
[[inputs.procstat]]
pattern = "(consul)"
```

Telegraf even includes a [plugin][telegraf-consul-input] that monitors the health checks associated with the Consul agent, using the Consul API to query the data.

It's important to note that this plugin does not report Consul's internal telemetry; Consul already reports those stats via the statsd protocol.

```ini
[[inputs.consul]]
address = "localhost:8500"
scheme = "http"
```

## Telegraf Configuration for Consul

Asking Consul to send telemetry to Telegraf is as simple as adding a `telemetry` section to your agent configuration:

```json
{
  "telemetry": {
    "dogstatsd_addr": "localhost:8125",
    "disable_hostname": true
  }
}
```

As you can see, we only need to specify two options. The `dogstatsd_addr` option specifies the hostname and port of the statsd daemon.

Note that we specify DogStatsD format instead of plain statsd, which tells Consul to send [tags][tagging] with each metric. Tags can be used by Grafana to filter data on your dashboards (for example, displaying only the data for which `role=consul-server`).
Telegraf is compatible with the DogStatsD format and allows us to add our own tags too.

The second option tells Consul not to insert the hostname in the names of the metrics it sends to statsd, since the hostnames will be sent as tags. Without this option, the single metric `consul.raft.apply` would become multiple metrics:

    consul.server1.raft.apply
    consul.server2.raft.apply
    consul.server3.raft.apply

If you are using a different agent (e.g. Circonus, Statsite, or plain statsd), you may want to change this configuration; you can find the configuration reference [here][consul-telemetry-config].

## Visualising Telegraf Consul Metrics

There are a number of ways of consuming the information from Telegraf. Generally, it is visualised using a tool like [Grafana][] or [Chronograf][].

Here is an example Grafana dashboard:

<div class="center">
[![Grafana Consul Cluster](/assets/images/grafana-screenshot.png)](/assets/images/grafana-screenshot.png)
</div>

## Metric Aggregates and Alerting from Telegraf

### Memory usage

| Metric Name | Description |
| :---------- | :---------- |
| `mem.total` | Total amount of physical memory (RAM) available on the server. |
| `mem.used_percent` | Percentage of physical memory in use. |
| `swap.used_percent` | Percentage of swap space in use. |

**Why they're important:** Consul keeps all of its data in memory. If Consul consumes all available memory, it will crash. You should also monitor total available RAM to make sure some RAM is available for other processes, and swap usage should remain at 0% for best performance.

**What to look for:** If `mem.used_percent` is over 90%, or if `swap.used_percent` is greater than 0.
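The memory thresholds above can be expressed as a small alert rule. The sketch below is a hypothetical helper (not part of Telegraf or Consul), assuming you feed it the most recent `mem.used_percent` and `swap.used_percent` values from your metrics store:

```python
def memory_alerts(mem_used_percent, swap_used_percent):
    """Evaluate the memory thresholds described above.

    Hypothetical helper: takes the latest mem.used_percent and
    swap.used_percent values and returns a list of alert reasons.
    """
    alerts = []
    if mem_used_percent > 90:
        alerts.append("mem.used_percent above 90%")
    if swap_used_percent > 0:
        alerts.append("swap.used_percent above 0%")
    return alerts

# A healthy host produces no alerts:
print(memory_alerts(62.0, 0.0))  # → []
```

In practice you would implement the same conditions in your alerting tool of choice (e.g. Grafana alert rules); the function only illustrates the thresholds.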
### File descriptors

| Metric Name | Description |
| :---------- | :---------- |
| `linux_sysctl_fs.file-nr` | Number of file handles being used across all processes on the host. |
| `linux_sysctl_fs.file-max` | Total number of available file handles. |

**Why it's important:** Practically anything Consul does -- receiving a connection from another host, sending data between servers, writing snapshots to disk -- requires a file descriptor handle. If Consul runs out of handles, it will stop accepting connections. See [the Consul FAQ][consul_faq_fds] for more details.

By default, process and kernel limits are fairly conservative. You will want to increase these beyond the defaults.

**What to look for:** If `file-nr` exceeds 80% of `file-max`.

### CPU usage

| Metric Name | Description |
| :---------- | :---------- |
| `cpu.user_cpu` | Percentage of CPU being used by user processes (such as Consul). |
| `cpu.iowait_cpu` | Percentage of CPU time spent waiting for I/O tasks to complete. |

**Why they're important:** Consul is not particularly demanding of CPU time, but a spike in CPU usage might indicate too many operations taking place at once, and `iowait_cpu` is critical -- it means Consul is waiting for data to be written to disk, a sign that Raft might be writing snapshots to disk too often.

**What to look for:** If `cpu.iowait_cpu` is greater than 10%.

### Network activity

| Metric Name | Description |
| :---------- | :---------- |
| `net.bytes_recv` | Bytes received on each network interface. |
| `net.bytes_sent` | Bytes transmitted on each network interface. |

**Why they're important:** A sudden spike in network traffic to Consul might be the result of a misconfigured application client causing too many requests to Consul. This is the raw data from the system, rather than a specific Consul metric.
**What to look for:** Sudden large changes to the `net` metrics (greater than 50% deviation from baseline).

**NOTE:** The `net` metrics are counters, so in order to calculate rates (such as bytes/second), you will need to apply a function such as [non_negative_difference][].

### Disk activity

| Metric Name | Description |
| :---------- | :---------- |
| `diskio.read_bytes` | Bytes read from each block device. |
| `diskio.write_bytes` | Bytes written to each block device. |

**Why they're important:** If the Consul host is writing a lot of data to disk, such as under high volume workloads, there may be frequent major I/O spikes during leader elections. This is because under heavy load, Consul is checkpointing Raft snapshots to disk frequently.

It may also be caused by Consul having debug/trace logging enabled in production, which can impact performance.

Too much disk I/O can cause the rest of the system to slow down or become unavailable, as the kernel spends all its time waiting for I/O to complete.

**What to look for:** Sudden large changes to the `diskio` metrics (greater than 50% deviation from baseline, or more than 3 standard deviations from baseline).

**NOTE:** The `diskio` metrics are counters, so in order to calculate rates (such as bytes/second), you will need to apply a function such as [non_negative_difference][].
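Since both the `net` and `diskio` byte metrics are cumulative counters, rates must be derived from successive samples, and negative deltas caused by counter resets (e.g. after a reboot) must be discarded. A minimal Python sketch of that calculation, mirroring in spirit what InfluxQL's `non_negative_difference` does, assuming evenly spaced samples:

```python
def counter_rates(samples, interval_seconds):
    """Convert cumulative counter samples (e.g. net.bytes_recv) into
    per-second rates.

    Negative deltas -- which occur when a counter resets, for example
    after a host reboot -- are dropped rather than reported as rates.
    """
    rates = []
    for prev, cur in zip(samples, samples[1:]):
        delta = cur - prev
        if delta >= 0:
            rates.append(delta / interval_seconds)
    return rates

# Four samples taken 10s apart; the counter resets before the last one:
print(counter_rates([0, 500, 1500, 100], 10))  # → [50.0, 100.0]
```

In a real deployment you would let your time-series database compute this; the function only illustrates why raw counter values are not directly comparable.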
[non_negative_difference]: https://docs.influxdata.com/influxdb/v1.5/query_language/functions/#non-negative-difference
[consul_faq_fds]: https://www.consul.io/docs/faq.html#q-does-consul-require-certain-user-process-resource-limits-
[telegraf_plugin]: https://github.com/influxdata/telegraf/tree/master/plugins/inputs/consul
[telegraf-install]: https://docs.influxdata.com/telegraf/v1.6/introduction/installation/
[telegraf-consul-input]: https://github.com/influxdata/telegraf/tree/release-1.6/plugins/inputs/consul
[telegraf-statsd-input]: https://github.com/influxdata/telegraf/tree/release-1.6/plugins/inputs/statsd
[telegraf-procstat-input]: https://github.com/influxdata/telegraf/tree/release-1.6/plugins/inputs/procstat
[telegraf-input-plugins]: https://docs.influxdata.com/telegraf/v1.6/plugins/inputs/
[tagging]: https://docs.datadoghq.com/getting_started/tagging/
[consul-telemetry-config]: https://www.consul.io/docs/agent/options.html#telemetry
[consul-telemetry-ref]: https://www.consul.io/docs/agent/telemetry.html
[Grafana]: https://www.influxdata.com/partners/grafana/
[Chronograf]: https://www.influxdata.com/time-series-platform/chronograf/