---
layout: "docs"
page_title: "Monitoring Consul with Telegraf"
sidebar_current: "docs-guides-monitoring-telegraf"
description: |-
  Best practice approaches for monitoring a production Consul cluster with Telegraf
---

# Monitoring Consul with Telegraf

Consul exposes a range of metrics in various formats so operators can
measure the health and stability of a cluster, and diagnose or predict
potential issues.

There are a number of monitoring tools and options available, but for the
purposes of this guide we are going to use the [telegraf_plugin][] in
conjunction with the StatsD protocol supported by Consul.

You can read the full list of metrics available with Consul in the [telemetry
documentation](/docs/agent/telemetry.html).

In this guide you will:

- Configure Telegraf to collect StatsD and host-level metrics
- Configure Consul to send metrics to Telegraf
- See an example of metrics visualization
- Understand important metrics to aggregate and alert on

## Installing Telegraf

The process for installing Telegraf depends on your operating system. We
recommend following the [official Telegraf installation
documentation][telegraf-install].

## Configuring Telegraf

Telegraf acts as a StatsD agent and can collect additional metrics about the
hosts where Consul agents are running. Telegraf itself ships with a wide range
of [input plugins][telegraf-input-plugins] to collect data from many sources
for this purpose.

We're going to enable some of the most common input plugins to monitor CPU,
memory, disk I/O, networking, and process status, since these are useful for
debugging Consul cluster issues.

The `telegraf.conf` file starts with global options:

```toml
[agent]
interval = "10s"
flush_interval = "10s"
omit_hostname = false
```

We set the default collection interval to 10 seconds and ask Telegraf to
include a `host` tag in each metric.

Telegraf also allows you to set additional tags on the metrics that pass
through it. In this case, we are adding tags for the server role and
datacenter. We can then use these tags in Grafana to filter queries (for
example, to create a dashboard showing only servers with the `consul-server`
role, or only servers in the `us-east-1` datacenter).

```toml
[global_tags]
role = "consul-server"
datacenter = "us-east-1"
```

Next, we set up a StatsD listener on UDP port 8125, with instructions to
calculate percentile metrics and to parse DogStatsD-compatible tags when they
are sent:

```toml
[[inputs.statsd]]
protocol = "udp"
service_address = ":8125"
delete_gauges = true
delete_counters = true
delete_sets = true
delete_timings = true
percentiles = [90]
metric_separator = "_"
parse_data_dog_tags = true
allowed_pending_messages = 10000
percentile_limit = 1000
```

The full reference to all the available StatsD-related options in Telegraf is
[here][telegraf-statsd-input].

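
So far we have only configured inputs. Telegraf also needs at least one output
plugin so the collected metrics have somewhere to go. The following is a
minimal sketch of an InfluxDB output; the URL and database name are
illustrative assumptions, so point them at your own metrics backend:

```toml
# Illustrative output sketch: send collected metrics to a local
# InfluxDB instance, into a database named "telegraf".
[[outputs.influxdb]]
urls = ["http://localhost:8086"]
database = "telegraf"
```
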

Now we can configure inputs for things like CPU, memory, network I/O, and disk
I/O. Most of them don't require any configuration, but make sure the
`interfaces` list in `inputs.net` matches the interface names you see in
`ifconfig`.

```toml
[[inputs.cpu]]
percpu = true
totalcpu = true
collect_cpu_time = false

[[inputs.disk]]
# mount_points = ["/"]
# ignore_fs = ["tmpfs", "devtmpfs"]

[[inputs.diskio]]
# devices = ["sda", "sdb"]
# skip_serial_number = false

[[inputs.kernel]]
# no configuration

[[inputs.linux_sysctl_fs]]
# no configuration

[[inputs.mem]]
# no configuration

[[inputs.net]]
interfaces = ["enp0s*"]

[[inputs.netstat]]
# no configuration

[[inputs.processes]]
# no configuration

[[inputs.swap]]
# no configuration

[[inputs.system]]
# no configuration
```

Another useful plugin is the [procstat][telegraf-procstat-input] plugin, which
reports metrics for processes you select:

```toml
[[inputs.procstat]]
pattern = "(consul)"
```

Telegraf even includes a [plugin][telegraf-consul-input] that monitors the
health checks associated with the Consul agent, using the Consul API to query
the data.

It's important to note that this plugin does not report Consul's internal
telemetry; Consul already reports those stats itself over the StatsD protocol,
as configured in the next section.

```toml
[[inputs.consul]]
address = "localhost:8500"
scheme = "http"
```

## Configuring Consul Telemetry

Asking Consul to send telemetry to Telegraf is as simple as adding a
`telemetry` section to your agent configuration:

```json
{
  "telemetry": {
    "dogstatsd_addr": "localhost:8125",
    "disable_hostname": true
  }
}
```

As you can see, we only need to specify two options. The `dogstatsd_addr`
option specifies the hostname and port of the StatsD daemon.

Note that we specify DogStatsD format instead of plain StatsD, which tells
Consul to send [tags][tagging] with each metric. Tags can be used by Grafana to
filter data on your dashboards (for example, displaying only the data for which
`role=consul-server`). Telegraf is compatible with the DogStatsD format and
allows us to add our own tags too.

The second option tells Consul not to insert the hostname in the names of the
metrics it sends to StatsD, since the hostnames will be sent as tags. Without
this option, the single metric `consul.raft.apply` would become multiple
metrics:

    consul.server1.raft.apply
    consul.server2.raft.apply
    consul.server3.raft.apply

If you are using a different agent (e.g. Circonus, Statsite, or plain StatsD),
you may want to change this configuration, and you can find the configuration
reference [here][consul-telemetry-config].

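
For example, if your collector speaks plain StatsD rather than DogStatsD, the
equivalent agent configuration might look like the following sketch, using
Consul's `statsd_address` option (the address shown is illustrative; note that
plain StatsD carries no tags, so the tag-based filtering described above would
not be available):

```json
{
  "telemetry": {
    "statsd_address": "localhost:8125",
    "disable_hostname": true
  }
}
```
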

## Visualizing Telegraf Consul Metrics

You can use a tool like [Grafana][] or [Chronograf][] to visualize metrics from
Telegraf.

Here is an example Grafana dashboard:

<div class="center">
![Example Grafana dashboard](/assets/images/grafana-screenshot.png)
</div>

## Metric Aggregates and Alerting from Telegraf

### Memory usage

| Metric Name | Description |
| :---------- | :---------- |
| `mem.total` | Total amount of physical memory (RAM) available on the server. |
| `mem.used_percent` | Percentage of physical memory in use. |
| `swap.used_percent` | Percentage of swap space in use. |

**Why they're important:** Consul keeps all of its data in memory. If Consul
consumes all available memory, it will crash. You should also monitor total
available RAM to make sure some RAM is available for other processes, and swap
usage should remain at 0% for best performance.

**What to look for:** If `mem.used_percent` is over 90%, or if
`swap.used_percent` is greater than 0.

### File descriptors

| Metric Name | Description |
| :---------- | :---------- |
| `linux_sysctl_fs.file-nr` | Number of file handles being used across all processes on the host. |
| `linux_sysctl_fs.file-max` | Total number of available file handles. |

**Why it's important:** Practically anything Consul does -- receiving a
connection from another host, sending data between servers, writing snapshots
to disk -- requires a file descriptor handle. If Consul runs out of handles, it
will stop accepting connections. See [the Consul FAQ][consul_faq_fds] for more
details.

By default, process and kernel limits are fairly conservative. You will want to
increase these beyond the defaults.

**What to look for:** If `file-nr` exceeds 80% of `file-max`.

### CPU usage

| Metric Name | Description |
| :---------- | :---------- |
| `cpu.user_cpu` | Percentage of CPU being used by user processes (such as Consul). |
| `cpu.iowait_cpu` | Percentage of CPU time spent waiting for I/O tasks to complete. |

**Why they're important:** Consul is not particularly demanding of CPU time,
but a spike in CPU usage might indicate too many operations taking place at
once. `iowait_cpu` is critical -- it means Consul is waiting for data to be
written to disk, a sign that Raft might be writing snapshots to disk too often.

**What to look for:** If `cpu.iowait_cpu` is greater than 10%.

### Network activity

| Metric Name | Description |
| :---------- | :---------- |
| `net.bytes_recv` | Bytes received on each network interface. |
| `net.bytes_sent` | Bytes transmitted on each network interface. |

**Why they're important:** A sudden spike in network traffic to Consul might be
the result of a misconfigured application client causing too many requests to
Consul. This is the raw data from the system, rather than a specific Consul
metric.

**What to look for:** Sudden large changes to the `net` metrics (greater than
50% deviation from baseline).

**NOTE:** The `net` metrics are counters, so in order to calculate rates (such
as bytes/second), you will need to apply a function such as
[non_negative_difference][].

### Disk activity

| Metric Name | Description |
| :---------- | :---------- |
| `diskio.read_bytes` | Bytes read from each block device. |
| `diskio.write_bytes` | Bytes written to each block device. |

**Why they're important:** If the Consul host is writing a lot of data to disk,
such as under high-volume workloads, there may be frequent major I/O spikes
during leader elections. This is because under heavy load, Consul is
checkpointing Raft snapshots to disk frequently.

It may also be caused by Consul having debug/trace logging enabled in
production, which can impact performance.

Too much disk I/O can cause the rest of the system to slow down or become
unavailable, as the kernel spends all its time waiting for I/O to complete.


**What to look for:** Sudden large changes to the `diskio` metrics (greater
than 50% deviation from baseline, or more than 3 standard deviations from
baseline).

**NOTE:** The `diskio` metrics are counters, so in order to calculate rates
(such as bytes/second), you will need to apply a function such as
[non_negative_difference][].

## Summary

In this guide you learned how to set up Telegraf with Consul to collect
metrics, and considered your options for visualizing, aggregating, and alerting
on those metrics. To learn about other factors (in addition to monitoring) that
you should consider when running Consul in production, see the
[Production Checklist][prod-checklist].

[non_negative_difference]: https://docs.influxdata.com/influxdb/v1.5/query_language/functions/#non-negative-difference
[consul_faq_fds]: https://www.consul.io/docs/faq.html#q-does-consul-require-certain-user-process-resource-limits-
[telegraf_plugin]: https://github.com/influxdata/telegraf/tree/master/plugins/inputs/consul
[telegraf-install]: https://docs.influxdata.com/telegraf/v1.6/introduction/installation/
[telegraf-consul-input]: https://github.com/influxdata/telegraf/tree/release-1.6/plugins/inputs/consul
[telegraf-statsd-input]: https://github.com/influxdata/telegraf/tree/release-1.6/plugins/inputs/statsd
[telegraf-procstat-input]: https://github.com/influxdata/telegraf/tree/release-1.6/plugins/inputs/procstat
[telegraf-input-plugins]: https://docs.influxdata.com/telegraf/v1.6/plugins/inputs/
[tagging]: https://docs.datadoghq.com/getting_started/tagging/
[consul-telemetry-config]: https://www.consul.io/docs/agent/options.html#telemetry
[consul-telemetry-ref]: https://www.consul.io/docs/agent/telemetry.html
[Grafana]: https://www.influxdata.com/partners/grafana/
[Chronograf]: https://www.influxdata.com/time-series-platform/chronograf/
[prod-checklist]: https://learn.hashicorp.com/consul/advanced/day-1-operations/production-checklist