---
layout: "docs"
page_title: "Monitoring Consul with Telegraf"
sidebar_current: "docs-guides-monitoring-telegraf"
description: |-
  Best practice approaches for monitoring a production Consul cluster with Telegraf
---

# Monitoring Consul with Telegraf

Consul makes available a range of metrics in various formats in order to measure the health and stability of a cluster, and to diagnose or predict potential issues.

There are a number of monitoring tools and options, but for the purposes of this guide we are going to use the [telegraf_plugin][] in conjunction with the statsd protocol supported by Consul.

You can read the full breakdown of metrics with Consul in the [telemetry documentation](/docs/agent/telemetry.html).

## Installing Telegraf

Installing Telegraf is straightforward on most Linux distributions. We recommend following the [official Telegraf installation documentation][telegraf-install].

## Configuring Telegraf

Besides acting as a statsd agent, Telegraf can collect additional metrics about the host that the Consul agent is running on. Telegraf itself ships with a wide range of [input plugins][telegraf-input-plugins] to collect data from many sources for this purpose.

We're going to enable some of the most common ones to monitor CPU, memory, disk I/O, networking, and process status, as these are useful for debugging Consul cluster issues.

The `telegraf.conf` file starts with global options:

```ini
[agent]
interval = "10s"
flush_interval = "10s"
omit_hostname = false
```

We set the default collection interval to 10 seconds and ask Telegraf to include a `host` tag in each metric.

As mentioned above, Telegraf also allows you to set additional tags on the metrics that pass through it.
In this case, we are adding tags for the server role and datacenter. We can then use these tags in Grafana to filter queries (for example, to create a dashboard showing only servers with the `consul-server` role, or only servers in the `us-east-1` datacenter).

```ini
[global_tags]
role = "consul-server"
datacenter = "us-east-1"
```

Next, we set up a statsd listener on UDP port 8125, with instructions to calculate percentile metrics and to parse DogStatsD-compatible tags, when they're sent:

```ini
[[inputs.statsd]]
protocol = "udp"
service_address = ":8125"
delete_gauges = true
delete_counters = true
delete_sets = true
delete_timings = true
percentiles = [90]
metric_separator = "_"
parse_data_dog_tags = true
allowed_pending_messages = 10000
percentile_limit = 1000
```

The full reference to all the available statsd-related options in Telegraf is [here][telegraf-statsd-input].

Now we can configure inputs for things like CPU, memory, network I/O, and disk I/O. Most of them don't require any configuration, but make sure the `interfaces` list in `inputs.net` matches the interface names you see in `ifconfig`.
```ini
[[inputs.cpu]]
percpu = true
totalcpu = true
collect_cpu_time = false

[[inputs.disk]]
# mount_points = ["/"]
# ignore_fs = ["tmpfs", "devtmpfs"]

[[inputs.diskio]]
# devices = ["sda", "sdb"]
# skip_serial_number = false

[[inputs.kernel]]
# no configuration

[[inputs.linux_sysctl_fs]]
# no configuration

[[inputs.mem]]
# no configuration

[[inputs.net]]
interfaces = ["enp0s*"]

[[inputs.netstat]]
# no configuration

[[inputs.processes]]
# no configuration

[[inputs.swap]]
# no configuration

[[inputs.system]]
# no configuration
```

Another useful plugin is the [procstat][telegraf-procstat-input] plugin, which reports metrics for processes you select:

```ini
[[inputs.procstat]]
pattern = "(consul)"
```

Telegraf even includes a [plugin][telegraf-consul-input] that monitors the health checks associated with the Consul agent, using the Consul API to query the data.

It's important to note that this plugin does not report Consul's internal telemetry; Consul already reports those stats via the statsd protocol.

```ini
[[inputs.consul]]
address = "localhost:8500"
scheme = "http"
```

## Telegraf Configuration for Consul

Asking Consul to send telemetry to Telegraf is as simple as adding a `telemetry` section to your agent configuration:

```json
{
  "telemetry": {
    "dogstatsd_addr": "localhost:8125",
    "disable_hostname": true
  }
}
```

As you can see, we only need to specify two options. The `dogstatsd_addr` option specifies the hostname and port of the statsd daemon.

Note that we specify DogStatsD format instead of plain statsd, which tells Consul to send [tags][tagging] with each metric. Tags can be used by Grafana to filter data on your dashboards (for example, displaying only the data for which `role=consul-server`).
Telegraf is compatible with the DogStatsD format and allows us to add our own tags too.

The second option tells Consul not to insert the hostname in the names of the metrics it sends to statsd, since the hostnames will be sent as tags. Without this option, the single metric `consul.raft.apply` would become multiple metrics:

    consul.server1.raft.apply
    consul.server2.raft.apply
    consul.server3.raft.apply

If you are using a different agent (e.g. Circonus, Statsite, or plain statsd), you may want to change this configuration; you can find the configuration reference [here][consul-telemetry-config].

## Visualising Telegraf Consul Metrics

There are a number of ways of consuming the information from Telegraf. Generally, it is visualised using a tool like [Grafana][] or [Chronograf][].

Here is an example Grafana dashboard:

<div class="center">
[![Grafana Consul Cluster](/assets/images/grafana-screenshot.png)](/assets/images/grafana-screenshot.png)
</div>

## Metric Aggregates and Alerting from Telegraf

### Memory usage

| Metric Name | Description |
| :---------- | :---------- |
| `mem.total` | Total amount of physical memory (RAM) available on the server. |
| `mem.used_percent` | Percentage of physical memory in use. |
| `swap.used_percent` | Percentage of swap space in use. |

**Why they're important:** Consul keeps all of its data in memory. If Consul consumes all available memory, it will crash. You should also monitor total available RAM to make sure some RAM is available for other processes, and swap usage should remain at 0% for best performance.

**What to look for:** If `mem.used_percent` is over 90%, or if `swap.used_percent` is greater than 0.
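The memory thresholds above can be expressed as a small alert rule. The sketch below is a hypothetical helper (not part of Telegraf or Consul), assuming you feed it the most recent `mem.used_percent` and `swap.used_percent` values from your metrics store:

```python
def memory_alerts(mem_used_percent, swap_used_percent):
    """Evaluate the memory thresholds described above.

    Hypothetical helper: takes the latest mem.used_percent and
    swap.used_percent values and returns a list of alert reasons.
    """
    alerts = []
    if mem_used_percent > 90:
        alerts.append("mem.used_percent above 90%")
    if swap_used_percent > 0:
        alerts.append("swap.used_percent above 0%")
    return alerts

# A healthy host produces no alerts:
print(memory_alerts(62.0, 0.0))  # → []
```

In practice you would implement the same conditions in your alerting tool of choice (e.g. Grafana alert rules); the function only illustrates the thresholds.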
### File descriptors

| Metric Name | Description |
| :---------- | :---------- |
| `linux_sysctl_fs.file-nr` | Number of file handles being used across all processes on the host. |
| `linux_sysctl_fs.file-max` | Total number of available file handles. |

**Why it's important:** Practically anything Consul does -- receiving a connection from another host, sending data between servers, writing snapshots to disk -- requires a file descriptor handle. If Consul runs out of handles, it will stop accepting connections. See [the Consul FAQ][consul_faq_fds] for more details.

By default, process and kernel limits are fairly conservative. You will want to increase these beyond the defaults.

**What to look for:** If `file-nr` exceeds 80% of `file-max`.

### CPU usage

| Metric Name | Description |
| :---------- | :---------- |
| `cpu.user_cpu` | Percentage of CPU being used by user processes (such as Consul). |
| `cpu.iowait_cpu` | Percentage of CPU time spent waiting for I/O tasks to complete. |

**Why they're important:** Consul is not particularly demanding of CPU time, but a spike in CPU usage might indicate too many operations taking place at once, and `iowait_cpu` is critical -- it means Consul is waiting for data to be written to disk, a sign that Raft might be writing snapshots to disk too often.

**What to look for:** If `cpu.iowait_cpu` is greater than 10%.

### Network activity

| Metric Name | Description |
| :---------- | :---------- |
| `net.bytes_recv` | Bytes received on each network interface. |
| `net.bytes_sent` | Bytes transmitted on each network interface. |

**Why they're important:** A sudden spike in network traffic to Consul might be the result of a misconfigured application client causing too many requests to Consul. This is the raw data from the system, rather than a specific Consul metric.
**What to look for:** Sudden large changes to the `net` metrics (greater than 50% deviation from baseline).

**NOTE:** The `net` metrics are counters, so in order to calculate rates (such as bytes/second), you will need to apply a function such as [non_negative_difference][].

### Disk activity

| Metric Name | Description |
| :---------- | :---------- |
| `diskio.read_bytes` | Bytes read from each block device. |
| `diskio.write_bytes` | Bytes written to each block device. |

**Why they're important:** If the Consul host is writing a lot of data to disk, such as under high volume workloads, there may be frequent major I/O spikes during leader elections. This is because under heavy load, Consul is checkpointing Raft snapshots to disk frequently.

It may also be caused by Consul having debug/trace logging enabled in production, which can impact performance.

Too much disk I/O can cause the rest of the system to slow down or become unavailable, as the kernel spends all its time waiting for I/O to complete.

**What to look for:** Sudden large changes to the `diskio` metrics (greater than 50% deviation from baseline, or more than 3 standard deviations from baseline).

**NOTE:** The `diskio` metrics are counters, so in order to calculate rates (such as bytes/second), you will need to apply a function such as [non_negative_difference][].
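Since both the `net` and `diskio` byte metrics are cumulative counters, rates must be derived from successive samples, and negative deltas caused by counter resets (e.g. after a reboot) must be discarded. A minimal Python sketch of that calculation, mirroring in spirit what InfluxQL's `non_negative_difference` does, assuming evenly spaced samples:

```python
def counter_rates(samples, interval_seconds):
    """Convert cumulative counter samples (e.g. net.bytes_recv) into
    per-second rates.

    Negative deltas -- which occur when a counter resets, for example
    after a host reboot -- are dropped rather than reported as rates.
    """
    rates = []
    for prev, cur in zip(samples, samples[1:]):
        delta = cur - prev
        if delta >= 0:
            rates.append(delta / interval_seconds)
    return rates

# Four samples taken 10s apart; the counter resets before the last one:
print(counter_rates([0, 500, 1500, 100], 10))  # → [50.0, 100.0]
```

In a real deployment you would let your time-series database compute this; the function only illustrates why raw counter values are not directly comparable.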
[non_negative_difference]: https://docs.influxdata.com/influxdb/v1.5/query_language/functions/#non-negative-difference
[consul_faq_fds]: https://www.consul.io/docs/faq.html#q-does-consul-require-certain-user-process-resource-limits-
[telegraf_plugin]: https://github.com/influxdata/telegraf/tree/master/plugins/inputs/consul
[telegraf-install]: https://docs.influxdata.com/telegraf/v1.6/introduction/installation/
[telegraf-consul-input]: https://github.com/influxdata/telegraf/tree/release-1.6/plugins/inputs/consul
[telegraf-statsd-input]: https://github.com/influxdata/telegraf/tree/release-1.6/plugins/inputs/statsd
[telegraf-procstat-input]: https://github.com/influxdata/telegraf/tree/release-1.6/plugins/inputs/procstat
[telegraf-input-plugins]: https://docs.influxdata.com/telegraf/v1.6/plugins/inputs/
[tagging]: https://docs.datadoghq.com/getting_started/tagging/
[consul-telemetry-config]: https://www.consul.io/docs/agent/options.html#telemetry
[consul-telemetry-ref]: https://www.consul.io/docs/agent/telemetry.html
[Grafana]: https://www.influxdata.com/partners/grafana/
[Chronograf]: https://www.influxdata.com/time-series-platform/chronograf/