---
layout: "docs"
page_title: "Monitoring Consul with Telegraf"
sidebar_current: "docs-guides-monitoring-telegraf"
description: |-
  Best practice approaches for monitoring a production Consul cluster with Telegraf
---

# Monitoring Consul with Telegraf

Consul exposes a range of metrics in various formats so operators can
measure the health and stability of a cluster, and diagnose or predict
potential issues.

There are a number of monitoring tools and options available, but for the
purposes of this guide we are going to use the [telegraf_plugin][] in
conjunction with the StatsD protocol supported by Consul.

You can read the full list of metrics available with Consul in the [telemetry
documentation](/docs/agent/telemetry.html).

In this guide you will:

- Configure Telegraf to collect StatsD and host-level metrics
- Configure Consul to send metrics to Telegraf
- See an example of metrics visualization
- Understand important metrics to aggregate and alert on

## Installing Telegraf

The process for installing Telegraf depends on your operating system. We
recommend following the [official Telegraf installation
documentation][telegraf-install].

## Configuring Telegraf

Telegraf acts as a StatsD agent and can collect additional metrics about the
hosts where Consul agents are running. Telegraf itself ships with a wide range
of [input plugins][telegraf-input-plugins] to collect data from many sources
for this purpose.

We're going to enable some of the most common input plugins to monitor CPU,
memory, disk I/O, networking, and process status, since these are useful for
debugging Consul cluster issues.

The `telegraf.conf` file starts with global options:

```toml
[agent]
interval = "10s"
flush_interval = "10s"
omit_hostname = false
```

We set the default collection interval to 10 seconds and ask Telegraf to
include a `host` tag in each metric.

Telegraf also allows you to set additional tags on the metrics that pass
through it. In this case, we are adding tags for the server role and
datacenter. We can then use these tags in Grafana to filter queries (for
example, to create a dashboard showing only servers with the `consul-server`
role, or only servers in the `us-east-1` datacenter).

```toml
[global_tags]
role = "consul-server"
datacenter = "us-east-1"
```

Next, we set up a StatsD listener on UDP port 8125, with instructions to
calculate percentile metrics and to parse DogStatsD-compatible tags when they
are sent:

```toml
[[inputs.statsd]]
protocol = "udp"
service_address = ":8125"
delete_gauges = true
delete_counters = true
delete_sets = true
delete_timings = true
percentiles = [90]
metric_separator = "_"
parse_data_dog_tags = true
allowed_pending_messages = 10000
percentile_limit = 1000
```

The full reference to all the available StatsD-related options in Telegraf is
[here][telegraf-statsd-input].

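
So far we have only configured inputs. Telegraf also needs at least one output
plugin so the collected metrics have somewhere to go. The following is a
minimal sketch of an InfluxDB output; the URL and database name are
illustrative assumptions, so point them at your own metrics backend:

```toml
# Illustrative output sketch: send collected metrics to a local
# InfluxDB instance, into a database named "telegraf".
[[outputs.influxdb]]
urls = ["http://localhost:8086"]
database = "telegraf"
```
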

Now we can configure inputs for things like CPU, memory, network I/O, and disk
I/O. Most of them don't require any configuration, but make sure the
`interfaces` list in `inputs.net` matches the interface names you see in
`ifconfig`.

```toml
[[inputs.cpu]]
percpu = true
totalcpu = true
collect_cpu_time = false

[[inputs.disk]]
# mount_points = ["/"]
# ignore_fs = ["tmpfs", "devtmpfs"]

[[inputs.diskio]]
# devices = ["sda", "sdb"]
# skip_serial_number = false

[[inputs.kernel]]
# no configuration

[[inputs.linux_sysctl_fs]]
# no configuration

[[inputs.mem]]
# no configuration

[[inputs.net]]
interfaces = ["enp0s*"]

[[inputs.netstat]]
# no configuration

[[inputs.processes]]
# no configuration

[[inputs.swap]]
# no configuration

[[inputs.system]]
# no configuration
```

Another useful plugin is the [procstat][telegraf-procstat-input] plugin, which
reports metrics for processes you select:

```toml
[[inputs.procstat]]
pattern = "(consul)"
```

Telegraf even includes a [plugin][telegraf-consul-input] that monitors the
health checks associated with the Consul agent, using the Consul API to query
the data.

It's important to note that this plugin does not report Consul's internal
telemetry; Consul already reports those stats itself over the StatsD protocol,
as configured in the next section.

```toml
[[inputs.consul]]
address = "localhost:8500"
scheme = "http"
```

## Configuring Consul Telemetry

Asking Consul to send telemetry to Telegraf is as simple as adding a
`telemetry` section to your agent configuration:

```json
{
  "telemetry": {
    "dogstatsd_addr": "localhost:8125",
    "disable_hostname": true
  }
}
```

As you can see, we only need to specify two options. The `dogstatsd_addr`
option specifies the hostname and port of the StatsD daemon.

Note that we specify DogStatsD format instead of plain StatsD, which tells
Consul to send [tags][tagging] with each metric. Tags can be used by Grafana to
filter data on your dashboards (for example, displaying only the data for which
`role=consul-server`). Telegraf is compatible with the DogStatsD format and
allows us to add our own tags too.

The second option tells Consul not to insert the hostname in the names of the
metrics it sends to StatsD, since the hostnames will be sent as tags. Without
this option, the single metric `consul.raft.apply` would become multiple
metrics:

    consul.server1.raft.apply
    consul.server2.raft.apply
    consul.server3.raft.apply

If you are using a different agent (e.g. Circonus, Statsite, or plain StatsD),
you may want to change this configuration, and you can find the configuration
reference [here][consul-telemetry-config].

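
For example, if your collector speaks plain StatsD rather than DogStatsD, the
equivalent agent configuration might look like the following sketch, using
Consul's `statsd_address` option (the address shown is illustrative; note that
plain StatsD carries no tags, so the tag-based filtering described above would
not be available):

```json
{
  "telemetry": {
    "statsd_address": "localhost:8125",
    "disable_hostname": true
  }
}
```
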

## Visualizing Telegraf Consul Metrics

You can use a tool like [Grafana][] or [Chronograf][] to visualize metrics from
Telegraf.

Here is an example Grafana dashboard:

<div class="center">
![Example Grafana dashboard](/assets/images/grafana-screenshot.png)
</div>

## Metric Aggregates and Alerting from Telegraf

### Memory usage

| Metric Name | Description |
| :---------- | :---------- |
| `mem.total` | Total amount of physical memory (RAM) available on the server. |
| `mem.used_percent` | Percentage of physical memory in use. |
| `swap.used_percent` | Percentage of swap space in use. |

**Why they're important:** Consul keeps all of its data in memory. If Consul
consumes all available memory, it will crash. You should also monitor total
available RAM to make sure some RAM is available for other processes, and swap
usage should remain at 0% for best performance.

**What to look for:** If `mem.used_percent` is over 90%, or if
`swap.used_percent` is greater than 0.

### File descriptors

| Metric Name | Description |
| :---------- | :---------- |
| `linux_sysctl_fs.file-nr` | Number of file handles being used across all processes on the host. |
| `linux_sysctl_fs.file-max` | Total number of available file handles. |

**Why it's important:** Practically anything Consul does -- receiving a
connection from another host, sending data between servers, writing snapshots
to disk -- requires a file descriptor handle. If Consul runs out of handles, it
will stop accepting connections. See [the Consul FAQ][consul_faq_fds] for more
details.

By default, process and kernel limits are fairly conservative. You will want to
increase these beyond the defaults.

**What to look for:** If `file-nr` exceeds 80% of `file-max`.

### CPU usage

| Metric Name | Description |
| :---------- | :---------- |
| `cpu.user_cpu` | Percentage of CPU being used by user processes (such as Consul). |
| `cpu.iowait_cpu` | Percentage of CPU time spent waiting for I/O tasks to complete. |

**Why they're important:** Consul is not particularly demanding of CPU time,
but a spike in CPU usage might indicate too many operations taking place at
once. `iowait_cpu` is critical -- it means Consul is waiting for data to be
written to disk, a sign that Raft might be writing snapshots to disk too often.

**What to look for:** If `cpu.iowait_cpu` is greater than 10%.

### Network activity

| Metric Name | Description |
| :---------- | :---------- |
| `net.bytes_recv` | Bytes received on each network interface. |
| `net.bytes_sent` | Bytes transmitted on each network interface. |

**Why they're important:** A sudden spike in network traffic to Consul might be
the result of a misconfigured application client causing too many requests to
Consul. This is the raw data from the system, rather than a specific Consul
metric.

**What to look for:** Sudden large changes to the `net` metrics (greater than
50% deviation from baseline).

**NOTE:** The `net` metrics are counters, so in order to calculate rates (such
as bytes/second), you will need to apply a function such as
[non_negative_difference][].

### Disk activity

| Metric Name | Description |
| :---------- | :---------- |
| `diskio.read_bytes` | Bytes read from each block device. |
| `diskio.write_bytes` | Bytes written to each block device. |

**Why they're important:** If the Consul host is writing a lot of data to disk,
such as under high-volume workloads, there may be frequent major I/O spikes
during leader elections. This is because under heavy load, Consul is
checkpointing Raft snapshots to disk frequently.

It may also be caused by Consul having debug/trace logging enabled in
production, which can impact performance.

Too much disk I/O can cause the rest of the system to slow down or become
unavailable, as the kernel spends all its time waiting for I/O to complete.


**What to look for:** Sudden large changes to the `diskio` metrics (greater
than 50% deviation from baseline, or more than 3 standard deviations from
baseline).

**NOTE:** The `diskio` metrics are counters, so in order to calculate rates
(such as bytes/second), you will need to apply a function such as
[non_negative_difference][].

## Summary

In this guide you learned how to set up Telegraf with Consul to collect
metrics, and considered your options for visualizing, aggregating, and alerting
on those metrics. To learn about other factors (in addition to monitoring) that
you should consider when running Consul in production, see the
[Production Checklist][prod-checklist].

[non_negative_difference]: https://docs.influxdata.com/influxdb/v1.5/query_language/functions/#non-negative-difference
[consul_faq_fds]: https://www.consul.io/docs/faq.html#q-does-consul-require-certain-user-process-resource-limits-
[telegraf_plugin]: https://github.com/influxdata/telegraf/tree/master/plugins/inputs/consul
[telegraf-install]: https://docs.influxdata.com/telegraf/v1.6/introduction/installation/
[telegraf-consul-input]: https://github.com/influxdata/telegraf/tree/release-1.6/plugins/inputs/consul
[telegraf-statsd-input]: https://github.com/influxdata/telegraf/tree/release-1.6/plugins/inputs/statsd
[telegraf-procstat-input]: https://github.com/influxdata/telegraf/tree/release-1.6/plugins/inputs/procstat
[telegraf-input-plugins]: https://docs.influxdata.com/telegraf/v1.6/plugins/inputs/
[tagging]: https://docs.datadoghq.com/getting_started/tagging/
[consul-telemetry-config]: https://www.consul.io/docs/agent/options.html#telemetry
[consul-telemetry-ref]: https://www.consul.io/docs/agent/telemetry.html
[Grafana]: https://www.influxdata.com/partners/grafana/
[Chronograf]: https://www.influxdata.com/time-series-platform/chronograf/
[prod-checklist]: https://learn.hashicorp.com/consul/advanced/day-1-operations/production-checklist