---
layout: "docs"
page_title: "Consul Cluster Monitoring & Metrics"
sidebar_current: "docs-guides-cluster-monitoring-metrics"
description: After setting up your first datacenter, it is an ideal time to make sure your cluster is healthy and establish a baseline.
---

# Consul Cluster Monitoring and Metrics

After setting up your first datacenter, it is an ideal time to make sure your cluster is healthy and establish a baseline. This guide will cover several types of metrics in two sections: Consul health and server health.

**Consul health**:

- Transaction timing
- Leadership changes
- Autopilot
- Garbage collection

**Server health**:

- File descriptors
- CPU usage
- Network activity
- Disk activity
- Memory usage

For each type of metric, we will review its importance and help identify when a metric indicates a healthy or unhealthy state.

First, we need to understand the three methods for collecting metrics. We will briefly cover using SIGUSR1, the HTTP API, and telemetry.

Before starting this guide, we recommend configuring [ACLs](/docs/guides/acl.html).

## How to Collect Metrics

There are three methods for collecting metrics. The first, and simplest, is to use `SIGUSR1` for a one-time dump of current telemetry values. The second method is to get a similar one-time dump using the HTTP API. The third method, and the one most commonly used for long-term monitoring, is to enable telemetry in the Consul configuration file.

### SIGUSR1 for Local Use

To get a one-time dump of current metric values, we can send the `SIGUSR1` signal to the Consul process.

```sh
$ kill -USR1 <process_id>
```

This will send the output to the system logs, such as `/var/log/messages`, or to `journald`. If you are monitoring the Consul process in the terminal via `consul monitor`, you will see the metrics in the output.

Although this is the easiest way to get a quick read of a single Consul agent's health, it is much more useful to look at how the values change over time.

### API GET Request

Next, let's use the HTTP API to quickly collect metrics with curl.

```sh
$ curl http://127.0.0.1:8500/v1/agent/metrics
```

In production you will want to set up credentials with an ACL token and [enable TLS](/docs/agent/encryption.html) for secure communications. Once ACLs have been configured, you can pass a token with the request.

```sh
$ curl \
    --header "X-Consul-Token: <YOUR_ACL_TOKEN>" \
    https://127.0.0.1:8500/v1/agent/metrics
```

In addition to being a good way to quickly collect metrics, the API can be added to a script or used with monitoring agents that support HTTP scraping, such as Prometheus, to visualize the data.
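For example, to track a single value over time without any extra tooling, you can poll the endpoint in a loop and filter the response. Below is a minimal sketch, assuming `jq` is installed and ACLs do not yet require a token (otherwise add the `X-Consul-Token` header shown above); the metric chosen is just an illustration:

```sh
# Sample the consul.runtime.alloc_bytes gauge every 10 seconds.
while true; do
  curl -s http://127.0.0.1:8500/v1/agent/metrics |
    jq -r '.Gauges[] | select(.Name == "consul.runtime.alloc_bytes") | .Value'
  sleep 10
done
```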
### Enable Telemetry

Finally, Consul can be configured to send telemetry data to a remote monitoring system. This allows you to monitor the health of agents over time, spot trends, and plan for future needs. You will need a monitoring agent and console for this.

Consul supports the following telemetry agents:

* Circonus
* DataDog (via `dogstatsd`)
* StatsD (via `statsd`, `statsite`, `telegraf`, etc.)

If you are using StatsD, you will also need a compatible database and server, such as Grafana, Chronograf, or Prometheus.

Telemetry can be enabled in the agent configuration file, for example `server.hcl`. Telemetry can be enabled on any agent, client or server. Normally, you would at least enable it on all the servers (both voting and non-voting) to monitor the health of the entire cluster.

An example snippet of `server.hcl` to send telemetry to DataDog looks like this:

```json
"telemetry": {
  "dogstatsd_addr": "localhost:8125",
  "disable_hostname": true
}
```

When enabling telemetry on an existing cluster, the Consul process will need to be reloaded. This can be done with `consul reload` or `kill -HUP <process_id>`. It is recommended to reload the servers one at a time, starting with the non-leaders.

## Consul Health

The Consul health metrics reveal information about the Consul cluster. They include performance metrics for the key value store, transactions, Raft, leadership changes, autopilot tuning, and garbage collection.

### Transaction Timing

The following metrics indicate how long it takes to complete write operations in various parts of the cluster, including the Consul KV store and Raft on the Consul servers. Generally, these values should remain reasonably consistent and no more than a few milliseconds each.

| Metric Name | Description |
| :----------------------- | :---------- |
| `consul.kvs.apply` | Measures the time it takes to complete an update to the KV store. |
| `consul.txn.apply` | Measures the time spent applying a transaction operation. |
| `consul.raft.apply` | Counts the number of Raft transactions occurring over the interval. |
| `consul.raft.commitTime` | Measures the time it takes to commit a new entry to the Raft log on the leader. |

Sudden changes in any of the timing values could be due to unexpected load on the Consul servers or due to problems on the hosts themselves. Specifically, if any of these metrics deviate more than 50% from the baseline over the previous hour, this indicates an issue. Below are examples of healthy transaction metrics.

```sh
'consul.raft.apply': Count: 1 Sum: 1.000 LastUpdated: 2018-11-16 10:55:03.673805766 -0600 CST m=+97598.238246167
'consul.raft.commitTime': Count: 1 Sum: 0.017 LastUpdated: 2018-11-16 10:55:03.673840104 -0600 CST m=+97598.238280505
```
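To make the 50% guideline concrete, here is a minimal sketch of a check you could run on a schedule, assuming `jq` is installed and that `BASELINE_MS` was captured earlier during normal operation; the metric and threshold shown are illustrative:

```sh
#!/bin/sh
# Compare the current mean of consul.kvs.apply against a stored baseline (ms).
BASELINE_MS=2.0

CURRENT_MS=$(curl -s http://127.0.0.1:8500/v1/agent/metrics |
  jq -r '.Samples[] | select(.Name == "consul.kvs.apply") | .Mean')

# No KV writes in the current interval; nothing to compare.
[ -z "$CURRENT_MS" ] && exit 0

# Flag a deviation of more than 50% in either direction.
echo "$CURRENT_MS $BASELINE_MS" | awk '{
  if ($1 > $2 * 1.5 || $1 < $2 * 0.5)
    printf "WARNING: consul.kvs.apply mean %sms deviates >50%% from baseline %sms\n", $1, $2
}'
```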
### Leadership Changes

In a healthy environment, your Consul cluster should have a stable leader. There shouldn't be any leadership changes unless you manually change leadership (by taking a server out of the cluster, for example). If there are unexpected elections or leadership changes, you should investigate possible network issues between the Consul servers. Another possible cause could be that the Consul servers are unable to keep up with the transaction load.

Note: These metrics are reported by the follower nodes, not by the leader.

| Metric Name | Description |
| :---------- | :---------- |
| `consul.raft.leader.lastContact` | Measures the time since the leader was last able to contact the follower nodes when checking its leader lease. |
| `consul.raft.state.candidate` | Increments when a Consul server starts an election process. |
| `consul.raft.state.leader` | Increments when a Consul server becomes a leader. |

If the `candidate` or `leader` metrics are greater than 0 or the `lastContact` metric is greater than 200ms, you should look into one of the possible causes described above. Below are examples of healthy leadership metrics.

```sh
'consul.raft.leader.lastContact': Count: 4 Min: 10.000 Mean: 31.000 Max: 50.000 Stddev: 17.088 Sum: 124.000 LastUpdated: 2018-12-17 22:06:08.872973122 +0000 UTC m=+3553.639379498
'consul.raft.state.leader': Count: 1 Sum: 1.000 LastUpdated: 2018-12-17 22:05:49.104580236 +0000 UTC m=+3533.870986584
'consul.raft.state.candidate': Count: 1 Sum: 1.000 LastUpdated: 2018-12-17 22:05:49.097186444 +0000 UTC m=+3533.863592815
```

### Autopilot

The autopilot metric is a boolean. A value of 1 indicates a healthy cluster and 0 indicates an unhealthy state.

| Metric Name | Description |
| :---------- | :---------- |
| `consul.autopilot.healthy` | Tracks the overall health of the local server cluster. If all servers are considered healthy by autopilot, this will be set to 1. If any are unhealthy, this will be 0. |

An alert should be set up for a returned value of 0. Below is an example of a healthy cluster according to the autopilot metric.

```sh
[2018-12-17 13:03:40 -0500 EST][G] 'consul.autopilot.healthy': 1.000
```

### Garbage Collection

Garbage collection (GC) pauses are a "stop-the-world" event: all runtime threads are blocked until GC completes. In a healthy environment these pauses should only last a few nanoseconds. If memory usage is high, the Go runtime may start the GC process so frequently that it will slow down Consul. You might observe more frequent leader elections or longer write times.

| Metric Name | Description |
| :---------- | :---------- |
| `consul.runtime.total_gc_pause_ns` | Number of nanoseconds consumed by stop-the-world garbage collection (GC) pauses since Consul started. |

If the returned value is more than 2 seconds/minute, you should start investigating the cause. If it exceeds 5 seconds/minute, you should consider the cluster to be in a critical state and make sure your failure recovery procedures are up-to-date. Below is an example of a healthy GC pause metric.

```sh
'consul.runtime.total_gc_pause_ns': 136603664.000
```

Note: `total_gc_pause_ns` is a cumulative counter, so in order to calculate rates, such as GC pause time per minute, you will need to apply a function such as [non_negative_difference](https://docs.influxdata.com/influxdb/v1.5/query_language/functions/#non-negative-difference).
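Outside of a monitoring system, you can approximate that rate by sampling the counter twice. A rough sketch, again assuming `jq` is available on the host:

```sh
#!/bin/sh
# Sample total_gc_pause_ns one minute apart and report the per-minute delta.
gc_pause() {
  curl -s http://127.0.0.1:8500/v1/agent/metrics |
    jq -r '.Gauges[] | select(.Name == "consul.runtime.total_gc_pause_ns") | .Value'
}

FIRST=$(gc_pause)
sleep 60
SECOND=$(gc_pause)

# Convert the delta from nanoseconds to seconds per minute.
echo "$FIRST $SECOND" | awk '{ printf "GC pause: %.3f seconds/minute\n", ($2 - $1) / 1e9 }'
```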
## Server Health

The server metrics provide information about the health of your cluster, including file handles, CPU usage, network activity, disk activity, and memory usage.

### File Descriptors

The majority of Consul operations require a file descriptor handle, including receiving a connection from another host, sending data between servers, and writing snapshots to disk. If Consul runs out of handles, it will stop accepting connections.

| Metric Name | Description |
| :---------- | :---------- |
| `linux_sysctl_fs.file-nr` | Number of file handles being used across all processes on the host. |
| `linux_sysctl_fs.file-max` | Total number of available file handles. |

By default, process and kernel limits are conservative, so you may want to increase the limits beyond the defaults. If the `linux_sysctl_fs.file-nr` value exceeds 80% of `linux_sysctl_fs.file-max`, the file handle limits should be increased. Below is an example of a file handle metric.

```sh
linux_sysctl_fs, host=statsbox, file-nr=768i, file-max=96763i
```

### CPU Usage

Consul should not be demanding of CPU time on either servers or clients. A spike in CPU usage could indicate too many operations taking place at once.

| Metric Name | Description |
| :---------- | :---------- |
| `cpu.user_cpu` | Percentage of CPU being used by user processes (such as Vault or Consul). |
| `cpu.iowait_cpu` | Percentage of CPU time spent waiting for I/O tasks to complete. |

If `cpu.iowait_cpu` is greater than 10%, it should be considered critical, as Consul is waiting for data to be written to disk. This could be a sign that Raft is writing snapshots to disk too often. Below is an example of a healthy CPU metric.

```sh
cpu, cpu=cpu-total, usage_idle=99.298, usage_user=0.400, usage_system=0.300, usage_iowait=0, usage_steal=0
```

### Network Activity

Network activity should be consistent. A sudden spike in network traffic to Consul might be the result of a misconfigured client, such as Vault, that is causing too many requests.

Most agents will report separate metrics for each network interface, so be sure you are monitoring the right one.

| Metric Name | Description |
| :---------- | :---------- |
| `net.bytes_recv` | Bytes received on each network interface. |
| `net.bytes_sent` | Bytes transmitted on each network interface. |

A sudden increase in the `net` metrics, greater than a 50% deviation from baseline, indicates too many requests that are not being handled. Below is an example of a network activity metric.

```sh
net, interface=enp0s5, bytes_sent=6183357i, bytes_recv=262313256i
```

Note: The `net` metrics are counters, so in order to calculate rates, such as bytes/second, you will need to apply a function such as [non_negative_difference](https://docs.influxdata.com/influxdb/v1.5/query_language/functions/#non-negative-difference).

### Disk Activity

Normally, there is low disk activity, because Consul keeps everything in memory. If the Consul host is writing a large amount of data to disk, it could mean that Consul is under heavy write load and consequently is checkpointing Raft snapshots to disk frequently. It could also mean that debug/trace logging has accidentally been enabled in production, which can impact performance.

| Metric Name | Description |
| :---------- | :---------- |
| `diskio.read_bytes` | Bytes read from each block device. |
| `diskio.write_bytes` | Bytes written to each block device. |
| `diskio.read_time` | Time spent reading from disk, in cumulative milliseconds. |
| `diskio.write_time` | Time spent writing to disk, in cumulative milliseconds. |

Sudden, large changes to the `diskio` metrics, greater than a 50% deviation from baseline or more than 3 standard deviations from baseline, indicate that Consul has too much disk I/O. Too much disk I/O can cause the rest of the system to slow down or become unavailable, as the kernel spends all its time waiting for I/O to complete. Below is an example of a disk activity metric.

```sh
diskio, name=sda5, read_bytes=522298368i, write_bytes=1726865408i, read_time=7248i, write_time=133364i
```

Note: The `diskio` metrics are counters, so in order to calculate rates, such as bytes/second, you will need to apply a function such as [non_negative_difference](https://docs.influxdata.com/influxdb/v1.5/query_language/functions/#non-negative-difference).

### Memory Usage

As noted previously, Consul keeps all of its data -- the KV store, the catalog, etc. -- in memory. If Consul consumes all available memory, it will crash. You should monitor total available RAM to make sure some RAM is available for other system processes, and swap usage should remain at 0% for best performance.

| Metric Name | Description |
| :---------- | :---------- |
| `consul.runtime.alloc_bytes` | Measures the number of bytes allocated by the Consul process. |
| `consul.runtime.sys_bytes` | The total number of bytes of memory obtained from the OS. |
| `mem.total` | Total amount of physical memory (RAM) available on the server. |
| `mem.used_percent` | Percentage of physical memory in use. |
| `swap.used_percent` | Percentage of swap space in use. |

Consul servers are running low on memory if `consul.runtime.sys_bytes` exceeds 90% of `mem.total`, `mem.used_percent` is over 90%, or `swap.used_percent` is greater than 0. You should increase the memory available to Consul if any of these three conditions are met. Below are examples of memory usage metrics.

```sh
'consul.runtime.alloc_bytes': 11199928.000
'consul.runtime.sys_bytes': 24627448.000
mem, used_percent=31.492, total=1036312576i
swap, used_percent=1.343
```
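As a quick spot check outside of your monitoring system, the same conditions can be tested directly on a Linux host. A minimal sketch, assuming `jq` is installed and `/proc/meminfo` is available:

```sh
#!/bin/sh
# Check Consul memory pressure and swap usage on the local host.
SYS_BYTES=$(curl -s http://127.0.0.1:8500/v1/agent/metrics |
  jq -r '.Gauges[] | select(.Name == "consul.runtime.sys_bytes") | .Value')

# /proc/meminfo reports values in kB.
TOTAL_BYTES=$(awk '/^MemTotal:/ { print $2 * 1024 }' /proc/meminfo)
SWAP_USED_KB=$(awk '/^SwapTotal:/ { t = $2 } /^SwapFree:/ { f = $2 } END { print t - f }' /proc/meminfo)

echo "$SYS_BYTES $TOTAL_BYTES" | awk '{
  if ($1 > $2 * 0.9) print "WARNING: Consul is using more than 90% of system memory"
}'
[ "$SWAP_USED_KB" -gt 0 ] && echo "WARNING: swap is in use"
```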
## Summary

In this guide we reviewed the three methods for collecting metrics. The SIGUSR1 signal and the agent HTTP API are both quick methods for collecting metrics, but enabling telemetry is the best method for moving data into monitoring software. Additionally, we outlined the various metrics collected and their significance.