
---
layout: "docs"
page_title: "Consul Cluster Monitoring & Metrics"
sidebar_current: "docs-guides-cluster-monitoring-metrics"
description: After setting up your first datacenter, it is an ideal time to make sure your cluster is healthy and establish a baseline.
---

# Consul Cluster Monitoring and Metrics

After setting up your first datacenter, it is an ideal time to make sure your cluster is healthy and establish a baseline. This guide will cover several types of metrics in two sections: Consul health and server health.

**Consul health**:

- Transaction timing
- Leadership changes
- Autopilot
- Garbage collection

**Server health**:

- File descriptors
- CPU usage
- Network activity
- Disk activity
- Memory usage

For each type of metric, we will review why it is important and how to identify whether it indicates a healthy or unhealthy state.

First, we need to understand the three methods for collecting metrics. We will briefly cover using SIGUSR1, the HTTP API, and telemetry.

Before starting this guide, we recommend configuring [ACLs](/docs/guides/acl.html).

## How to Collect Metrics

There are three methods for collecting metrics. The first, and simplest, is to use `SIGUSR1` for a one-time dump of current telemetry values. The second method is to get a similar one-time dump using the HTTP API. The third method, and the one most commonly used for long-term monitoring, is to enable telemetry in the Consul configuration file.

### SIGUSR1 for Local Use

To get a one-time dump of current metric values, we can send the `SIGUSR1` signal to the Consul process.

```sh
$ kill -USR1 <process_id>
```

This will send the output to the system logs, such as `/var/log/messages` or to `journald`. If you are monitoring the Consul process in the terminal via `consul monitor`, you will see the metrics in the output.

Although this is the easiest way to get a quick read of a single Consul agent’s health, it is much more useful to look at how the values change over time.
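
For example, if the agent runs under systemd, you can trigger the dump and read it back from the journal. This is a minimal sketch; it assumes the Consul binary is named `consul` and the systemd unit is also called `consul`, so adjust both for your environment.

```sh
# Send SIGUSR1 to the running Consul agent (assumes the process is named "consul").
$ kill -USR1 $(pgrep -x consul)

# Read the telemetry dump back from the journal (assumes a systemd unit named "consul").
$ journalctl -u consul --since "2 minutes ago" | grep "consul.runtime"
```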

### API GET Request

Next let’s use the HTTP API to quickly collect metrics with curl.

```sh
$ curl http://127.0.0.1:8500/v1/agent/metrics
```

In production you will want to set up credentials with an ACL token and [enable TLS](/docs/agent/encryption.html) for secure communications. Once ACLs have been configured, you can pass a token with the request.

```sh
$ curl \
    --header "X-Consul-Token: <YOUR_ACL_TOKEN>" \
    https://127.0.0.1:8500/v1/agent/metrics
```

In addition to being a quick way to collect metrics, the HTTP API can be used in scripts or with monitoring agents that support HTTP scraping, such as Prometheus, to visualize the data.
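
The endpoint returns JSON, so it is easy to filter for a single metric on the command line. Below is a minimal sketch that assumes `jq` is installed and relies on the `Gauges` array and `Name` field of the `/v1/agent/metrics` response.

```sh
# Extract one gauge from the metrics payload (assumes jq is available locally).
$ curl -s http://127.0.0.1:8500/v1/agent/metrics | \
    jq '.Gauges[] | select(.Name == "consul.autopilot.healthy")'
```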

### Enable Telemetry

Finally, Consul can be configured to send telemetry data to a remote monitoring system. This allows you to monitor the health of agents over time, spot trends, and plan for future needs. You will need a monitoring agent and console for this.

Consul supports the following telemetry agents:

* Circonus
* DataDog (via `dogstatsd`)
* StatsD (via `statsd`, `statsite`, `telegraf`, etc.)

If you are using StatsD, you will also need a compatible database and server, such as Grafana, Chronograf, or Prometheus.

Telemetry can be enabled in the agent configuration file, for example `server.hcl`, on any agent, client or server. Normally, you would at least enable it on all the servers (both voting and non-voting) to monitor the health of the entire cluster.

An example telemetry stanza in `server.hcl` that sends metrics to DataDog looks like this:

```hcl
telemetry {
  dogstatsd_addr = "localhost:8125"
  disable_hostname = true
}
```
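
Before rolling the change out, it is worth validating the updated configuration. This is a sketch that assumes your agent configuration lives in `/etc/consul.d/`; adjust the path for your environment.

```sh
# Check the configuration directory for syntax errors before reloading the agents.
$ consul validate /etc/consul.d/
```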

When enabling telemetry on an existing cluster, the Consul process will need to be reloaded. This can be done with `consul reload` or `kill -HUP <process_id>`. It is recommended to reload the servers one at a time, starting with the non-leaders.

## Consul Health

The Consul health metrics reveal information about the Consul cluster. They include performance metrics for the key value store, transactions, Raft, leadership changes, autopilot tuning, and garbage collection.

### Transaction Timing

The following metrics indicate how long it takes to complete write operations in various parts of the cluster, including the Consul KV store and Raft on the Consul servers. Generally, these values should remain reasonably consistent and be no more than a few milliseconds each.

| Metric Name              | Description |
| :----------------------- | :---------- |
| `consul.kvs.apply`       | Measures the time it takes to complete an update to the KV store. |
| `consul.txn.apply`       | Measures the time spent applying a transaction operation. |
| `consul.raft.apply`      | Counts the number of Raft transactions occurring over the interval. |
| `consul.raft.commitTime` | Measures the time it takes to commit a new entry to the Raft log on the leader. |

Sudden changes in any of the timing values could be due to unexpected load on the Consul servers or due to problems on the hosts themselves. Specifically, if any of these metrics deviate more than 50% from the baseline over the previous hour, this indicates an issue. Below are examples of healthy transaction metrics.

```sh
'consul.raft.apply': Count: 1 Sum: 1.000 LastUpdated: 2018-11-16 10:55:03.673805766 -0600 CST m=+97598.238246167
'consul.raft.commitTime': Count: 1 Sum: 0.017 LastUpdated: 2018-11-16 10:55:03.673840104 -0600 CST m=+97598.238280505
```
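
To spot-check these timings against your baseline without a monitoring console, you can pull them from the local agent. This is a minimal sketch that assumes `jq` is installed and that timing metrics such as `consul.kvs.apply` appear under the `Samples` array of the `/v1/agent/metrics` payload.

```sh
# Show the current timing summary for KV store writes on this agent.
$ curl -s http://127.0.0.1:8500/v1/agent/metrics | \
    jq '.Samples[] | select(.Name == "consul.kvs.apply")'
```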

### Leadership Changes

In a healthy environment, your Consul cluster should have a stable leader. There shouldn’t be any leadership changes unless you manually change leadership (by taking a server out of the cluster, for example). If there are unexpected elections or leadership changes, you should investigate possible network issues between the Consul servers. Another possible cause could be that the Consul servers are unable to keep up with the transaction load.

Note: These metrics are reported by the follower nodes, not by the leader.

| Metric Name | Description |
| :---------- | :---------- |
| `consul.raft.leader.lastContact` | Measures the time since the leader was last able to contact the follower nodes when checking its leader lease. |
| `consul.raft.state.candidate` | Increments when a Consul server starts an election process. |
| `consul.raft.state.leader` | Increments when a Consul server becomes a leader. |

If the `candidate` or `leader` metrics are greater than 0 or the `lastContact` metric is greater than 200ms, you should look into one of the possible causes described above. Below are examples of healthy leadership metrics.

```sh
'consul.raft.leader.lastContact': Count: 4 Min: 10.000 Mean: 31.000 Max: 50.000 Stddev: 17.088 Sum: 124.000 LastUpdated: 2018-12-17 22:06:08.872973122 +0000 UTC m=+3553.639379498
'consul.raft.state.leader': Count: 1 Sum: 1.000 LastUpdated: 2018-12-17 22:05:49.104580236 +0000 UTC m=+3533.870986584
'consul.raft.state.candidate': Count: 1 Sum: 1.000 LastUpdated: 2018-12-17 22:05:49.097186444 +0000 UTC m=+3533.863592815
```
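
If these metrics point to instability, it can help to confirm which server currently holds leadership and whether all peers are voters. The `consul operator raft list-peers` command shows this; it requires appropriate operator permissions when ACLs are enabled.

```sh
# List the Raft peers, the current leader, and each server's voting status.
$ consul operator raft list-peers
```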

### Autopilot

The autopilot metric is a boolean: a value of 1 indicates a healthy cluster and 0 indicates an unhealthy state.

| Metric Name | Description |
| :---------- | :---------- |
| `consul.autopilot.healthy` | Tracks the overall health of the local server cluster. If all servers are considered healthy by autopilot, this will be set to 1. If any are unhealthy, this will be 0. |

An alert should be set up for a returned value of 0. Below is an example of a healthy cluster according to the autopilot metric.

```sh
[2018-12-17 13:03:40 -0500 EST][G] 'consul.autopilot.healthy': 1.000
```
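
When this metric reports 0, the operator API can tell you which server autopilot considers unhealthy. A minimal check against a local server is sketched below; the endpoint requires a token with operator read permissions when ACLs are enabled.

```sh
# Per-server health detail behind the consul.autopilot.healthy metric.
$ curl -s \
    --header "X-Consul-Token: <YOUR_ACL_TOKEN>" \
    http://127.0.0.1:8500/v1/operator/autopilot/health
```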

### Garbage Collection

Garbage collection (GC) pauses are a "stop-the-world" event: all runtime threads are blocked until GC completes. In a healthy environment these pauses should only last a few nanoseconds. If memory usage is high, the Go runtime may start the GC process so frequently that it will slow down Consul. You might observe more frequent leader elections or longer write times.

| Metric Name | Description |
| :---------- | :---------- |
| `consul.runtime.total_gc_pause_ns` | Number of nanoseconds consumed by stop-the-world garbage collection (GC) pauses since Consul started. |

If this value is growing by more than 2 seconds per minute, you should start investigating the cause. If it exceeds 5 seconds per minute, you should consider the cluster to be in a critical state, make sure your failure recovery procedures are up to date, and investigate immediately. Below is an example of a healthy GC pause metric.

```sh
'consul.runtime.total_gc_pause_ns': 136603664.000
```

Note: `total_gc_pause_ns` is a cumulative counter, so in order to calculate rates, such as GC pause per minute, you will need to apply a function such as [non_negative_difference](https://docs.influxdata.com/influxdb/v1.5/query_language/functions/#non-negative-difference).
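
If you are not yet shipping this metric to a time series database, you can approximate the rate by sampling the counter twice. The sketch below assumes `jq` is installed and that the counter is exposed under the `Gauges` array of `/v1/agent/metrics`.

```sh
# Sample the cumulative GC pause counter twice, 60 seconds apart,
# and report the difference in seconds of pause per minute.
$ first=$(curl -s http://127.0.0.1:8500/v1/agent/metrics | \
    jq '.Gauges[] | select(.Name == "consul.runtime.total_gc_pause_ns").Value')
$ sleep 60
$ second=$(curl -s http://127.0.0.1:8500/v1/agent/metrics | \
    jq '.Gauges[] | select(.Name == "consul.runtime.total_gc_pause_ns").Value')
$ echo "$first $second" | awk '{printf "GC pause: %.3f seconds/minute\n", ($2 - $1) / 1e9}'
```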

## Server Health

The server metrics provide information about the health of your cluster, including file handles, CPU usage, network activity, disk activity, and memory usage.

### File Descriptors

The majority of Consul operations require a file descriptor handle, including receiving a connection from another host, sending data between servers, and writing snapshots to disk. If Consul runs out of handles, it will stop accepting connections.

| Metric Name | Description |
| :---------- | :---------- |
| `linux_sysctl_fs.file-nr` | Number of file handles being used across all processes on the host. |
| `linux_sysctl_fs.file-max` | Total number of available file handles. |

By default, process and kernel limits are conservative, so you may want to increase the limits beyond the defaults. If the `linux_sysctl_fs.file-nr` value exceeds 80% of `linux_sysctl_fs.file-max`, the file handle limits should be increased. Below is an example of a file handle metric.

```sh
linux_sysctl_fs, host=statsbox, file-nr=768i, file-max=96763i
```
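
These values come from the kernel, so you can also read them directly on the host when troubleshooting. On Linux, `/proc/sys/fs/file-nr` reports allocated, free, and maximum handles.

```sh
# Current handle usage ("allocated  free  max") and the system-wide maximum.
$ cat /proc/sys/fs/file-nr
$ cat /proc/sys/fs/file-max
```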

### CPU Usage

Consul should not be demanding of CPU time on either servers or clients. A spike in CPU usage could indicate too many operations taking place at once.

| Metric Name | Description |
| :---------- | :---------- |
| `cpu.user_cpu` | Percentage of CPU being used by user processes (such as Vault or Consul). |
| `cpu.iowait_cpu` | Percentage of CPU time spent waiting for I/O tasks to complete. |

If `cpu.iowait_cpu` is greater than 10%, it should be considered critical as Consul is waiting for data to be written to disk. This could be a sign that Raft is writing snapshots to disk too often. Below is an example of a healthy CPU metric.

```sh
cpu, cpu=cpu-total, usage_idle=99.298, usage_user=0.400, usage_system=0.300, usage_iowait=0, usage_steal=0
```
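
For a quick look at user and iowait CPU time on the host itself, a standard tool such as `iostat` works well. This assumes the `sysstat` package is installed.

```sh
# Report CPU utilization (including %iowait) once per second, three times.
$ iostat -c 1 3
```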

### Network Activity

Network activity should be consistent. A sudden spike in network traffic to Consul might be the result of a misconfigured client, such as Vault, that is causing too many requests.

Most agents will report separate metrics for each network interface, so be sure you are monitoring the right one.

| Metric Name | Description |
| :---------- | :---------- |
| `net.bytes_recv` | Bytes received on each network interface. |
| `net.bytes_sent` | Bytes transmitted on each network interface. |

Sudden increases in the `net` metrics, greater than 50% deviation from baseline, indicate too many requests that are not being handled. Below is an example of a network activity metric.

```sh
net, interface=enp0s5, bytes_sent=6183357i, bytes_recv=262313256i
```

Note: The `net` metrics are counters, so in order to calculate rates, such as bytes/second, you will need to apply a function such as [non_negative_difference](https://docs.influxdata.com/influxdb/v1.5/query_language/functions/#non-negative-difference).
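
The raw interface counters are also visible on the host, which can be handy for a quick sanity check against what your monitoring agent reports. On Linux they live in `/proc/net/dev`; sample the file twice to turn the counters into a rate.

```sh
# Per-interface receive and transmit byte counters.
$ cat /proc/net/dev
```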

### Disk Activity

Normally, there is low disk activity, because Consul keeps everything in memory. If the Consul host is writing a large amount of data to disk, it could mean that Consul is under heavy write load and consequently is checkpointing Raft snapshots to disk frequently. It could also mean that debug/trace logging has accidentally been enabled in production, which can impact performance.

| Metric Name | Description |
| :---------- | :---------- |
| `diskio.read_bytes` | Bytes read from each block device. |
| `diskio.write_bytes` | Bytes written to each block device. |
| `diskio.read_time` | Time spent reading from disk, in cumulative milliseconds. |
| `diskio.write_time` | Time spent writing to disk, in cumulative milliseconds. |

Sudden, large changes in the `diskio` metrics, greater than 50% deviation from baseline or more than 3 standard deviations from baseline, indicate that Consul has too much disk I/O. Too much disk I/O can cause the rest of the system to slow down or become unavailable, as the kernel spends all its time waiting for I/O to complete. Below are examples of disk activity metrics.

```sh
diskio, name=sda5, read_bytes=522298368i, write_bytes=1726865408i, read_time=7248i, write_time=133364i
```

Note: The `diskio` metrics are counters, so in order to calculate rates (such as bytes/second), you will need to apply a function such as [non_negative_difference](https://docs.influxdata.com/influxdb/v1.5/query_language/functions/#non-negative-difference).
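
To see which block device is busy while you investigate, extended `iostat` output gives per-device throughput and utilization. As with the CPU example above, this assumes the `sysstat` package is installed.

```sh
# Extended per-device statistics (throughput, await, %util) once per second, three times.
$ iostat -dx 1 3
```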

### Memory Usage

As noted previously, Consul keeps all of its data -- the KV store, the catalog, etc. -- in memory. If Consul consumes all available memory, it will crash. You should monitor total available RAM to make sure some RAM is available for other system processes, and swap usage should remain at 0% for best performance.

| Metric Name | Description |
| :---------- | :---------- |
| `consul.runtime.alloc_bytes` | Measures the number of bytes allocated by the Consul process. |
| `consul.runtime.sys_bytes`   | The total number of bytes of memory obtained from the OS. |
| `mem.total`                  | Total amount of physical memory (RAM) available on the server. |
| `mem.used_percent`           | Percentage of physical memory in use. |
| `swap.used_percent`          | Percentage of swap space in use. |

Consul servers are running low on memory if `sys_bytes` exceeds 90% of `mem.total`, `mem.used_percent` is over 90%, or `swap.used_percent` is greater than 0. You should increase the memory available to Consul if any of these three conditions are met. Below are examples of memory usage metrics.

```sh
'consul.runtime.alloc_bytes': 11199928.000
'consul.runtime.sys_bytes': 24627448.000
mem,  used_percent=31.492,  total=1036312576i
swap, used_percent=1.343
```
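
A quick way to compare the Consul process against the host's physical memory and swap during an investigation is the standard `free` tool.

```sh
# Physical memory and swap usage on the host, in megabytes.
$ free -m
```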

## Summary

In this guide we reviewed the three methods for collecting metrics. SIGUSR1 and the agent HTTP API are both quick ways to collect metrics, but enabling telemetry is the best method for shipping data to monitoring software over time. Additionally, we outlined the various metrics collected and their significance.