<!--startmeta
custom_edit_url: "https://github.com/netdata/go.d.plugin/edit/master/modules/consul/README.md"
meta_yaml: "https://github.com/netdata/go.d.plugin/edit/master/modules/consul/metadata.yaml"
sidebar_label: "Consul"
learn_status: "Published"
learn_rel_path: "Data Collection/Service Discovery / Registry"
most_popular: True
message: "DO NOT EDIT THIS FILE DIRECTLY, IT IS GENERATED BY THE COLLECTOR'S metadata.yaml FILE"
endmeta-->

# Consul


<img src="https://netdata.cloud/img/consul.svg" width="150"/>


Plugin: go.d.plugin
Module: consul

<img src="https://img.shields.io/badge/maintained%20by-Netdata-%2300ab44" />

## Overview

This collector monitors [key metrics](https://developer.hashicorp.com/consul/docs/agent/telemetry#key-metrics) of Consul Agents: transaction timings, leadership changes, memory usage and more.

It periodically sends HTTP requests to the [Consul REST API](https://developer.hashicorp.com/consul/api-docs).

Used endpoints:

- [/operator/autopilot/health](https://developer.hashicorp.com/consul/api-docs/operator/autopilot#read-health)
- [/agent/checks](https://developer.hashicorp.com/consul/api-docs/agent/check#list-checks)
- [/agent/self](https://developer.hashicorp.com/consul/api-docs/agent#read-configuration)
- [/agent/metrics](https://developer.hashicorp.com/consul/api-docs/agent#view-metrics)
- [/coordinate/nodes](https://developer.hashicorp.com/consul/api-docs/coordinate#read-lan-coordinates-for-all-nodes)
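
To check by hand that an agent answers on the endpoints listed above, a small shell sketch like the following may help. It is not how the collector itself is implemented. The `/v1` API prefix and the `X-Consul-Token` header come from the Consul HTTP API; the function names, the default base URL, and the `CONSUL_TOKEN` variable are assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Assumptions for illustration: function names, the default base URL,
# and the CONSUL_TOKEN variable are not taken from this page.

# Print one full URL per endpoint listed above.
consul_endpoints() {
  local base="${1:-http://127.0.0.1:8500}"
  local path
  for path in operator/autopilot/health agent/checks agent/self \
              agent/metrics coordinate/nodes; do
    # ${base%/} drops a trailing slash so the URL has no double slash.
    printf '%s/v1/%s\n' "${base%/}" "$path"
  done
}

# Request each endpoint and print the HTTP status code next to its URL.
probe_consul() {
  local url code
  local -a auth=()
  # Send the ACL token only when one is set.
  if [ -n "${CONSUL_TOKEN:-}" ]; then
    auth=(-H "X-Consul-Token: ${CONSUL_TOKEN}")
  fi
  while read -r url; do
    code=$(curl -s -o /dev/null -w '%{http_code}' "${auth[@]}" "$url")
    printf '%s %s\n' "$code" "$url"
  done < <(consul_endpoints "$1")
}
```

Running `probe_consul http://127.0.0.1:8500` prints one status line per endpoint. A `403` on some lines would typically point to a token that lacks the matching ACL, and `000` means the agent could not be reached at all.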

This collector is supported on all platforms.

This collector supports collecting metrics from multiple instances of this integration, including remote instances.

### Default Behavior

#### Auto-Detection

This collector discovers instances running on the local host that provide metrics on port 8500.

On startup, it tries to collect metrics from:

- http://localhost:8500
- http://127.0.0.1:8500

#### Limits

The default configuration for this integration does not impose any limits on data collection.

#### Performance Impact

The default configuration for this integration is not expected to impose a significant performance impact on the system.

## Metrics

Metrics are grouped by *scope*.

The scope defines the instance that the metric belongs to. An instance is uniquely identified by a set of labels.

The set of metrics depends on the [Consul Agent mode](https://developer.hashicorp.com/consul/docs/install/glossary#agent).

### Per Consul instance

These metrics refer to the entire monitored application.

This scope has no labels.

Metrics:

| Metric | Dimensions | Unit | Leader | Follower | Client |
|:------|:----------|:----|:---:|:---:|:---:|
| consul.client_rpc_requests_rate | rpc | requests/s | • | • | • |
| consul.client_rpc_requests_exceeded_rate | exceeded | requests/s | • | • | • |
| consul.client_rpc_requests_failed_rate | failed | requests/s | • | • | • |
| consul.memory_allocated | allocated | bytes | • | • | • |
| consul.memory_sys | sys | bytes | • | • | • |
| consul.gc_pause_time | gc_pause | seconds | • | • | • |
| consul.kvs_apply_time | quantile_0.5, quantile_0.9, quantile_0.99 | ms | • | • |   |
| consul.kvs_apply_operations_rate | kvs_apply | ops/s | • | • |   |
| consul.txn_apply_time | quantile_0.5, quantile_0.9, quantile_0.99 | ms | • | • |   |
| consul.txn_apply_operations_rate | txn_apply | ops/s | • | • |   |
| consul.autopilot_health_status | healthy, unhealthy | status | • | • |   |
| consul.autopilot_failure_tolerance | failure_tolerance | servers | • | • |   |
| consul.autopilot_server_health_status | healthy, unhealthy | status | • | • |   |
| consul.autopilot_server_stable_time | stable | seconds | • | • |   |
| consul.autopilot_server_serf_status | active, failed, left, none | status | • | • |   |
| consul.autopilot_server_voter_status | voter, not_voter | status | • | • |   |
| consul.network_lan_rtt | min, max, avg | ms | • | • |   |
| consul.raft_commit_time | quantile_0.5, quantile_0.9, quantile_0.99 | ms | • |   |   |
| consul.raft_commits_rate | commits | commits/s | • |   |   |
| consul.raft_leader_last_contact_time | quantile_0.5, quantile_0.9, quantile_0.99 | ms | • |   |   |
| consul.raft_leader_oldest_log_age | oldest_log_age | seconds | • |   |   |
| consul.raft_follower_last_contact_leader_time | leader_last_contact | ms |   | • |   |
| consul.raft_rpc_install_snapshot_time | quantile_0.5, quantile_0.9, quantile_0.99 | ms |   | • |   |
| consul.raft_leader_elections_rate | leader | elections/s | • | • |   |
| consul.raft_leadership_transitions_rate | leadership | transitions/s | • | • |   |
| consul.server_leadership_status | leader, not_leader | status | • | • |   |
| consul.raft_thread_main_saturation_perc | quantile_0.5, quantile_0.9, quantile_0.99 | percentage | • | • |   |
| consul.raft_thread_fsm_saturation_perc | quantile_0.5, quantile_0.9, quantile_0.99 | percentage | • | • |   |
| consul.raft_fsm_last_restore_duration | last_restore_duration | ms | • | • |   |
| consul.raft_boltdb_freelist_bytes | freelist | bytes | • | • |   |
| consul.raft_boltdb_logs_per_batch_rate | written | logs/s | • | • |   |
| consul.raft_boltdb_store_logs_time | quantile_0.5, quantile_0.9, quantile_0.99 | ms | • | • |   |
| consul.license_expiration_time | license_expiration | seconds | • | • | • |

### Per node check

Metrics about checks at the node level.

Labels:

| Label      | Description     |
|:-----------|:----------------|
| datacenter | Datacenter Identifier |
| node_name | The node's name |
| check_name | The check's name |

Metrics:

| Metric | Dimensions | Unit | Leader | Follower | Client |
|:------|:----------|:----|:---:|:---:|:---:|
| consul.node_health_check_status | passing, maintenance, warning, critical | status | • | • | • |

### Per service check

Metrics about checks at the service level.

Labels:

| Label      | Description     |
|:-----------|:----------------|
| datacenter | Datacenter Identifier |
| node_name | The node's name |
| check_name | The check's name |
| service_name | The service's name |

Metrics:

| Metric | Dimensions | Unit | Leader | Follower | Client |
|:------|:----------|:----|:---:|:---:|:---:|
| consul.service_health_check_status | passing, maintenance, warning, critical | status | • | • | • |

## Alerts

The following alerts are available:

| Alert name  | On metric | Description |
|:------------|:----------|:------------|
| [ consul_node_health_check_status ](https://github.com/netdata/netdata/blob/master/health/health.d/consul.conf) | consul.node_health_check_status | node health check ${label:check_name} has failed on server ${label:node_name} datacenter ${label:datacenter} |
| [ consul_service_health_check_status ](https://github.com/netdata/netdata/blob/master/health/health.d/consul.conf) | consul.service_health_check_status | service health check ${label:check_name} for service ${label:service_name} has failed on server ${label:node_name} datacenter ${label:datacenter} |
| [ consul_client_rpc_requests_exceeded ](https://github.com/netdata/netdata/blob/master/health/health.d/consul.conf) | consul.client_rpc_requests_exceeded_rate | number of rate-limited RPC requests made by server ${label:node_name} datacenter ${label:datacenter} |
| [ consul_client_rpc_requests_failed ](https://github.com/netdata/netdata/blob/master/health/health.d/consul.conf) | consul.client_rpc_requests_failed_rate | number of failed RPC requests made by server ${label:node_name} datacenter ${label:datacenter} |
| [ consul_gc_pause_time ](https://github.com/netdata/netdata/blob/master/health/health.d/consul.conf) | consul.gc_pause_time | time spent in stop-the-world garbage collection pauses on server ${label:node_name} datacenter ${label:datacenter} |
| [ consul_autopilot_health_status ](https://github.com/netdata/netdata/blob/master/health/health.d/consul.conf) | consul.autopilot_health_status | datacenter ${label:datacenter} cluster is unhealthy as reported by server ${label:node_name} |
| [ consul_autopilot_server_health_status ](https://github.com/netdata/netdata/blob/master/health/health.d/consul.conf) | consul.autopilot_server_health_status | server ${label:node_name} from datacenter ${label:datacenter} is unhealthy |
| [ consul_raft_leader_last_contact_time ](https://github.com/netdata/netdata/blob/master/health/health.d/consul.conf) | consul.raft_leader_last_contact_time | median time elapsed since leader server ${label:node_name} datacenter ${label:datacenter} was last able to contact the follower nodes |
| [ consul_raft_leadership_transitions ](https://github.com/netdata/netdata/blob/master/health/health.d/consul.conf) | consul.raft_leadership_transitions_rate | there has been a leadership change and server ${label:node_name} datacenter ${label:datacenter} has become the leader |
| [ consul_raft_thread_main_saturation ](https://github.com/netdata/netdata/blob/master/health/health.d/consul.conf) | consul.raft_thread_main_saturation_perc | average saturation of the main Raft goroutine on server ${label:node_name} datacenter ${label:datacenter} |
| [ consul_raft_thread_fsm_saturation ](https://github.com/netdata/netdata/blob/master/health/health.d/consul.conf) | consul.raft_thread_fsm_saturation_perc | average saturation of the FSM Raft goroutine on server ${label:node_name} datacenter ${label:datacenter} |
| [ consul_license_expiration_time ](https://github.com/netdata/netdata/blob/master/health/health.d/consul.conf) | consul.license_expiration_time | Consul Enterprise licence expiration time on node ${label:node_name} datacenter ${label:datacenter} |

## Setup

### Prerequisites

#### Enable Prometheus telemetry

[Enable](https://developer.hashicorp.com/consul/docs/agent/config/config-files#telemetry-prometheus_retention_time) telemetry on your Consul agent by setting `prometheus_retention_time` to a value greater than `0`.
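
As a minimal sketch, the setting lives in the `telemetry` block of the agent configuration. The helper name, the `telemetry.hcl` file name, the target directory, and the `60s` value below are illustrative assumptions rather than recommendations from this page.

```bash
#!/usr/bin/env bash
# Hypothetical helper: writes a telemetry snippet into a Consul config directory.
write_telemetry_config() {
  local config_dir="$1"
  cat > "${config_dir}/telemetry.hcl" <<'EOF'
telemetry {
  prometheus_retention_time = "60s"
}
EOF
}

# Example (the directory is an assumption; use your agent's config directory):
#   write_telemetry_config /etc/consul.d
# After restarting the agent, Prometheus-format metrics should be served at:
#   curl -s 'http://127.0.0.1:8500/v1/agent/metrics?format=prometheus'
```

If the `curl` check returns an error about Prometheus not being enabled, the retention time is most likely still `0` or the agent has not picked up the new file.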

#### Add required ACLs to Token

Required **only if authentication is enabled**.

| ACL | Endpoint |
|:---------------:|:---------|
| `operator:read` | [autopilot health status](https://developer.hashicorp.com/consul/api-docs/operator/autopilot#read-health) |
| `node:read` | [checks](https://developer.hashicorp.com/consul/api-docs/agent/check#list-checks) |
| `agent:read` | [configuration](https://developer.hashicorp.com/consul/api-docs/agent#read-configuration), [metrics](https://developer.hashicorp.com/consul/api-docs/agent#view-metrics), and [LAN coordinates](https://developer.hashicorp.com/consul/api-docs/coordinate#read-lan-coordinates-for-all-nodes) |
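
One way to grant the ACLs in the table is a dedicated policy with a token attached to it. The sketch below assumes the `consul` CLI is installed and already authorized to manage ACLs. The policy name `netdata`, the function names, and the empty-prefix rules (which match every node and agent) are illustrative choices, not requirements stated on this page.

```bash
#!/usr/bin/env bash
# Write ACL rules granting the read permissions listed in the table.
write_netdata_policy() {
  local rules_file="$1"
  cat > "$rules_file" <<'EOF'
operator = "read"
node_prefix "" {
  policy = "read"
}
agent_prefix "" {
  policy = "read"
}
EOF
}

# Create the policy and a token attached to it (needs a reachable cluster).
create_netdata_token() {
  local rules_file="$1"
  consul acl policy create -name "netdata" -rules "@${rules_file}"
  consul acl token create -description "Netdata monitoring" -policy-name "netdata"
}
```

`consul acl token create` prints a `SecretID`, and that value is what would go into the collector's `acl_token` option. Narrower prefixes can be used instead of `""` if the token should only see part of the cluster.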

### Configuration

#### File

The configuration file name for this integration is `go.d/consul.conf`.

You can edit the configuration file using the `edit-config` script from the
Netdata [config directory](https://github.com/netdata/netdata/blob/master/docs/configure/nodes.md#the-netdata-config-directory).

```bash
cd /etc/netdata 2>/dev/null || cd /opt/netdata/etc/netdata
sudo ./edit-config go.d/consul.conf
```

#### Options

The following options can be defined globally: `update_every`, `autodetection_retry`.

<details><summary>All options</summary>

| Name | Description | Default | Required |
|:----|:-----------|:-------|:--------:|
| update_every | Data collection frequency. | 1 | no |
| autodetection_retry | Recheck interval in seconds. Zero means no recheck will be scheduled. | 0 | no |
| url | Server URL. | http://localhost:8500 | yes |
| acl_token | ACL token used in every request. |  | no |
| max_checks | Checks processing/charting limit. |  | no |
| max_filter | Checks processing/charting filter. Uses [simple patterns](https://github.com/netdata/netdata/blob/master/src/libnetdata/simple_pattern/README.md). |  | no |
| username | Username for basic HTTP authentication. |  | no |
| password | Password for basic HTTP authentication. |  | no |
| proxy_url | Proxy URL. |  | no |
| proxy_username | Username for proxy basic HTTP authentication. |  | no |
| proxy_password | Password for proxy basic HTTP authentication. |  | no |
| timeout | HTTP request timeout. | 1 | no |
| method | HTTP request method. | GET | no |
| body | HTTP request body. |  | no |
| headers | HTTP request headers. |  | no |
| not_follow_redirects | Redirect handling policy. Controls whether the client follows redirects. | no | no |
| tls_skip_verify | Server certificate chain and hostname validation policy. Controls whether the client performs this check. | no | no |
| tls_ca | Certification authority that the client uses when verifying the server's certificates. |  | no |
| tls_cert | Client TLS certificate. |  | no |
| tls_key | Client TLS key. |  | no |

</details>

#### Examples

##### Basic

An example configuration.

```yaml
jobs:
  - name: local
    url: http://127.0.0.1:8500
    acl_token: "ec15675e-2999-d789-832e-8c4794daa8d7"
```

##### Basic HTTP auth

Local server with basic HTTP authentication.

<details><summary>Config</summary>

```yaml
jobs:
  - name: local
    url: http://127.0.0.1:8500
    acl_token: "ec15675e-2999-d789-832e-8c4794daa8d7"
    username: foo
    password: bar
```
</details>

##### Multi-instance

> **Note**: When you define multiple jobs, their names must be unique.

Collecting metrics from local and remote instances.

<details><summary>Config</summary>

```yaml
jobs:
  - name: local
    url: http://127.0.0.1:8500
    acl_token: "ec15675e-2999-d789-832e-8c4794daa8d7"

  - name: remote
    url: http://203.0.113.10:8500
    acl_token: "ada7f751-f654-8872-7f93-498e799158b6"
```
</details>

## Troubleshooting

### Debug Mode

To troubleshoot issues with the `consul` collector, run the `go.d.plugin` with the debug option enabled. The output
should give you clues as to why the collector isn't working.

- Navigate to the `plugins.d` directory, usually at `/usr/libexec/netdata/plugins.d/`. If that's not the case on
  your system, open `netdata.conf` and look for the `plugins` setting under `[directories]`.

  ```bash
  cd /usr/libexec/netdata/plugins.d/
  ```

- Switch to the `netdata` user.

  ```bash
  sudo -u netdata -s
  ```

- Run the `go.d.plugin` to debug the collector:

  ```bash
  ./go.d.plugin -d -m consul
  ```