go.etcd.io/etcd@v3.3.27+incompatible/Documentation/op-guide/monitoring.md

---
title: Monitoring etcd
---

Each etcd server provides local monitoring information on its client port through HTTP endpoints. The monitoring data is useful for both system health checking and cluster debugging.

## Debug endpoint

If `--debug` is set, the etcd server exports debugging information on its client port under the `/debug` path. Take care when setting `--debug`, since it degrades performance and produces verbose logging.

The `/debug/pprof` endpoint is the standard Go runtime profiling endpoint. It can be used to profile CPU, heap, mutex, and goroutine utilization. For example, here `go tool pprof` gets the top 10 functions where etcd spends its time:

```sh
$ go tool pprof http://localhost:2379/debug/pprof/profile
Fetching profile from http://localhost:2379/debug/pprof/profile
Please wait... (30s)
Saved profile in /home/etcd/pprof/pprof.etcd.localhost:2379.samples.cpu.001.pb.gz
Entering interactive mode (type "help" for commands)
(pprof) top10
310ms of 480ms total (64.58%)
Showing top 10 nodes out of 157 (cum >= 10ms)
    flat  flat%   sum%        cum   cum%
   130ms 27.08% 27.08%      130ms 27.08%  runtime.futex
    70ms 14.58% 41.67%       70ms 14.58%  syscall.Syscall
    20ms  4.17% 45.83%       20ms  4.17%  github.com/coreos/etcd/vendor/golang.org/x/net/http2/hpack.huffmanDecode
    20ms  4.17% 50.00%       30ms  6.25%  runtime.pcvalue
    20ms  4.17% 54.17%       50ms 10.42%  runtime.schedule
    10ms  2.08% 56.25%       10ms  2.08%  github.com/coreos/etcd/vendor/github.com/coreos/etcd/etcdserver.(*EtcdServer).AuthInfoFromCtx
    10ms  2.08% 58.33%       10ms  2.08%  github.com/coreos/etcd/vendor/github.com/coreos/etcd/etcdserver.(*EtcdServer).Lead
    10ms  2.08% 60.42%       10ms  2.08%  github.com/coreos/etcd/vendor/github.com/coreos/etcd/pkg/wait.(*timeList).Trigger
    10ms  2.08% 62.50%       10ms  2.08%  github.com/coreos/etcd/vendor/github.com/prometheus/client_golang/prometheus.(*MetricVec).hashLabelValues
    10ms  2.08% 64.58%       10ms  2.08%  github.com/coreos/etcd/vendor/golang.org/x/net/http2.(*Framer).WriteHeaders
```

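The other profiles mentioned above (heap, mutex, goroutine) are served under the same path by Go's `net/http/pprof` package. The following is a minimal sketch, assuming `--debug` is enabled and the client URL is `http://localhost:2379`; the `ETCD_ENDPOINT` variable and the `pprof_url` helper are hypothetical names introduced only for illustration.

```sh
# ETCD_ENDPOINT is an assumed variable; it defaults to the client URL used above.
ETCD_ENDPOINT="${ETCD_ENDPOINT:-http://localhost:2379}"

# pprof_url prints the URL of a named runtime profile.
# $1 is a profile name served by net/http/pprof: profile, heap, mutex, goroutine.
pprof_url() {
    printf '%s/debug/pprof/%s\n' "$ETCD_ENDPOINT" "$1"
}

# Usage against a running server (not executed here):
#   go tool pprof "$(pprof_url heap)"
#   curl -s "$(pprof_url goroutine)?debug=2"   # plain-text goroutine stack dump
```
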
The `/debug/requests` endpoint gives gRPC traces and performance statistics through a web browser. For example, here is a `Range` request for the key `abc`:

```
When	Elapsed (s)
2017/08/18 17:34:51.999317 	0.000244 	/etcdserverpb.KV/Range
17:34:51.999382 	 .    65 	... RPC: from 127.0.0.1:47204 deadline:4.999377747s
17:34:51.999395 	 .    13 	... recv: key:"abc"
17:34:51.999499 	 .   104 	... OK
17:34:51.999535 	 .    36 	... sent: header:<cluster_id:14841639068965178418 member_id:10276657743932975437 revision:15 raft_term:17 > kvs:<key:"abc" create_revision:6 mod_revision:14 version:9 value:"asda" > count:1
```

## Metrics endpoint

Each etcd server exports metrics under the `/metrics` path on its client port and optionally on locations given by `--listen-metrics-urls`.

The metrics can be fetched with `curl`:

```sh
$ curl -L http://localhost:2379/metrics | grep -v debugging # ignore unstable debugging metrics

# HELP etcd_disk_backend_commit_duration_seconds The latency distributions of commit called by backend.
# TYPE etcd_disk_backend_commit_duration_seconds histogram
etcd_disk_backend_commit_duration_seconds_bucket{le="0.002"} 72756
etcd_disk_backend_commit_duration_seconds_bucket{le="0.004"} 401587
etcd_disk_backend_commit_duration_seconds_bucket{le="0.008"} 405979
etcd_disk_backend_commit_duration_seconds_bucket{le="0.016"} 406464
...
```

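Prometheus histogram buckets are cumulative: each `le` bucket counts every observation at or below that bound. As a rough sketch of how to read the output above, the hypothetical helper `bucket_ratio` below divides one bucket by another with `awk`. It assumes the metric carries no label other than `le`, and it uses only the two bucket values shown in the excerpt.

```sh
# bucket_ratio reads metrics text on stdin and prints bucket(le=$2) / bucket(le=$3).
# $1: metric base name, $2: numerator bound, $3: denominator bound.
bucket_ratio() {
    awk -v m="$1" -v a="$2" -v b="$3" '
        index($0, m "_bucket{le=\"" a "\"}") == 1 { num = $2 }
        index($0, m "_bucket{le=\"" b "\"}") == 1 { den = $2 }
        END { if (den > 0) printf "%.4f\n", num / den; else exit 1 }
    '
}

# Share of commits within 16ms that also finished within 8ms, from the sample above.
bucket_ratio etcd_disk_backend_commit_duration_seconds 0.008 0.016 <<'EOF'
etcd_disk_backend_commit_duration_seconds_bucket{le="0.008"} 405979
etcd_disk_backend_commit_duration_seconds_bucket{le="0.016"} 406464
EOF
# prints 0.9988

# Against a running server (not executed here):
#   curl -sL http://localhost:2379/metrics | \
#       bucket_ratio etcd_disk_backend_commit_duration_seconds 0.008 0.016
```
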
## Health Check

Since v3.3.0, any location specified by `--listen-metrics-urls` responds to the `/health` endpoint in addition to the `/metrics` endpoint. This is useful when the standard client endpoint is configured with mutual (client) TLS authentication but a load balancer or monitoring service still needs access to the health check.

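A health probe against such a location might look like the sketch below. It assumes etcd was started with `--listen-metrics-urls=http://localhost:2381` (the port is a hypothetical choice) and that a healthy member answers with a JSON body containing `"health":"true"`; verify the body format against the etcd version in use. `METRICS_URL` and `is_healthy` are illustrative names.

```sh
# METRICS_URL is an assumed value; it must match a URL passed to --listen-metrics-urls.
METRICS_URL="${METRICS_URL:-http://localhost:2381}"

# is_healthy reads a /health response body on stdin and succeeds only if it
# reports health as true. The JSON shape {"health":"true"} is an assumption.
is_healthy() {
    grep -Eq '"health"[[:space:]]*:[[:space:]]*"?true"?'
}

# Usage against a running server (not executed here):
#   curl -fsS "$METRICS_URL/health" | is_healthy && echo "etcd is healthy"
```

Because the probe uses plain HTTP on a separate listener, it needs no client certificate even when the client port requires one.
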
## Prometheus

Running a [Prometheus][prometheus] monitoring service is the easiest way to ingest and record etcd's metrics.

First, install Prometheus:

```sh
PROMETHEUS_VERSION="2.0.0"
wget https://github.com/prometheus/prometheus/releases/download/v$PROMETHEUS_VERSION/prometheus-$PROMETHEUS_VERSION.linux-amd64.tar.gz -O /tmp/prometheus-$PROMETHEUS_VERSION.linux-amd64.tar.gz
tar -xvzf /tmp/prometheus-$PROMETHEUS_VERSION.linux-amd64.tar.gz --directory /tmp/ --strip-components=1
# Prometheus 2.x expects double-dash flags
/tmp/prometheus --version
```

Set Prometheus's scraper to target the etcd cluster endpoints:

```sh
cat > /tmp/test-etcd.yaml <<EOF
global:
  scrape_interval: 10s
scrape_configs:
  - job_name: test-etcd
    static_configs:
    - targets: ['10.240.0.32:2379','10.240.0.33:2379','10.240.0.34:2379']
EOF
cat /tmp/test-etcd.yaml
```

Start Prometheus with this configuration:

```sh
# Prometheus 2.x takes double-dash flags and stores its data under --storage.tsdb.path
nohup /tmp/prometheus \
    --config.file /tmp/test-etcd.yaml \
    --web.listen-address ":9090" \
    --storage.tsdb.path "test-etcd.data" >> /tmp/test-etcd.log  2>&1 &
```

Now Prometheus will scrape etcd metrics every 10 seconds.

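To confirm that the targets are being scraped, the built-in `up` metric can be queried through the Prometheus HTTP API. The sketch below assumes Prometheus listens on `localhost:9090` and uses the `test-etcd` job name from the configuration above. The `count_up_targets` helper is hypothetical and matches the compact JSON text with `grep` rather than parsing it, so a real JSON parser is preferable for anything beyond a quick check.

```sh
# count_up_targets reads an /api/v1/query response for `up{...}` on stdin
# and prints how many returned samples have the value "1" (target reachable).
count_up_targets() {
    grep -o '"value":\[[^]]*,"1"]' | wc -l | tr -d ' '
}

# Usage against a running Prometheus (not executed here):
#   curl -sG http://localhost:9090/api/v1/query \
#       --data-urlencode 'query=up{job="test-etcd"}' | count_up_targets
```

With the three targets configured above, a fully reachable cluster should yield a count of 3.
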
### Alerting

Default alert rules for etcd v3 clusters are provided for [Prometheus 1.x](./etcd3_alert.rules) as well as [Prometheus 2.x](./etcd3_alert.rules.yml).

> Note: `job` labels may need to be adjusted to fit a particular need. The rules were written to apply to a single cluster, so it is recommended to choose labels unique to a cluster.

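To illustrate what adjusting the `job` label means, here is a minimal sketch of a Prometheus 2.x rule file scoped to the `test-etcd` job used earlier. It is not a copy of the bundled rules: the group name, alert name, threshold duration, severity label, and file path are all assumptions chosen for the example. It relies on the `etcd_server_has_leader` metric.

```sh
# RULES_FILE is a hypothetical path for this example.
RULES_FILE="${RULES_FILE:-/tmp/test-etcd.rules.yml}"

# The quoted 'EOF' keeps the shell from expanding $labels in the template.
cat > "$RULES_FILE" <<'EOF'
groups:
- name: etcd-sketch
  rules:
  - alert: EtcdNoLeader
    # The job matcher restricts the rule to one cluster's scrape job.
    expr: etcd_server_has_leader{job="test-etcd"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "etcd member {{ $labels.instance }} has no leader"
EOF
```

Prometheus loads such a file only if it is listed under `rule_files` in its configuration.
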
### Grafana

[Grafana][grafana] has built-in Prometheus support; just add a Prometheus data source:

```
Name:   test-etcd
Type:   Prometheus
Url:    http://localhost:9090
Access: proxy
```

Then import the default [etcd dashboard template][template] and customize it. For instance, if the Prometheus data source name is `my-etcd`, the `datasource` field values in the JSON also need to be `my-etcd`.

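The `datasource` values can be rewritten in bulk before importing. This is a sketch only: it assumes each `datasource` field in the template holds a plain string, and the `set_datasource` helper is a hypothetical name. A data source name containing `/` or `&` would need escaping for `sed`.

```sh
# set_datasource reads dashboard JSON on stdin and writes it to stdout with
# every "datasource": "<name>" value replaced by $1.
set_datasource() {
    sed -E 's/("datasource"[[:space:]]*:[[:space:]]*)"[^"]*"/\1"'"$1"'"/g'
}

# Usage (not executed here):
#   set_datasource my-etcd < grafana.json > my-etcd-dashboard.json
```
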
Sample dashboard:

![](./etcd-sample-grafana.png)


[prometheus]: https://prometheus.io/
[grafana]: http://grafana.org/
[template]: ./grafana.json