---
title: Monitoring etcd
---

Each etcd server provides local monitoring information on its client port through HTTP endpoints. The monitoring data is useful for both system health checking and cluster debugging.

## Debug endpoint

If `--debug` is set, the etcd server exports debugging information on its client port under the `/debug` path. Take care when setting `--debug`, since there will be degraded performance and verbose logging.

The `/debug/pprof` endpoint is the standard Go runtime profiling endpoint. This can be used to profile CPU, heap, mutex, and goroutine utilization. For example, here `go tool pprof` gets the top 10 functions where etcd spends its time:

```sh
$ go tool pprof http://localhost:2379/debug/pprof/profile
Fetching profile from http://localhost:2379/debug/pprof/profile
Please wait... (30s)
Saved profile in /home/etcd/pprof/pprof.etcd.localhost:2379.samples.cpu.001.pb.gz
Entering interactive mode (type "help" for commands)
(pprof) top10
310ms of 480ms total (64.58%)
Showing top 10 nodes out of 157 (cum >= 10ms)
    flat  flat%   sum%        cum   cum%
   130ms 27.08% 27.08%      130ms 27.08%  runtime.futex
    70ms 14.58% 41.67%       70ms 14.58%  syscall.Syscall
    20ms  4.17% 45.83%       20ms  4.17%  github.com/coreos/etcd/vendor/golang.org/x/net/http2/hpack.huffmanDecode
    20ms  4.17% 50.00%       30ms  6.25%  runtime.pcvalue
    20ms  4.17% 54.17%       50ms 10.42%  runtime.schedule
    10ms  2.08% 56.25%       10ms  2.08%  github.com/coreos/etcd/vendor/github.com/coreos/etcd/etcdserver.(*EtcdServer).AuthInfoFromCtx
    10ms  2.08% 58.33%       10ms  2.08%  github.com/coreos/etcd/vendor/github.com/coreos/etcd/etcdserver.(*EtcdServer).Lead
    10ms  2.08% 60.42%       10ms  2.08%  github.com/coreos/etcd/vendor/github.com/coreos/etcd/pkg/wait.(*timeList).Trigger
    10ms  2.08% 62.50%       10ms  2.08%  github.com/coreos/etcd/vendor/github.com/prometheus/client_golang/prometheus.(*MetricVec).hashLabelValues
    10ms  2.08% 64.58%       10ms  2.08%  github.com/coreos/etcd/vendor/golang.org/x/net/http2.(*Framer).WriteHeaders
```

The `/debug/requests` endpoint gives gRPC traces and performance statistics through a web browser. For example, here is a `Range` request for the key `abc`:

```
When                       Elapsed (s)
2017/08/18 17:34:51.999317 0.000244   /etcdserverpb.KV/Range
17:34:51.999382   .    65   ... RPC: from 127.0.0.1:47204 deadline:4.999377747s
17:34:51.999395   .    13   ... recv: key:"abc"
17:34:51.999499   .   104   ... OK
17:34:51.999535   .    36   ... sent: header:<cluster_id:14841639068965178418 member_id:10276657743932975437 revision:15 raft_term:17 > kvs:<key:"abc" create_revision:6 mod_revision:14 version:9 value:"asda" > count:1
```

## Metrics endpoint

Each etcd server exports metrics under the `/metrics` path on its client port and optionally on locations given by `--listen-metrics-urls`.

The metrics can be fetched with `curl`:

```sh
$ curl -L http://localhost:2379/metrics | grep -v debugging # ignore unstable debugging metrics

# HELP etcd_disk_backend_commit_duration_seconds The latency distributions of commit called by backend.
# TYPE etcd_disk_backend_commit_duration_seconds histogram
etcd_disk_backend_commit_duration_seconds_bucket{le="0.002"} 72756
etcd_disk_backend_commit_duration_seconds_bucket{le="0.004"} 401587
etcd_disk_backend_commit_duration_seconds_bucket{le="0.008"} 405979
etcd_disk_backend_commit_duration_seconds_bucket{le="0.016"} 406464
...
```

## Health Check

Since v3.3.0, in addition to responding to the `/metrics` endpoint, any locations specified by `--listen-metrics-urls` will also respond to the `/health` endpoint.
This can be useful if the standard endpoint is configured with mutual (client) TLS authentication, but a load balancer or monitoring service still needs access to the health check.

## Prometheus

Running a [Prometheus][prometheus] monitoring service is the easiest way to ingest and record etcd's metrics.

First, install Prometheus:

```sh
PROMETHEUS_VERSION="2.0.0"
wget https://github.com/prometheus/prometheus/releases/download/v$PROMETHEUS_VERSION/prometheus-$PROMETHEUS_VERSION.linux-amd64.tar.gz -O /tmp/prometheus-$PROMETHEUS_VERSION.linux-amd64.tar.gz
tar -xvzf /tmp/prometheus-$PROMETHEUS_VERSION.linux-amd64.tar.gz --directory /tmp/ --strip-components=1
/tmp/prometheus --version
```

Set Prometheus's scraper to target the etcd cluster endpoints:

```sh
cat > /tmp/test-etcd.yaml <<EOF
global:
  scrape_interval: 10s
scrape_configs:
  - job_name: test-etcd
    static_configs:
    - targets: ['10.240.0.32:2379','10.240.0.33:2379','10.240.0.34:2379']
EOF
cat /tmp/test-etcd.yaml
```

Set up the Prometheus handler:

```sh
nohup /tmp/prometheus \
    --config.file /tmp/test-etcd.yaml \
    --web.listen-address ":9090" \
    --storage.tsdb.path "test-etcd.data" >> /tmp/test-etcd.log 2>&1 &
```

Now Prometheus will scrape etcd metrics every 10 seconds.


### Alerting

There is a set of default alerts for etcd v3 clusters for [Prometheus 1.x](./etcd3_alert.rules) as well as [Prometheus 2.x](./etcd3_alert.rules.yml).

> Note: `job` labels may need to be adjusted to fit a particular need. The rules were written to apply to a single cluster so it is recommended to choose labels unique to a cluster.

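
One way to load the Prometheus 2.x rules is the `rule_files` key of the Prometheus configuration. The following is a minimal sketch that rewrites the scrape configuration used earlier. It assumes the rules file has been copied to `/tmp/etcd3_alert.rules.yml`, which is a hypothetical location chosen for illustration; the job name and targets are the ones from the example configuration.

```sh
# Assumption: etcd3_alert.rules.yml was copied to /tmp beforehand
# (hypothetical path; adjust to where the rules file actually lives).
cat > /tmp/test-etcd.yaml <<EOF
global:
  scrape_interval: 10s
rule_files:
  - /tmp/etcd3_alert.rules.yml
scrape_configs:
  - job_name: test-etcd
    static_configs:
    - targets: ['10.240.0.32:2379','10.240.0.33:2379','10.240.0.34:2379']
EOF
cat /tmp/test-etcd.yaml
```

Prometheus reads this file at startup, so a running instance has to be restarted (or told to reload its configuration) before the rules take effect. If the `promtool` binary from the same release tarball is available, `promtool check config /tmp/test-etcd.yaml` should help catch syntax errors first.
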
### Grafana

[Grafana][grafana] has built-in Prometheus support; just add a Prometheus data source:

```
Name:   test-etcd
Type:   Prometheus
Url:    http://localhost:9090
Access: proxy
```

Then import the default [etcd dashboard template][template] and customize. For instance, if the Prometheus data source name is `my-etcd`, the `datasource` field values in JSON also need to be `my-etcd`.

Sample dashboard:

![](./etcd-sample-grafana.png)


[prometheus]: https://prometheus.io/
[grafana]: http://grafana.org/
[template]: ./grafana.json
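
The data source renaming described in the Grafana section can be scripted rather than done by hand. The following is a minimal sketch, not part of the etcd tooling: `rename_datasource` is a hypothetical helper, and it assumes every `datasource` field in the template is a plain JSON string written as `"datasource": "..."`.

```sh
# rename_datasource NAME FILE
# Hypothetical helper: prints FILE with every string-valued "datasource"
# field set to NAME. Assumes NAME contains no '/', '&', or '\' characters,
# since those are special in a sed replacement.
rename_datasource() {
  sed 's/"datasource": *"[^"]*"/"datasource": "'"$1"'"/g' "$2"
}
```

For example, `rename_datasource my-etcd grafana.json > /tmp/my-etcd-dashboard.json` would write a copy of the template that targets a data source named `my-etcd`, leaving the original file untouched.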