# ADR 011: Monitoring

## Changelog

08-06-2018: Initial draft
11-06-2018: Reorg after @xla comments
13-06-2018: Clarification about usage of labels

## Context

In order to bring more visibility into Tendermint, we would like it to report
metrics and, maybe later, traces of transactions and RPC queries. See
https://github.com/DFWallet/tendermint-cosmos/issues/986.

A few solutions were considered:

1. [Prometheus](https://prometheus.io)
   a) Prometheus API
   b) [go-kit metrics package](https://github.com/go-kit/kit/tree/master/metrics) as an interface plus Prometheus
   c) [telegraf](https://github.com/influxdata/telegraf)
   d) new service, which will listen to events emitted by pubsub and report metrics
2. [OpenCensus](https://opencensus.io/introduction/)

### 1. Prometheus

Prometheus seems to be the most popular product out there for monitoring. It has
a Go client library, powerful queries, and alerts.

**a) Prometheus API**

We can commit to using Prometheus in Tendermint, but I think Tendermint users
should be free to choose whatever monitoring tool they feel will better suit
their needs (if they don't already have one). So we should try to abstract the
interface enough that people can switch between Prometheus and other similar
tools.

**b) go-kit metrics package as an interface**

The metrics package provides a set of uniform interfaces for service
instrumentation and offers adapters to popular metrics packages:

https://godoc.org/github.com/go-kit/kit/metrics#pkg-subdirectories

Compared to the Prometheus API, we lose customisability and control, but gain
the freedom to choose any instrument from the above list, provided we extract
metrics creation into a separate function (see "providers" in node/node.go).

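
The "providers" idea in option (b) can be sketched as follows. This is a minimal illustration, not the code in node/node.go: the `Gauge` interface is a local stand-in modeled on the one in go-kit's metrics package (declared here so the example needs only the standard library), and `ConsensusMetrics`, `MetricsProvider`, `NopMetricsProvider`, `InMemoryMetricsProvider`, and `recordBlock` are hypothetical names chosen for this sketch.

```go
package main

import "fmt"

// Gauge is a local stand-in modeled on go-kit's metrics.Gauge interface.
// It is declared here only to keep the sketch free of external imports.
type Gauge interface {
	With(labelValues ...string) Gauge
	Set(value float64)
	Add(delta float64)
}

// nopGauge discards every observation (monitoring disabled).
type nopGauge struct{}

func (nopGauge) With(...string) Gauge { return nopGauge{} }
func (nopGauge) Set(float64)          {}
func (nopGauge) Add(float64)          {}

// memGauge keeps the latest value in memory. It is a hypothetical backend
// used for illustration, ignores labels, and is not safe for concurrent use.
type memGauge struct{ value float64 }

func (g *memGauge) With(...string) Gauge { return g }
func (g *memGauge) Set(v float64)        { g.value = v }
func (g *memGauge) Add(d float64)        { g.value += d }

// ConsensusMetrics groups two gauges from the metrics list
// (consensus_height and consensus_validators).
type ConsensusMetrics struct {
	Height     Gauge
	Validators Gauge
}

// MetricsProvider is the "separate function" that creates metrics, so the
// backend is chosen in one place when the node is assembled.
type MetricsProvider func() *ConsensusMetrics

// NopMetricsProvider returns metrics that record nothing.
func NopMetricsProvider() *ConsensusMetrics {
	return &ConsensusMetrics{Height: nopGauge{}, Validators: nopGauge{}}
}

// InMemoryMetricsProvider returns metrics backed by memGauge.
func InMemoryMetricsProvider() *ConsensusMetrics {
	return &ConsensusMetrics{Height: &memGauge{}, Validators: &memGauge{}}
}

// recordBlock stands in for instrumented consensus code. It depends only on
// the Gauge interface, never on a concrete monitoring backend.
func recordBlock(m *ConsensusMetrics, height int64, validators int) {
	m.Height.Set(float64(height))
	m.Validators.Set(float64(validators))
}

func main() {
	var provider MetricsProvider = InMemoryMetricsProvider
	m := provider()
	recordBlock(m, 42, 4)
	fmt.Println(m.Height.(*memGauge).value, m.Validators.(*memGauge).value)
}
```

Swapping `InMemoryMetricsProvider` for `NopMetricsProvider` (or for a provider that wraps a real adapter) changes the backend without touching `recordBlock`, which is the trade-off option (b) describes.
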

**c) telegraf**

Unlike the options already discussed, telegraf does not require modifying
Tendermint source code. You create something called an input plugin, which polls
Tendermint RPC every second and calculates the metrics itself.

While this may sound good, some metrics we want to report are not exposed via
RPC or pubsub and therefore can't be accessed externally.

**d) service, listening to pubsub**

Same issue as the above.

### 2. OpenCensus

OpenCensus provides both metrics and tracing, which may be important in the
future. Its API looks different from go-kit and Prometheus, but it appears to
cover everything we need.

Unfortunately, the OpenCensus Go client does not define any
interfaces, so if we want to abstract away metrics we
will need to write interfaces ourselves.

### List of metrics

|     | Name                                 | Type   | Description |
| --- | ------------------------------------ | ------ | ----------- |
| A   | consensus_height                     | Gauge  | |
| A   | consensus_validators                 | Gauge  | Number of validators who signed |
| A   | consensus_validators_power           | Gauge  | Total voting power of all validators |
| A   | consensus_missing_validators         | Gauge  | Number of validators who did not sign |
| A   | consensus_missing_validators_power   | Gauge  | Total voting power of the missing validators |
| A   | consensus_byzantine_validators       | Gauge  | Number of validators who tried to double sign |
| A   | consensus_byzantine_validators_power | Gauge  | Total voting power of the byzantine validators |
| A   | consensus_block_interval             | Timing | Time between this and last block (Block.Header.Time) |
|     | consensus_block_time                 | Timing | Time to create a block (from creating a proposal to commit) |
|     | consensus_time_between_blocks        | Timing | Time between committing last block and (receiving proposal or creating proposal) |
| A   | consensus_rounds                     | Gauge  | Number of rounds |
|     | consensus_prevotes                   | Gauge  | |
|     | consensus_precommits                 | Gauge  | |
|     | consensus_prevotes_total_power       | Gauge  | |
|     | consensus_precommits_total_power     | Gauge  | |
| A   | consensus_num_txs                    | Gauge  | |
| A   | mempool_size                         | Gauge  | |
| A   | consensus_total_txs                  | Gauge  | |
| A   | consensus_block_size                 | Gauge  | In bytes |
| A   | p2p_peers                            | Gauge  | Number of peers the node is connected to |

`A` - will be implemented first.

**Proposed solution**

## Status

Proposed.

## Consequences

### Positive

Better visibility, support for a variety of monitoring backends.

### Negative

One more library to audit; metrics reporting code gets mixed with business domain code.

### Neutral

-