# ADR 011: Monitoring

## Changelog

- 08-06-2018: Initial draft
- 11-06-2018: Reorg after @xla comments
- 13-06-2018: Clarification about usage of labels

## Context

To bring more visibility into Tendermint, we would like it to report
metrics and, maybe later, traces of transactions and RPC queries. See
https://github.com/DFWallet/tendermint-cosmos/issues/986.

A few solutions were considered:

1. [Prometheus](https://prometheus.io)
   - a) Prometheus API
   - b) [go-kit metrics package](https://github.com/go-kit/kit/tree/master/metrics) as an interface plus Prometheus
   - c) [telegraf](https://github.com/influxdata/telegraf)
   - d) a new service that listens to events emitted by pubsub and reports metrics
2. [OpenCensus](https://opencensus.io/introduction/)

### 1. Prometheus

Prometheus seems to be the most popular monitoring product out there. It has
a Go client library, a powerful query language, and alerting.

**a) Prometheus API**

We can commit to using Prometheus in Tendermint, but I think Tendermint users
should be free to choose whatever monitoring tool they feel best suits
their needs (if they don't have an existing one already). So we should try to
abstract the interface enough that people can switch between Prometheus and other
similar tools.

**b) go-kit metrics package as an interface**

The go-kit metrics package provides a set of uniform interfaces for service
instrumentation and offers adapters to popular metrics packages:

https://godoc.org/github.com/go-kit/kit/metrics#pkg-subdirectories

Compared to the Prometheus API, we lose customisability and control, but gain the
freedom to choose any backend from the above list, provided we extract
metrics creation into a separate function (see "providers" in node/node.go).

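
The provider idea from option b) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in node/node.go. The `Gauge` interface is redeclared locally, modelled on the go-kit interface of the same name, so that the example needs no third-party modules. `Metrics`, `MetricsProvider`, `NopProvider`, `MemProvider`, `memGauge`, and `OnNewBlock` are hypothetical names. Because the instrumented code depends only on the interface, switching the backend amounts to passing a different provider.

```go
package main

import "sync"

// Gauge is a local stand-in modelled on the Gauge interface in go-kit's
// metrics package, redeclared here so the sketch has no external imports.
type Gauge interface {
	With(labelValues ...string) Gauge
	Set(value float64)
	Add(delta float64)
}

// Metrics groups the gauges a component reports (hypothetical field names).
type Metrics struct {
	Height Gauge
	Peers  Gauge
}

// MetricsProvider builds a Metrics value; a node would call it once at startup.
type MetricsProvider func() *Metrics

// nopGauge discards every update; useful when monitoring is disabled.
type nopGauge struct{}

func (nopGauge) With(...string) Gauge { return nopGauge{} }
func (nopGauge) Set(float64)          {}
func (nopGauge) Add(float64)          {}

// NopProvider returns metrics that do nothing.
func NopProvider() *Metrics {
	return &Metrics{Height: nopGauge{}, Peers: nopGauge{}}
}

// memGauge keeps the latest value in memory. It stands in for a real
// backend adapter and ignores labels.
type memGauge struct {
	mu    sync.Mutex
	value float64
}

func (g *memGauge) With(...string) Gauge { return g }

func (g *memGauge) Set(v float64) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.value = v
}

func (g *memGauge) Add(d float64) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.value += d
}

// Value returns the current reading.
func (g *memGauge) Value() float64 {
	g.mu.Lock()
	defer g.mu.Unlock()
	return g.value
}

// MemProvider returns metrics backed by in-memory gauges.
func MemProvider() *Metrics {
	return &Metrics{Height: &memGauge{}, Peers: &memGauge{}}
}

// OnNewBlock is the instrumented code path; it only sees the interface.
func OnNewBlock(m *Metrics, height int64) {
	m.Height.Set(float64(height))
}
```
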
**c) telegraf**

Unlike the options discussed so far, telegraf does not require modifying Tendermint
source code. You create something called an input plugin, which polls
Tendermint RPC every second and calculates the metrics itself.

While this may sound good, some metrics we want to report are not exposed via
RPC or pubsub and therefore can't be accessed externally.

**d) service, listening to pubsub**

Same issue as the above: some of the metrics we want are not exposed via RPC or pubsub.

### 2. OpenCensus

OpenCensus provides both metrics and tracing, which may be important in the
future. Its API looks different from go-kit and Prometheus, but it appears to
cover everything we need.

Unfortunately, the OpenCensus Go client does not define any
interfaces, so if we want to abstract away metrics we
will need to write interfaces ourselves.

### List of metrics

|     | Name                                 | Type   | Description                                                                   |
| --- | ------------------------------------ | ------ | ----------------------------------------------------------------------------- |
| A   | consensus_height                     | Gauge  |                                                                               |
| A   | consensus_validators                 | Gauge  | Number of validators who signed                                               |
| A   | consensus_validators_power           | Gauge  | Total voting power of all validators                                          |
| A   | consensus_missing_validators         | Gauge  | Number of validators who did not sign                                         |
| A   | consensus_missing_validators_power   | Gauge  | Total voting power of the missing validators                                  |
| A   | consensus_byzantine_validators       | Gauge  | Number of validators who tried to double sign                                 |
| A   | consensus_byzantine_validators_power | Gauge  | Total voting power of the byzantine validators                                |
| A   | consensus_block_interval             | Timing | Time between this and last block (Block.Header.Time)                          |
|     | consensus_block_time                 | Timing | Time to create a block (from creating a proposal to commit)                   |
|     | consensus_time_between_blocks        | Timing | Time between committing the last block and receiving or creating a proposal   |
| A   | consensus_rounds                     | Gauge  | Number of rounds                                                              |
|     | consensus_prevotes                   | Gauge  |                                                                               |
|     | consensus_precommits                 | Gauge  |                                                                               |
|     | consensus_prevotes_total_power       | Gauge  |                                                                               |
|     | consensus_precommits_total_power     | Gauge  |                                                                               |
| A   | consensus_num_txs                    | Gauge  |                                                                               |
| A   | mempool_size                         | Gauge  |                                                                               |
| A   | consensus_total_txs                  | Gauge  |                                                                               |
| A   | consensus_block_size                 | Gauge  | In bytes                                                                      |
| A   | p2p_peers                            | Gauge  | Number of peers the node is connected to                                      |

`A` - will be implemented first.

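
As an illustration of how the four validator gauges in the table relate to each other, the sketch below computes them from a simplified validator list. The `Validator` and `ValidatorGauges` types and the `ComputeValidatorGauges` function are hypothetical and do not correspond to Tendermint's own types. The sketch follows the descriptions in the table as written, so `consensus_validators` counts signers while `consensus_validators_power` sums the power of all validators.

```go
package main

// Validator is a hypothetical, simplified view of a validator at one height.
type Validator struct {
	Power  int64 // voting power
	Signed bool  // whether its signature is in the commit for this block
}

// ValidatorGauges holds the values for four gauges from the table.
type ValidatorGauges struct {
	Validators             int   // consensus_validators
	ValidatorsPower        int64 // consensus_validators_power
	MissingValidators      int   // consensus_missing_validators
	MissingValidatorsPower int64 // consensus_missing_validators_power
}

// ComputeValidatorGauges derives the gauge values in a single pass.
func ComputeValidatorGauges(vals []Validator) ValidatorGauges {
	var g ValidatorGauges
	for _, v := range vals {
		// Total power covers every validator, signer or not.
		g.ValidatorsPower += v.Power
		if v.Signed {
			g.Validators++
		} else {
			g.MissingValidators++
			g.MissingValidatorsPower += v.Power
		}
	}
	return g
}
```
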
**Proposed solution**

## Status

Proposed.

## Consequences

### Positive

Better visibility, support for a variety of monitoring backends.

### Negative

One more library to audit; metrics reporting code gets mixed in with business domain code.

### Neutral

-