# ADR 013: Observability

## Changelog

* 20-01-2020: Initial Draft

## Status

Proposed

## Context

Telemetry is paramount for debugging and understanding what the application is doing and how it is
performing. We aim to expose metrics from modules and other core parts of the Cosmos SDK.

In addition, we should aim to support multiple configurable sinks that an operator may choose from.
By default, when telemetry is enabled, the application should track and expose metrics that are
stored in-memory. The operator may choose to enable additional sinks; for now we support only
[Prometheus](https://prometheus.io/), as it is battle-tested, simple to set up, open source,
and rich with ecosystem tooling.

We must also aim to integrate metrics into the Cosmos SDK as seamlessly as possible, so that
metrics may be added or removed at will and without much friction. To do this, we will use the
[go-metrics](https://github.com/hashicorp/go-metrics) library.

Finally, operators may enable telemetry along with specific configuration options. If enabled, metrics
will be exposed at `/metrics?format={text|prometheus}` via the API server.

## Decision

We will add a configuration block to `app.toml` that defines telemetry settings:

```toml
###############################################################################
###                         Telemetry Configuration                         ###
###############################################################################

[telemetry]

# Prefixed with keys to separate services
service-name = {{ .Telemetry.ServiceName }}

# Enabled enables the application telemetry functionality. When enabled,
# an in-memory sink is also enabled by default. Operators may also enable
# other sinks such as Prometheus.
enabled = {{ .Telemetry.Enabled }}

# Enable prefixing gauge values with hostname
enable-hostname = {{ .Telemetry.EnableHostname }}

# Enable adding hostname to labels
enable-hostname-label = {{ .Telemetry.EnableHostnameLabel }}

# Enable adding service to labels
enable-service-label = {{ .Telemetry.EnableServiceLabel }}

# PrometheusRetentionTime, when positive, enables a Prometheus metrics sink.
prometheus-retention-time = {{ .Telemetry.PrometheusRetentionTime }}
```
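
How these settings select sinks can be sketched in a few lines of Go. This is a minimal illustration, not the SDK's implementation: the `TelemetryConfig` field names follow the template placeholders above, but the Go types, the unit of the retention time, the `EnabledSinks` helper, and the `simd` service name are all assumptions made for the example.

```go
package main

import "fmt"

// TelemetryConfig mirrors the [telemetry] block of app.toml. Field names come
// from the template placeholders; the Go types are assumptions.
type TelemetryConfig struct {
	ServiceName             string
	Enabled                 bool
	EnableHostname          bool
	EnableHostnameLabel     bool
	EnableServiceLabel      bool
	PrometheusRetentionTime int64 // assumed to be seconds; > 0 enables Prometheus
}

// EnabledSinks is a hypothetical helper that lists the sinks implied by cfg.
func EnabledSinks(cfg TelemetryConfig) []string {
	if !cfg.Enabled {
		// Telemetry is off: no sinks, regardless of the other settings.
		return nil
	}

	// The in-memory sink is always present once telemetry is enabled.
	sinks := []string{"in-memory"}

	// A positive retention time additionally enables the Prometheus sink.
	if cfg.PrometheusRetentionTime > 0 {
		sinks = append(sinks, "prometheus")
	}
	return sinks
}

func main() {
	cfg := TelemetryConfig{ServiceName: "simd", Enabled: true, PrometheusRetentionTime: 60}
	fmt.Println(EnabledSinks(cfg)) // prints [in-memory prometheus]
}
```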

The given configuration allows for two sinks: in-memory and Prometheus. We create a `Metrics`
type that performs all the bootstrapping for the operator, so capturing metrics becomes seamless.

```go
// Metrics defines a wrapper around application telemetry functionality. It allows
// metrics to be gathered at any point in time. When creating a Metrics object,
// internally, a global metrics is registered with a set of sinks as configured
// by the operator. In addition to the sinks, when a process gets a SIGUSR1, a
// dump of formatted recent metrics will be sent to STDERR.
type Metrics struct {
  memSink           *metrics.InmemSink
  prometheusEnabled bool
}

// Gather collects all registered metrics and returns a GatherResponse where the
// metrics are encoded depending on the type. Metrics are either encoded via
// Prometheus or JSON if in-memory.
func (m *Metrics) Gather(format string) (GatherResponse, error) {
  switch format {
  case FormatPrometheus:
    return m.gatherPrometheus()

  case FormatText, FormatDefault:
    // The text format and the default format share the in-memory encoding.
    return m.gatherGeneric()

  default:
    return GatherResponse{}, fmt.Errorf("unsupported metrics format: %s", format)
  }
}
```
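
The snippet above refers to `GatherResponse` and the `Format*` constants without defining them. The following self-contained sketch shows one plausible shape for them. The constant values are assumptions chosen to match the `format` query parameter, the two struct fields are the ones the HTTP handler reads, and `gather` is a hypothetical stand-in that encodes a caller-supplied map of gauges rather than reading real sinks. The content types used here are likewise assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// Assumed values, matching /metrics?format={text|prometheus}.
const (
	FormatDefault    = ""
	FormatPrometheus = "prometheus"
	FormatText       = "text"
)

// GatherResponse carries the encoded metrics and their content type.
type GatherResponse struct {
	Metrics     []byte
	ContentType string
}

// gather is a stand-in for Metrics.Gather that encodes the given gauges.
func gather(format string, gauges map[string]float64) (GatherResponse, error) {
	switch format {
	case FormatPrometheus:
		names := make([]string, 0, len(gauges))
		for name := range gauges {
			names = append(names, name)
		}
		sort.Strings(names) // map order is random; sort for stable output

		// One "name value" line per gauge, as in the Prometheus text format.
		var sb strings.Builder
		for _, name := range names {
			fmt.Fprintf(&sb, "%s %g\n", name, gauges[name])
		}
		return GatherResponse{Metrics: []byte(sb.String()), ContentType: "text/plain; version=0.0.4"}, nil

	case FormatText, FormatDefault:
		bz, err := json.Marshal(gauges)
		if err != nil {
			return GatherResponse{}, err
		}
		return GatherResponse{Metrics: bz, ContentType: "application/json"}, nil

	default:
		return GatherResponse{}, fmt.Errorf("unsupported metrics format: %s", format)
	}
}

func main() {
	gauges := map[string]float64{"mempool_size": 42}
	for _, format := range []string{FormatDefault, FormatPrometheus, "xml"} {
		gr, err := gather(format, gauges)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s: %s\n", gr.ContentType, gr.Metrics)
	}
}
```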

In addition, `Metrics` allows us to gather the current set of metrics at any given point in time. An
operator may also choose to send a signal, SIGUSR1, to dump and print formatted metrics to STDERR.

During an application's bootstrapping and construction phase, if `Telemetry.Enabled` is `true`, the
API server will create a `Metrics` instance and register a metrics handler accordingly.

```go
func (s *Server) Start(cfg config.Config) error {
  // ...

  if cfg.Telemetry.Enabled {
    m, err := telemetry.New(cfg.Telemetry)
    if err != nil {
      return err
    }

    s.metrics = m
    s.registerMetrics()
  }

  // ...
}

func (s *Server) registerMetrics() {
  metricsHandler := func(w http.ResponseWriter, r *http.Request) {
    format := strings.TrimSpace(r.FormValue("format"))

    gr, err := s.metrics.Gather(format)
    if err != nil {
      rest.WriteErrorResponse(w, http.StatusBadRequest, fmt.Sprintf("failed to gather metrics: %s", err))
      return
    }

    w.Header().Set("Content-Type", gr.ContentType)
    _, _ = w.Write(gr.Metrics)
  }

  s.Router.HandleFunc("/metrics", metricsHandler).Methods("GET")
}
```
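
To see how the handler behaves from a client's point of view, the sketch below rebuilds it with only the Go standard library and exercises it through `net/http/httptest`. It is an illustration under assumptions: `http.ServeMux` and `http.Error` replace the router and `rest.WriteErrorResponse` used above, the route is not restricted to `GET`, and `gatherFunc`, `stubGather`, and `probe` are hypothetical names whose canned responses stand in for real metrics.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// gatherFunc stands in for Metrics.Gather in this sketch.
type gatherFunc func(format string) (body []byte, contentType string, err error)

// metricsHandler follows the same flow as the handler in the ADR: read the
// format parameter, gather, and answer 400 if the format is not supported.
func metricsHandler(gather gatherFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		format := strings.TrimSpace(r.FormValue("format"))

		body, contentType, err := gather(format)
		if err != nil {
			http.Error(w, fmt.Sprintf("failed to gather metrics: %s", err), http.StatusBadRequest)
			return
		}

		w.Header().Set("Content-Type", contentType)
		_, _ = w.Write(body)
	}
}

// stubGather returns canned payloads for the formats named in this ADR.
func stubGather(format string) ([]byte, string, error) {
	switch format {
	case "", "text":
		return []byte(`{"Gauges":[]}`), "application/json", nil
	case "prometheus":
		return []byte("# no samples\n"), "text/plain; version=0.0.4", nil
	default:
		return nil, "", fmt.Errorf("unsupported metrics format: %s", format)
	}
}

// probe sends a GET request to h and returns the status code and body.
func probe(h http.Handler, target string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, target, nil))
	return rec.Code, rec.Body.String()
}

func main() {
	mux := http.NewServeMux()
	mux.Handle("/metrics", metricsHandler(stubGather))

	for _, target := range []string{"/metrics", "/metrics?format=prometheus", "/metrics?format=xml"} {
		code, body := probe(mux, target)
		fmt.Printf("%s -> %d %q\n", target, code, body)
	}
}
```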

Application developers may track counters, gauges, summaries, and key/value metrics. Modules need
no additional setup to record profiling metrics. Doing so is as simple as:

```go
func (k BaseKeeper) MintCoins(ctx sdk.Context, moduleName string, amt sdk.Coins) error {
  // go-metrics takes the key as a slice of parts, followed by the start time.
  defer metrics.MeasureSince([]string{"MintCoins"}, time.Now())
  // ...
}
```
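
The one-line pattern works because Go evaluates the arguments of a deferred call when the `defer` statement executes, while the call itself runs when the function returns. The standalone sketch below demonstrates this with the standard library only. The `recorder` type and the `mintCoins` function are hypothetical stand-ins for the metrics library and the keeper method, and the sleep simulates the work being timed.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// recorder is a hypothetical in-memory stand-in for a metrics sink.
type recorder struct {
	mu      sync.Mutex
	samples map[string][]time.Duration
}

func newRecorder() *recorder {
	return &recorder{samples: make(map[string][]time.Duration)}
}

// MeasureSince records the time elapsed since start under key.
func (r *recorder) MeasureSince(key string, start time.Time) {
	elapsed := time.Since(start)

	r.mu.Lock()
	defer r.mu.Unlock()
	r.samples[key] = append(r.samples[key], elapsed)
}

// mintCoins simulates a keeper method that takes `work` to complete.
func mintCoins(r *recorder, work time.Duration) {
	// time.Now() is evaluated here, on entry. MeasureSince runs on return,
	// so the recorded duration covers the whole function body.
	defer r.MeasureSince("MintCoins", time.Now())

	time.Sleep(work)
}

func main() {
	r := newRecorder()
	mintCoins(r, 10*time.Millisecond)

	for key, durations := range r.samples {
		fmt.Printf("%s: %d sample(s), first = %s\n", key, len(durations), durations[0])
	}
}
```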

## Consequences

### Positive

* Visibility into the performance and behavior of an application

### Negative

### Neutral

## References