# ADR 013: Observability

## Changelog

* 20-01-2020: Initial Draft

## Status

Proposed

## Context

Telemetry is paramount to debugging and understanding what the application is doing and how it is
performing. We aim to expose metrics from modules and other core parts of the Cosmos SDK.

In addition, we should aim to support multiple configurable sinks that an operator may choose from.
By default, when telemetry is enabled, the application should track and expose metrics that are
stored in-memory. The operator may choose to enable additional sinks, where we support only
[Prometheus](https://prometheus.io/) for now, as it's battle-tested, simple to set up, open source,
and has a rich ecosystem of tooling.

We must also aim to integrate metrics into the Cosmos SDK in the most seamless way possible such that
metrics may be added or removed at will and without much friction. To do this, we will use the
[go-metrics](https://github.com/hashicorp/go-metrics) library.

Finally, operators may enable telemetry along with specific configuration options. If enabled, metrics
will be exposed at `/metrics?format={text|prometheus}` via the API server.

## Decision

We will add an additional configuration block to `app.toml` that defines telemetry settings:

```toml
###############################################################################
###                         Telemetry Configuration                         ###
###############################################################################

[telemetry]

# Prefixed with keys to separate services
service-name = {{ .Telemetry.ServiceName }}

# Enabled enables the application telemetry functionality. When enabled,
# an in-memory sink is also enabled by default. Operators may also enable
# other sinks such as Prometheus.
enabled = {{ .Telemetry.Enabled }}

# Enable prefixing gauge values with hostname
enable-hostname = {{ .Telemetry.EnableHostname }}

# Enable adding hostname to labels
enable-hostname-label = {{ .Telemetry.EnableHostnameLabel }}

# Enable adding service to labels
enable-service-label = {{ .Telemetry.EnableServiceLabel }}

# PrometheusRetentionTime, when positive, enables a Prometheus metrics sink.
prometheus-retention-time = {{ .Telemetry.PrometheusRetentionTime }}
```

The given configuration allows for two sinks: in-memory and Prometheus. We create a `Metrics`
type that performs all the bootstrapping for the operator, so capturing metrics becomes seamless.

```go
// Metrics defines a wrapper around application telemetry functionality. It allows
// metrics to be gathered at any point in time. When creating a Metrics object,
// internally, a global metrics instance is registered with a set of sinks as
// configured by the operator. In addition to the sinks, when a process gets a
// SIGUSR1, a dump of formatted recent metrics will be sent to STDERR.
type Metrics struct {
	memSink           *metrics.InmemSink
	prometheusEnabled bool
}

// Gather collects all registered metrics and returns a GatherResponse where the
// metrics are encoded depending on the type. Metrics are either encoded via
// Prometheus or JSON if in-memory.
func (m *Metrics) Gather(format string) (GatherResponse, error) {
	switch format {
	case FormatPrometheus:
		return m.gatherPrometheus()

	case FormatText:
		return m.gatherGeneric()

	case FormatDefault:
		return m.gatherGeneric()

	default:
		return GatherResponse{}, fmt.Errorf("unsupported metrics format: %s", format)
	}
}
```

In addition, `Metrics` allows us to gather the current set of metrics at any given point in time. An
operator may also choose to send a signal, SIGUSR1, to dump and print formatted metrics to STDERR.
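
The format dispatch performed by `Gather` can be illustrated with a minimal, standard-library-only sketch. It is not the SDK's implementation: `counterStore` is a hypothetical stand-in for the in-memory sink, the string values of the `Format*` constants and the shape of `GatherResponse` are assumptions inferred from how they are used above, and the Prometheus branch only emits bare `name value` lines rather than going through a real Prometheus sink.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// Format names mirror the constants referenced by Gather in this ADR.
// The string values are assumptions made for this sketch.
const (
	FormatDefault    = ""
	FormatPrometheus = "prometheus"
	FormatText       = "text"
)

// GatherResponse pairs encoded metrics with the content type an HTTP handler
// should send. The field names follow their use in the metrics handler.
type GatherResponse struct {
	Metrics     []byte
	ContentType string
}

// counterStore is a hypothetical stand-in for the in-memory sink.
type counterStore map[string]float64

// gatherGeneric encodes the counters as JSON (map keys are emitted sorted).
func (c counterStore) gatherGeneric() (GatherResponse, error) {
	bz, err := json.Marshal(c)
	if err != nil {
		return GatherResponse{}, err
	}
	return GatherResponse{Metrics: bz, ContentType: "application/json"}, nil
}

// gatherPrometheus writes one "name value" line per counter, sorted by name.
func (c counterStore) gatherPrometheus() (GatherResponse, error) {
	names := make([]string, 0, len(c))
	for name := range c {
		names = append(names, name)
	}
	sort.Strings(names)

	var sb strings.Builder
	for _, name := range names {
		fmt.Fprintf(&sb, "%s %g\n", name, c[name])
	}
	return GatherResponse{
		Metrics:     []byte(sb.String()),
		ContentType: "text/plain; version=0.0.4",
	}, nil
}

// Gather selects an encoder based on the requested format and rejects
// anything it does not recognize.
func (c counterStore) Gather(format string) (GatherResponse, error) {
	switch format {
	case FormatPrometheus:
		return c.gatherPrometheus()
	case FormatText, FormatDefault:
		return c.gatherGeneric()
	default:
		return GatherResponse{}, fmt.Errorf("unsupported metrics format: %s", format)
	}
}

func main() {
	store := counterStore{"tx_count": 3, "mint_coins_calls": 1}
	for _, format := range []string{FormatText, FormatPrometheus, "xml"} {
		gr, err := store.Gather(format)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s\n%s\n", gr.ContentType, gr.Metrics)
	}
}
```

Returning the content type alongside the payload keeps the HTTP handler free of format-specific logic: it only has to copy `ContentType` into the response header and write `Metrics`. An unrecognized format such as `xml` yields an error, which a handler can map to a `400 Bad Request`.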

During an application's bootstrapping and construction phase, if `Telemetry.Enabled` is `true`, the
API server will create a `Metrics` instance, keep a reference to it, and register a metrics
handler accordingly.

```go
func (s *Server) Start(cfg config.Config) error {
	// ...

	if cfg.Telemetry.Enabled {
		m, err := telemetry.New(cfg.Telemetry)
		if err != nil {
			return err
		}

		s.metrics = m
		s.registerMetrics()
	}

	// ...
}

func (s *Server) registerMetrics() {
	metricsHandler := func(w http.ResponseWriter, r *http.Request) {
		format := strings.TrimSpace(r.FormValue("format"))

		gr, err := s.metrics.Gather(format)
		if err != nil {
			rest.WriteErrorResponse(w, http.StatusBadRequest, fmt.Sprintf("failed to gather metrics: %s", err))
			return
		}

		w.Header().Set("Content-Type", gr.ContentType)
		_, _ = w.Write(gr.Metrics)
	}

	s.Router.HandleFunc("/metrics", metricsHandler).Methods("GET")
}
```

Application developers may track counters, gauges, summaries, and key/value metrics. Modules need no
additional work to leverage profiling metrics. Doing so is as simple as:

```go
func (k BaseKeeper) MintCoins(ctx sdk.Context, moduleName string, amt sdk.Coins) error {
	defer metrics.MeasureSince(time.Now(), "MintCoins")
	// ...
}
```

## Consequences

### Positive

* Exposure into the performance and behavior of an application

### Negative

### Neutral

## References