---
layout: post
title: METRICS
permalink: /docs/metrics
redirect_from:
 - /metrics.md/
 - /docs/metrics.md/
---

## Introduction

AIStore tracks, logs, and reports a large and growing number of counters, latencies, and throughputs. These include (but are not limited to) metrics that reflect cluster recovery and global rebalancing, all [extended long-running operations](/xact/README.md), and, of course, the basic read, write, and list transactions.

Metrics can be viewed via any of the following:

1. System logs
2. [CLI](/docs/cli.md) and, in particular, the [`ais show performance`](/docs/cli/performance.md) command
3. [Prometheus](/docs/prometheus.md)
4. Any [StatsD](https://github.com/etsy/statsd)-compliant [backend](https://github.com/statsd/statsd/blob/master/docs/backend.md#supported-backends), including Graphite/Grafana

> For general information on AIS metrics, see [Statistics, Collected Metrics, Visualization](/docs/metrics.md).

> AIStore includes `aisloader` - a powerful tool that we use to simulate a variety of AI workloads. For numerous command-line options and usage examples, please see [`aisloader`](/docs/aisloader.md) and [How To Benchmark AIStore](/docs/howto_benchmark.md).

> Or, just run `make aisloader; aisloader` and see its detailed online help. Note as well that `aisloader` is fully StatsD-enabled and supports detailed protocol-level tracing that can be switched on and off at runtime.


## Table of Contents
- [StatsD and Prometheus](#statsd-and-prometheus)
- [Conventions](#conventions)
- [Proxy metrics: IO counters](#proxy-metrics-io-counters)
- [Proxy metrics: error counters](#proxy-metrics-error-counters)
- [Proxy metrics: latencies](#proxy-metrics-latencies)
- [Target metrics](#target-metrics)
- [AIS loader metrics](#ais-loader-metrics)
- [Debug-Mode Observability](#debug-mode-observability)

## StatsD and Prometheus

AIStore generates a growing number of detailed performance metrics. Other than AIS logs, the stats can be viewed via:

* StatsD/Grafana visualization
or
* Prometheus visualization

> [StatsD](https://github.com/etsy/statsd) publishes local statistics to a compliant backend service (e.g., [Graphite](https://graphite.readthedocs.io/en/latest/)) for easy and powerful stats aggregation and visualization.

> AIStore is a fully compliant [Prometheus exporter](https://prometheus.io/docs/instrumenting/writing_exporters/) that natively supports [Prometheus](https://prometheus.io/) stats collection. There's no special configuration - the only thing required to enable the corresponding integration is letting AIStore know whether to publish its stats via StatsD **or** Prometheus.

The StatsD/Grafana option imposes one easy-to-meet requirement on the AIStore deployment: a StatsD daemon (aka service) must be **deployed locally with each AIS target and with each AIS proxy**.

At startup, AIStore daemons (both targets and gateways) try to UDP-ping their respective local [StatsD](https://github.com/etsy/statsd) daemons on UDP port `8125`, unless the port is redefined via the environment variable `AIS_STATSD_PORT`. You can disable StatsD reachability probing by setting another environment variable - `AIS_STATSD_PROBE` - to `false` or `no`.

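
The startup behavior described above can be sketched roughly as follows. This is an illustration only and not AIStore's actual probing code: the function names, the loopback address, the timeout, and the case-insensitive handling of `false`/`no` are all assumptions. Note also that UDP is connectionless, so the sketch can only detect a refused port (when the OS reports it) and treats silence as "possibly reachable".

```python
import os
import socket

DEFAULT_STATSD_PORT = 8125  # default local StatsD port named in the text


def statsd_port(env=os.environ):
    """Return the StatsD UDP port, honoring AIS_STATSD_PORT when set."""
    return int(env.get("AIS_STATSD_PORT", DEFAULT_STATSD_PORT))


def probe_enabled(env=os.environ):
    """Probing is on unless AIS_STATSD_PROBE is `false` or `no`.

    Assumption: the comparison ignores case and surrounding whitespace.
    """
    value = env.get("AIS_STATSD_PROBE", "").strip().lower()
    return value not in ("false", "no")


def probe_statsd(host="127.0.0.1", port=DEFAULT_STATSD_PORT, timeout=0.2):
    """Hypothetical UDP probe: send an empty datagram and wait briefly.

    Returns False if the OS reports an error (e.g., port unreachable),
    True if nothing comes back within the timeout.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.connect((host, port))
            sock.send(b"")  # empty datagram, carries no metric
            sock.recv(1)
        except socket.timeout:
            return True  # no error reported: assume something may be listening
        except OSError:
            return False  # e.g., ConnectionRefusedError on Linux
        return True
```

A caller would typically check `probe_enabled()` first and, if it returns `True`, call `probe_statsd(port=statsd_port())` before deciding whether to publish stats.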

If a StatsD server is *not* listening on local port 8125, the local AIS target (or proxy) will run without StatsD, and the corresponding stats will be neither captured nor visualized.

> For details on all StatsD-supported backends, please refer to [this document](https://github.com/etsy/statsd/blob/master/docs/backend.md).

> For Prometheus integration, please refer to [this separate document](/docs/prometheus.md).

## Conventions

All AIS metric names (or simply, metrics) are logged and reported to StatsD/Grafana using the following naming pattern:

`prefix.bucket.metric_name.metric_value|metric_type`, where `prefix` is one of:

* `aisproxy.<daemon_id>`
* `aistarget.<daemon_id>`
or
* `aisloader.<hostname>-<id>`

and `metric_type` is `ms` for time duration, `c` for a counter, and `g` for a gauge.

More precisely, AIS metrics are named and grouped as follows:

### Proxy metrics: IO counters

All collected/tracked *counters* are 64-bit cumulative integers that increment with each event they track.

| Name | Comment |
| --- | --- |
| `aisproxy.<daemon_id>.get` | number of GET-object requests |
| `aisproxy.<daemon_id>.put` | number of PUT-object requests |
| `aisproxy.<daemon_id>.del` | number of DELETE-object requests |
| `aisproxy.<daemon_id>.lst` | number of LIST-objects requests |
| `aisproxy.<daemon_id>.ren` | ... RENAME ... |
| `aisproxy.<daemon_id>.pst` | ... POST ... |

### Proxy metrics: error counters

| Name | Comment |
| --- | --- |
| `aisproxy.<daemon_id>.err` | Total number of errors |
| `aisproxy.<daemon_id>.err.get` | Number of GET-object errors |
| `aisproxy.<daemon_id>.err.put` | Number of PUT-object errors |
| `aisproxy.<daemon_id>.err.head` | Number of HEAD-object errors |
| `aisproxy.<daemon_id>.err.delete` | Number of DELETE-object errors |
| `aisproxy.<daemon_id>.err.list` | Number of LIST-objects errors |
| `aisproxy.<daemon_id>.err.range` | ... RANGE ... |
| `aisproxy.<daemon_id>.err.post` | ... POST ... |

> For the most recently updated list of counters, please refer to [the source](/stats/common_stats.go).

### Proxy metrics: latencies

All request latencies are reported to **StatsD/Grafana in milliseconds**.

> **Note**: each `aisnode` (proxy and target) periodically logs the same latencies in microseconds at a configurable logging interval (default = 10s).

> Generally, AIStore logs can be considered a redundant source of information on system performance - information that can be used either in addition to Graphite/Grafana or when the latter is not deployed or configured.

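
To relate the two units mentioned above, here is a minimal sketch that parses a single StatsD sample and converts a millisecond timing to microseconds for comparison with the logs. It assumes the standard StatsD datagram format `<name>:<value>|<type>`; the helper names and the daemon ID `p1` are hypothetical and are not taken from AIStore.

```python
def parse_statsd_line(line):
    """Split one StatsD sample into (name, value, metric_type).

    Assumes the standard `<name>:<value>|<type>` layout; any trailing
    fields (such as a sample rate) are ignored.
    """
    name, _, rest = line.rpartition(":")
    fields = rest.split("|")
    if not name or len(fields) < 2:
        raise ValueError(f"not a StatsD sample: {line!r}")
    return name, float(fields[0]), fields[1]


def ms_to_us(milliseconds):
    """Convert a StatsD timing (ms) to the microseconds used in AIS logs."""
    return milliseconds * 1000.0


# Hypothetical sample: a 12 ms GET latency reported by proxy `p1`.
name, value, metric_type = parse_statsd_line("aisproxy.p1.get:12|ms")
if metric_type == "ms":
    print(f"{name}: {value} ms = {ms_to_us(value)} us")
```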

| Name | Comment |
| --- | --- |
| `aisproxy.<daemon_id>.get` | GET-object latency |
| `aisproxy.<daemon_id>.lst` | LIST-objects latency |
| `aisproxy.<daemon_id>.kalive` | Keep-Alive (roundtrip) latency |

### Target metrics

AIS target metrics include **all** of the proxy metrics (see above), plus the following:

| Name | Comment |
| --- | --- |
| `aistarget.<daemon_id>.get.cold` | number of cold-GET object requests |
| `aistarget.<daemon_id>.get.cold.size` | cold GET cumulative size (in bytes) |
| `aistarget.<daemon_id>.lru.evict` | number of LRU-evicted objects |
| `aistarget.<daemon_id>.tx` | number of objects sent by the target |
| `aistarget.<daemon_id>.tx.size` | cumulative size (in bytes) of all transmitted objects |
| `aistarget.<daemon_id>.rx` | number of objects received by the target |
| `aistarget.<daemon_id>.rx.size` | cumulative size (in bytes) of all the received objects |

> For the most recently updated list of counters, please refer to [the source](/stats/target_stats.go)

### AIS loader metrics

AIS loader generates metrics for 3 (three) types of requests:

* GET (object) - metric names are prefixed with `aisloader.<ip>.<loader_id>.get.`
* PUT (object) - metric names start with `aisloader.<ip>.<loader_id>.put.`
* Read cluster configuration - the prefix includes `aisloader.<ip>.<loader_id>.getconfig.`

All latency metrics are in milliseconds, all sizes are always in bytes.

#### GET object

> **Note**: in the tables below, traced intervals of time are denoted as **(from time, to time)**, respectively.


| Name | Comment |
| --- | --- |
| `aisloader.<hostname>-<id>.get.pending.<value>` | number of unfinished GET requests waiting in a queue (updated after every completed request) |
| `aisloader.<hostname>-<id>.get.count.1` | total number of requests |
| `aisloader.<hostname>-<id>.get.error.1` | total number of failed requests |
| `aisloader.<hostname>-<id>.get.throughput.<value>` | total size of received objects |
| `aisloader.<hostname>-<id>.get.latency.<value>` | request latency = (request initialized, data transfer successfully completes) |
| `aisloader.<hostname>-<id>.get.latency.proxyconn.<value>` | (request started, connected to a proxy) |
| `aisloader.<hostname>-<id>.get.latency.proxy.<value>` | (connected to proxy, proxy redirected) |
| `aisloader.<hostname>-<id>.get.latency.targetconn.<value>` | (proxy redirected, connected to target) |
| `aisloader.<hostname>-<id>.get.latency.target.<value>` | (connected to target, target responded) |
| `aisloader.<hostname>-<id>.get.latency.posthttp.<value>` | (target responded, data transfer completed) |
| `aisloader.<hostname>-<id>.get.latency.proxyheader.<value>` | (proxy makes a connection, proxy finishes writing headers to the connection) |
| `aisloader.<hostname>-<id>.get.latency.proxyrequest.<value>` | (proxy finishes writing headers, proxy completes writing request to the connection) |
| `aisloader.<hostname>-<id>.get.latency.proxyresponse.<value>` | (proxy finishes writing request to a connection, proxy gets the first bytes of the response) |
| `aisloader.<hostname>-<id>.get.latency.targetheader.<value>` | (target makes a connection, target finishes writing headers to the connection) |
| `aisloader.<hostname>-<id>.get.latency.targetrequest.<value>` | (target finishes writing headers, target completes writing request to the connection) |
| `aisloader.<hostname>-<id>.get.latency.targetresponse.<value>` | (target finishes writing request, proxy gets the first bytes of the response) |

#### PUT object

> **Note**: in the table, traced intervals of time are denoted as **(from time, to time)**:

| Name | Comment |
| --- | --- |
| `aisloader.<hostname>-<id>.put.pending.<value>` | number of unfinished PUT requests waiting in a queue (updated after every completed request) |
| `aisloader.<hostname>-<id>.put.count.1` | total number of requests |
| `aisloader.<hostname>-<id>.put.error.1` | total number of failed requests |
| `aisloader.<hostname>-<id>.put.throughput.<value>` | total size of objects PUT into a bucket |
| `aisloader.<hostname>-<id>.put.latency.<value>` | request latency = (request initialized, data transfer successfully completes) |
| `aisloader.<hostname>-<id>.put.latency.proxyconn.<value>` | (request started, connected to proxy) |
| `aisloader.<hostname>-<id>.put.latency.proxy.<value>` | (connected to proxy, proxy redirected) |
| `aisloader.<hostname>-<id>.put.latency.targetconn.<value>` | (proxy redirected, connected to target) |
| `aisloader.<hostname>-<id>.put.latency.target.<value>` | (connected to target, target responded) |
| `aisloader.<hostname>-<id>.put.latency.posthttp.<value>` | (target responded, data transfer completed) |
| `aisloader.<hostname>-<id>.put.latency.proxyheader.<value>` | (proxy makes a connection, proxy finishes writing headers) |
| `aisloader.<hostname>-<id>.put.latency.proxyrequest.<value>` | (proxy finishes writing headers, proxy completes writing request) |
| `aisloader.<hostname>-<id>.put.latency.proxyresponse.<value>` | (proxy finishes writing request, proxy gets the first bytes of the response) |
| `aisloader.<hostname>-<id>.put.latency.targetheader.<value>` | (target makes a connection, target finishes writing headers) |
| `aisloader.<hostname>-<id>.put.latency.targetrequest.<value>` | (target finishes writing headers, target completes writing request) |
| `aisloader.<hostname>-<id>.put.latency.targetresponse.<value>` | (target finishes writing request, proxy gets the first bytes of the response) |

#### Read cluster configuration

> **Note**: traced intervals of time are denoted as **(from time, to time)**:

| Name | Comment |
| --- | --- |
| `aisloader.<hostname>-<id>.getconfig.count.1` | total number of requests to read cluster settings |
| `aisloader.<hostname>-<id>.getconfig.latency.<value>` | request latency = (read configuration request started, configuration received) |
| `aisloader.<hostname>-<id>.getconfig.latency.proxyconn.<value>` | (read configuration request started, connection to a proxy is made) |
| `aisloader.<hostname>-<id>.getconfig.latency.proxy.<value>` | (connection to a proxy is made, proxy redirected the request) |

A somewhat outdated example of how these metrics show up in the Grafana dashboard follows:

![AIS loader metrics](images/aisloader-statsd-grafana.png)

## Debug-Mode Observability

For development and, more generally, for any non-production deployment, AIS supports [building with debug](/Makefile), for instance:

```sh
$ MODE=debug make deploy
```

As usual, debug builds incorporate more runtime checks and extra logging. In addition, an AIS debug build provides a special **API endpoint** at `hostname:port/debug/vars` that can be accessed (via browser or `curl`) at any time to display the current values of:

* all stats counters (including error counters)
* all latencies including keepalive
* mountpath capacities
* mountpath (disk) utilizations
* total number of goroutines
* memory stats

and more.

> Notation `hostname:port` stands for the TCP endpoint of *any* deployed AIS node, gateway or storage target.


Example output:

```console
$ curl hostname:port/debug/vars
{
    "ais.ios": {"/ais/mp1:util%": 20, "/ais/mp2:util%": 23, "/ais/mp3:util%": 22, "/ais/mp4:util%": 25},
    "ais.stats": {"kalive.ns": 735065, "lst.n": 45, "lst.ns": 2892015, "num-goroutines": 27, "put.n": 1762, "put.ns": 1141380, "put.redir.ns": 16596465, "up.ns.time": 30012389406},
    "cmdline": ["/bin/aisnode","-config=.ais/ais.json","-local_config=.ais/ais_local.json","-role=target"],
    "memstats": {"Alloc":43209256,"TotalAlloc":57770120,"Sys":75056128,"Lookups":0,"Mallocs":215893,"Frees":103090,"HeapAlloc":43209256, ...}
    ...
}
```
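
For scripted inspection, the `ais.stats` object from the output above can be grouped by key suffix. The following is a minimal sketch under stated assumptions: the helper name `split_stats` is hypothetical, the sample is copied from the example output, and reading `.n` as a count and `.ns` as nanoseconds is inferred from the key names rather than from AIStore documentation. The sketch does not say whether a `.ns` value is an average or a cumulative total.

```python
import json

# Sample copied from the `ais.stats` part of the example output above.
SAMPLE = """
{"ais.stats": {"kalive.ns": 735065, "lst.n": 45, "lst.ns": 2892015,
               "num-goroutines": 27, "put.n": 1762, "put.ns": 1141380,
               "put.redir.ns": 16596465, "up.ns.time": 30012389406}}
"""


def split_stats(stats):
    """Group stats by suffix: `.n` -> counters, `.ns` -> durations in ms.

    Assumption: the suffixes mean "count" and "nanoseconds"; keys with
    neither suffix are returned unchanged in `other`.
    """
    counters, durations_ms, other = {}, {}, {}
    for key, value in stats.items():
        if key.endswith(".ns"):
            durations_ms[key[: -len(".ns")]] = value / 1e6  # ns -> ms
        elif key.endswith(".n"):
            counters[key[: -len(".n")]] = value
        else:
            other[key] = value
    return counters, durations_ms, other


counters, durations_ms, other = split_stats(json.loads(SAMPLE)["ais.stats"])
for name in sorted(durations_ms):
    print(f"{name}: {durations_ms[name]:.3f} ms")
```

Against a live debug build, the same function could be fed the JSON body returned by `hostname:port/debug/vars`, for example after fetching it with `urllib.request.urlopen`.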