github.com/NVIDIA/aistore@v1.3.23-0.20240517131212-7df6609be51d/docs/metrics.md

     1  ---
     2  layout: post
     3  title: METRICS
     4  permalink: /docs/metrics
     5  redirect_from:
     6   - /metrics.md/
     7   - /docs/metrics.md/
     8  ---
     9  
    10  ## Introduction
    11  
    12  AIStore tracks, logs, and reports a large and growing number of counters, latencies and throughputs including (but not limited to) metrics that reflect cluster recovery and global rebalancing, all [extended long-running operations](/xact/README.md), and, of course, the basic read, write, list transactions, and more.
    13  
    14  These metrics can be viewed through any of the following:
    15  
    16  1. System logs
    17  2. [CLI](/docs/cli.md) and, in particular, the [`ais show performance`](/docs/cli/performance.md) command
    18  3. [Prometheus](/docs/prometheus.md)
    19  4. Any [StatsD](https://github.com/etsy/statsd) compliant [backend](https://github.com/statsd/statsd/blob/master/docs/backend.md#supported-backends) including Graphite/Grafana
    20  
    21  > For general information on AIS metrics, see [Statistics, Collected Metrics, Visualization](/docs/metrics.md).
    22  
    23  > AIStore includes `aisloader` - a powerful tool that we use to simulate a variety of AI workloads. For numerous command-line options and usage examples, please see [`aisloader`](/docs/aisloader.md) and [How To Benchmark AIStore](/docs/howto_benchmark.md).
    24  
    25  > Or, just run `make aisloader; aisloader` and see its detailed online help. Note as well that `aisloader` is fully StatsD-enabled and supports detailed protocol-level tracing with runtime on and off switching.
    26  
    27  ## Table of Contents
    28  - [StatsD and Prometheus](#statsd-and-prometheus)
    29  - [Conventions](#conventions)
    30    - [Proxy metrics: IO counters](#proxy-metrics-io-counters)
    31    - [Proxy metrics: error counters](#proxy-metrics-error-counters)
    32    - [Proxy metrics: latencies](#proxy-metrics-latencies)
    33    - [Target metrics](#target-metrics)
    34    - [AIS loader metrics](#ais-loader-metrics)
    35  - [Debug-Mode Observability](#debug-mode-observability)
    36  
    37  ## StatsD and Prometheus
    38  
    39  AIStore generates a growing number of detailed performance metrics. Other than AIS logs, the stats can be viewed via:
    40  
    41  * StatsD/Grafana visualization
    42  or
    43  * Prometheus visualization
    44  
    45  > [StatsD](https://github.com/etsy/statsd) publishes local statistics to a compliant backend service (e.g., [Graphite](https://graphite.readthedocs.io/en/latest/)) for easy and powerful stats aggregation and visualization.
    46  
    47  > AIStore is a fully compliant [Prometheus exporter](https://prometheus.io/docs/instrumenting/writing_exporters/) that natively supports [Prometheus](https://prometheus.io/) stats collection. There's no special configuration - the only thing required to enable the corresponding integration is letting AIStore know whether to publish its stats via StatsD **or** Prometheus.
    48  
    49  The StatsD/Grafana option imposes one easy-to-meet requirement on the AIStore deployment: a StatsD daemon (aka service) must be **deployed locally with each AIS target and with each AIS proxy**.
    50  
    51  At startup, AIStore daemons (both targets and gateways) try to UDP-ping their respective local [StatsD](https://github.com/etsy/statsd) daemons on UDP port `8125`, unless the port is redefined via the environment variable `AIS_STATSD_PORT`. You can disable StatsD reachability probing by setting another environment variable, `AIS_STATSD_PROBE`, to `false` or `no`.
    52  
    53  If a StatsD server is *not* listening on the local port 8125, the local AIS target (or proxy) runs without StatsD, and the corresponding stats are neither captured nor visualized.
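
The environment handling described above can be sketched as follows. This is only an illustration of the documented behavior, not AIStore's actual startup code: the function names, the `example.test` metric, and the case-insensitive matching of `false`/`no` are assumptions.

```python
import os
import socket

DEFAULT_STATSD_PORT = 8125  # default port named in the text above


def statsd_port(env=os.environ):
    """Return the local StatsD UDP port, honoring AIS_STATSD_PORT if set."""
    return int(env.get("AIS_STATSD_PORT", DEFAULT_STATSD_PORT))


def probe_enabled(env=os.environ):
    """Probing stays on unless AIS_STATSD_PROBE is 'false' or 'no'.

    Case-insensitive matching is an assumption of this sketch.
    """
    return env.get("AIS_STATSD_PROBE", "").strip().lower() not in ("false", "no")


def send_test_metric(name="example.test", value=1, env=os.environ):
    """Send one counter to the local StatsD daemon (hypothetical metric name).

    UDP is connectionless, so a successful send does not prove that
    a daemon is actually listening on the port.
    """
    payload = f"{name}:{value}|c".encode()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(payload, ("127.0.0.1", statsd_port(env)))
```

Passing a plain dictionary instead of `os.environ` makes it easy to check a deployment's settings without touching the real environment.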
    54  
    55  > For details on all StatsD-supported backends, please refer to [this document](https://github.com/etsy/statsd/blob/master/docs/backend.md).
    56  
    57  > For Prometheus integration, please refer to [this separate document](/docs/prometheus.md).
    58  
    59  ## Conventions
    60  
    61  All AIS metric names (or simply, metrics) are logged and reported to StatsD/Grafana using the following naming pattern:
    62  
    63  `prefix.bucket.metric_name.metric_value|metric_type`, where `prefix` is one of:
    64  
    65  * `aisproxy.<daemon_id>`
    66  * `aistarget.<daemon_id>`
    67  or
    68  * `aisloader.<hostname>-<id>`
    69  
    70  and `metric_type` is `ms` for time duration, `c` for a counter, and `g` for a gauge.
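
As a rough illustration of this convention, the sketch below splits a metric name into its prefix parts and maps the type codes to their meaning. The helper names are hypothetical, and the sketch assumes that neither `<daemon_id>` nor `<hostname>-<id>` contains a dot.

```python
# Type codes and their meaning, as listed in the text above.
METRIC_TYPES = {"ms": "time duration", "c": "counter", "g": "gauge"}

# The three documented prefixes.
KNOWN_ROLES = ("aisproxy", "aistarget", "aisloader")


def split_metric_name(full_name):
    """Split '<role>.<id>.<metric>' into (role, node_id, metric).

    Assumes the ID part contains no dots (an assumption of this sketch).
    """
    role, node_id, metric = full_name.split(".", 2)
    if role not in KNOWN_ROLES:
        raise ValueError(f"unknown prefix: {role!r}")
    return role, node_id, metric


def describe_type(code):
    """Translate a metric type code ('ms', 'c', 'g') to its description."""
    return METRIC_TYPES[code]
```

For example, `split_metric_name("aistarget.t1.get.cold.size")` separates the node role and ID from the `get.cold.size` metric, which is convenient when grouping series per node in a dashboard.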
    71  
    72  More precisely, AIS metrics are named and grouped as follows:
    73  
    74  ### Proxy metrics: IO counters
    75  
    76  All collected/tracked *counters* are 64-bit cumulative integers that continuously increment with each event that they (respectively) track.
    77  
    78  | Name | Comment |
    79  | --- | --- |
    80  | `aisproxy.<daemon_id>.get` | number of GET-object requests |
    81  | `aisproxy.<daemon_id>.put` | number of PUT-object requests |
    82  | `aisproxy.<daemon_id>.del` | number of DELETE-object requests |
    83  | `aisproxy.<daemon_id>.lst` | number of LIST-objects requests |
    84  | `aisproxy.<daemon_id>.ren` | ... RENAME ... |
    85  | `aisproxy.<daemon_id>.pst` | ... POST ... |
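
Because these counters are cumulative, a per-second rate has to be derived from two successive samples. A minimal sketch, assuming samples taken a known number of seconds apart and assuming that a counter which goes backwards has restarted from zero (for instance, after a node restart):

```python
def counter_rate(prev_value, curr_value, interval_seconds):
    """Events per second between two samples of a cumulative counter."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = curr_value - prev_value
    if delta < 0:
        # Assumption: the counter was reset to zero between the samples,
        # so everything counted so far happened within this interval.
        delta = curr_value
    return delta / interval_seconds
```

With two readings of the `get` counter, 1000 and then 1600 ten seconds later, `counter_rate(1000, 1600, 10)` gives 60 GET requests per second.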
    86  
    87  ### Proxy metrics: error counters
    88  
    89  | Name | Comment |
    90  | --- | --- |
    91  | `aisproxy.<daemon_id>.err` | Total number of errors |
    92  | `aisproxy.<daemon_id>.err.get` | Number of GET-object errors |
    93  | `aisproxy.<daemon_id>.err.put` | Number of PUT-object errors |
    94  | `aisproxy.<daemon_id>.err.head` | Number of HEAD-object errors |
    95  | `aisproxy.<daemon_id>.err.delete` | Number of DELETE-object errors |
    96  | `aisproxy.<daemon_id>.err.list` | Number of LIST-objects errors |
    97  | `aisproxy.<daemon_id>.err.range` | ... RANGE ... |
    98  | `aisproxy.<daemon_id>.err.post` | ... POST ... |
    99  
   100  > For the most recently updated list of counters, please refer to [the source](/stats/common_stats.go).
   101  
   102  ### Proxy metrics: latencies
   103  
   104  All request latencies are reported to **StatsD/Grafana in milliseconds**.
   105  
   106  > **Note**: each `aisnode` (proxy and target) periodically logs the same latencies in microseconds with a (configurable) logging interval (default = 10s).
   107  
   108  > Generally, AIStore logs can be considered a redundant source of information on system performance - the information that can be used either in addition to Graphite/Grafana or when the latter is not deployed or configured.
   109  
   110  | Name | Comment |
   111  | --- | --- |
   112  | `aisproxy.<daemon_id>.get` | GET-object latency |
   113  | `aisproxy.<daemon_id>.lst` | LIST-objects latency |
   114  | `aisproxy.<daemon_id>.kalive` | Keep-Alive (roundtrip) latency |
   115  
   116  ### Target metrics
   117  
   118  AIS target metrics include **all** of the proxy metrics (see above), plus the following:
   119  
   120  | Name | Comment |
   121  | --- | --- |
   122  | `aistarget.<daemon_id>.get.cold` | number of cold-GET object requests |
   123  | `aistarget.<daemon_id>.get.cold.size` | cold GET cumulative size (in bytes) |
   124  | `aistarget.<daemon_id>.lru.evict` | number of LRU-evicted objects |
   125  | `aistarget.<daemon_id>.tx` | number of objects sent by the target |
   126  | `aistarget.<daemon_id>.tx.size` | cumulative size (in bytes) of all transmitted objects |
   127  | `aistarget.<daemon_id>.rx` | number of objects received by the target |
   128  | `aistarget.<daemon_id>.rx.size` | cumulative size (in bytes) of all the received objects |
   129  
   130  > For the most recently updated list of counters, please refer to [the source](/stats/target_stats.go).
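
Several of the target metrics come in count/size pairs (`get.cold` and `get.cold.size`, `tx` and `tx.size`, `rx` and `rx.size`), so an average object size can be derived from each pair. The sketch below assumes a snapshot already collected into a dictionary keyed by the metric suffix; how the snapshot is obtained is outside its scope, and the function name is hypothetical.

```python
# Count/size pairs taken from the table above.
PAIRS = ("get.cold", "tx", "rx")


def average_sizes(snapshot):
    """Return the average object size in bytes for each count/size pair.

    `snapshot` maps metric suffixes (e.g. 'tx', 'tx.size') to values.
    Pairs with a missing or zero count are skipped.
    """
    result = {}
    for name in PAIRS:
        count = snapshot.get(name, 0)
        if count:
            result[name] = snapshot.get(name + ".size", 0) / count
    return result
```

Since both members of a pair are cumulative, the same calculation applied to the differences between two snapshots gives the average size over that interval only.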
   131  
   132  ### AIS loader metrics
   133  
   134  AIS loader generates metrics for 3 (three) types of requests:
   135  
   136  * GET (object) - metric names are prefixed with `aisloader.<hostname>-<id>.get.`
   137  * PUT (object) - metric names start with `aisloader.<hostname>-<id>.put.`
   138  * Read cluster configuration - the prefix includes `aisloader.<hostname>-<id>.getconfig.`
   139  
   140  All latency metrics are in milliseconds, all sizes are always in bytes.
   141  
   142  #### GET object
   143  
   144  > **Note**: in the tables below, traced intervals of time are denoted as **(from time, to time)**, respectively.
   145  
   146  | Name | Comment |
   147  | --- | --- |
   148  | `aisloader.<hostname>-<id>.get.pending.<value>` | number of unfinished GET requests waiting in a queue (updated after every completed request) |
   149  | `aisloader.<hostname>-<id>.get.count.1` | total number of requests |
   150  | `aisloader.<hostname>-<id>.get.error.1` | total number of failed requests |
   151  | `aisloader.<hostname>-<id>.get.throughput.<value>` | total size of received objects |
   152  | `aisloader.<hostname>-<id>.get.latency.<value>` | request latency = (request initialized, data transfer successfully completes) |
   153  | `aisloader.<hostname>-<id>.get.latency.proxyconn.<value>` | (request started, connected to a proxy) |
   154  | `aisloader.<hostname>-<id>.get.latency.proxy.<value>` | (connected to proxy, proxy redirected) |
   155  | `aisloader.<hostname>-<id>.get.latency.targetconn.<value>` | (proxy redirected, connected to target) |
   156  | `aisloader.<hostname>-<id>.get.latency.target.<value>` | (connected to target, target responded) |
   157  | `aisloader.<hostname>-<id>.get.latency.posthttp.<value>` | (target responded, data transfer completed) |
   158  | `aisloader.<hostname>-<id>.get.latency.proxyheader.<value>` | (proxy makes a connection, proxy finishes writing headers to the connection) |
   159  | `aisloader.<hostname>-<id>.get.latency.proxyrequest.<value>` | (proxy finishes writing headers, proxy completes writing request to the connection) |
   160  | `aisloader.<hostname>-<id>.get.latency.proxyresponse.<value>` | (proxy finishes writing request to a connection, proxy gets the first bytes of the response) |
   161  | `aisloader.<hostname>-<id>.get.latency.targetheader.<value>` | (target makes a connection, target finishes writing headers to the connection) |
   162  | `aisloader.<hostname>-<id>.get.latency.targetrequest.<value>` | (target finishes writing headers, target completes writing request to the connection) |
   163  | `aisloader.<hostname>-<id>.get.latency.targetresponse.<value>` | (target finishes writing request, proxy gets the first bytes of the response) |
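
The first five traced intervals in this table (`proxyconn`, `proxy`, `targetconn`, `target`, `posthttp`) chain one event to the next, so under that reading they add up to the overall request latency. The sketch below illustrates the arithmetic; the event names and the timestamp dictionary are hypothetical and are not part of `aisloader`.

```python
# (metric suffix, from-event, to-event), following the table above.
# The event names are hypothetical labels for this sketch.
STAGES = [
    ("proxyconn", "request_started", "proxy_connected"),
    ("proxy", "proxy_connected", "proxy_redirected"),
    ("targetconn", "proxy_redirected", "target_connected"),
    ("target", "target_connected", "target_responded"),
    ("posthttp", "target_responded", "transfer_completed"),
]


def stage_latencies(timestamps_ms):
    """Compute each (from time, to time) interval in milliseconds."""
    return {
        name: timestamps_ms[end] - timestamps_ms[start]
        for name, start, end in STAGES
    }


def total_latency(timestamps_ms):
    """Overall latency: request started to data transfer completed."""
    return timestamps_ms["transfer_completed"] - timestamps_ms["request_started"]
```

Comparing the per-stage values against the total is a quick way to see whether time is spent connecting, waiting for the proxy redirect, or transferring data.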
   164  
   165  #### PUT object
   166  
   167  > **Note**: in the table, traced intervals of time are denoted as **(from time, to time)**:
   168  
   169  | Name | Comment |
   170  | --- | --- |
   171  | `aisloader.<hostname>-<id>.put.pending.<value>` | number of unfinished PUT requests waiting in a queue (updated after every completed request) |
   172  | `aisloader.<hostname>-<id>.put.count.1` | total number of requests |
   173  | `aisloader.<hostname>-<id>.put.error.1` | total number of failed requests |
   174  | `aisloader.<hostname>-<id>.put.throughput.<value>` | total size of objects PUT into a bucket |
   175  | `aisloader.<hostname>-<id>.put.latency.<value>` | request latency = (request initialized, data transfer successfully completes) |
   176  | `aisloader.<hostname>-<id>.put.latency.proxyconn.<value>` | (request started, connected to proxy) |
   177  | `aisloader.<hostname>-<id>.put.latency.proxy.<value>` | (connected to proxy, proxy redirected) |
   178  | `aisloader.<hostname>-<id>.put.latency.targetconn.<value>` | (proxy redirected, connected to target) |
   179  | `aisloader.<hostname>-<id>.put.latency.target.<value>` | (connected to target, target responded) |
   180  | `aisloader.<hostname>-<id>.put.latency.posthttp.<value>` | (target responded, data transfer completed) |
   181  | `aisloader.<hostname>-<id>.put.latency.proxyheader.<value>` | (proxy makes a connection, proxy finishes writing headers) |
   182  | `aisloader.<hostname>-<id>.put.latency.proxyrequest.<value>` | (proxy finishes writing headers, proxy completes writing request) |
   183  | `aisloader.<hostname>-<id>.put.latency.proxyresponse.<value>` | (proxy finishes writing request, proxy gets the first bytes of the response) |
   184  | `aisloader.<hostname>-<id>.put.latency.targetheader.<value>` | (target makes a connection, target finishes writing headers) |
   185  | `aisloader.<hostname>-<id>.put.latency.targetrequest.<value>` | (target finishes writing headers, target completes writing request) |
   186  | `aisloader.<hostname>-<id>.put.latency.targetresponse.<value>` | (target finishes writing request, proxy gets the first bytes of the response) |
   187  
   188  #### Read cluster configuration
   189  
   190  > **Note**: traced intervals of time are denoted as **(from time, to time)**:
   191  
   192  | Name | Comment |
   193  | --- | --- |
   194  | `aisloader.<hostname>-<id>.getconfig.count.1` | total number of requests to read cluster settings |
   195  | `aisloader.<hostname>-<id>.getconfig.latency.<value>` | request latency = (read configuration request started, configuration received) |
   196  | `aisloader.<hostname>-<id>.getconfig.latency.proxyconn.<value>` | (read configuration request started, connection to a proxy is made) |
   197  | `aisloader.<hostname>-<id>.getconfig.latency.proxy.<value>` | (connection to a proxy is made, proxy redirected the request) |
   198  
   199  A somewhat outdated example of how these metrics show up in the Grafana dashboard follows:
   200  
   201  ![AIS loader metrics](images/aisloader-statsd-grafana.png)
   202  
   203  ## Debug-Mode Observability
   204  
   205  For development and, more generally, for any non-production deployment, AIS supports [building with debug](/Makefile), for instance:
   206  
   207  ```sh
   208  $ MODE=debug make deploy
   209  ```
   210  
   211  As usual, debug builds incorporate more runtime checks and extra logging. In addition, the AIS debug build provides a special **API endpoint** at `hostname:port/debug/vars` that can be accessed (via browser or curl) at any time to display the current values of:
   212  
   213  * all stats counters (including error counters)
   214  * all latencies including keepalive
   215  * mountpath capacities
   216  * mountpath (disk) utilizations
   217  * total number of goroutines
   218  * memory stats
   219  
   220  and more.
   221  
   222  > The notation `hostname:port` stands for the TCP endpoint of *any* deployed AIS node, gateway or storage target.
   223  
   224  Example output:
   225  
   226  ```console
   227  $ curl hostname:port/debug/vars
   228  {
   229  "ais.ios": {"/ais/mp1:util%": 20, "/ais/mp2:util%": 23, "/ais/mp3:util%": 22, "/ais/mp4:util%": 25},
   230  "ais.stats": {"kalive.ns": 735065, "lst.n": 45, "lst.ns": 2892015, "num-goroutines": 27, "put.n": 1762, "put.ns": 1141380, "put.redir.ns": 16596465, "up.ns.time": 30012389406},
   231  "cmdline": ["/bin/aisnode","-config=.ais/ais.json","-local_config=.ais/ais_local.json","-role=target"],
   232  "memstats": {"Alloc":43209256,"TotalAlloc":57770120,"Sys":75056128,"Lookups":0,"Mallocs":215893,"Frees":103090,"HeapAlloc":43209256, ...}
   233  ...
   234  }
   235  ```
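
The JSON returned by `/debug/vars` is straightforward to post-process. The sketch below extracts the `ais.stats` entries whose names end in `.ns` and the per-mountpath utilization from `ais.ios`. It assumes that the `.ns` suffix denotes nanoseconds, which the example output suggests but this document does not state; the function names are hypothetical.

```python
import json


def latencies_ms(debug_vars_json):
    """Return 'ais.stats' entries ending in '.ns', converted to milliseconds.

    Assumes the '.ns' suffix means nanoseconds (an assumption of this sketch).
    """
    stats = json.loads(debug_vars_json).get("ais.stats", {})
    return {
        name[: -len(".ns")]: value / 1e6
        for name, value in stats.items()
        if name.endswith(".ns")
    }


def disk_utilization(debug_vars_json):
    """Return {mountpath: utilization percent} from the 'ais.ios' section."""
    ios = json.loads(debug_vars_json).get("ais.ios", {})
    result = {}
    for key, value in ios.items():
        mountpath, _, suffix = key.rpartition(":")
        if suffix == "util%":
            result[mountpath] = value
    return result
```

The input can be the body of the `curl hostname:port/debug/vars` response shown above, saved to a file or captured from a pipe.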