---
title: Monitoring using Prometheus
description: A guide to monitoring your lakeFS Installation with Prometheus.
parent: Reference
redirect_from: /deploying-aws/monitor.md
---

# Monitoring using Prometheus

{: .pb-3 }

{% include toc.html %}

## Example prometheus.yml

lakeFS exposes metrics through the same port used by the lakeFS service, using the standard `/metrics` path.
An example `prometheus.yml` could look like this:

```yaml
scrape_configs:
- job_name: lakeFS
  scrape_interval: 10s
  metrics_path: /metrics
  static_configs:
  - targets:
    - lakefs.example.com:8000
```
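
Before pointing Prometheus at lakeFS, it can help to confirm that the endpoint responds and to see which metrics it serves. The following is a minimal sketch and not part of lakeFS: the helper names `metric_names` and `fetch_metric_names` are hypothetical, and the default URL reuses the placeholder host from the configuration above and assumes plain HTTP. It downloads the text exposition output and lists the metric names it finds.

```python
import urllib.request


def metric_names(exposition_text):
    """Return the sorted, de-duplicated metric names found in Prometheus text format."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line looks like `name{label="value"} 1.0` or `name 1.0`.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        names.add(name)
    return sorted(names)


def fetch_metric_names(url="http://lakefs.example.com:8000/metrics", timeout=5):
    """Fetch the metrics page (placeholder URL, assumed HTTP) and list its metric names."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return metric_names(response.read().decode("utf-8"))
```

Calling `fetch_metric_names()` against a running installation should list names such as `api_requests_total`. Histogram metrics appear under their `_bucket`, `_sum`, and `_count` sample names.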

## Metrics exposed by lakeFS

By default, the Prometheus Go client used by lakeFS exports OS process metrics such as memory and CPU usage.
It also exports Go runtime metrics, such as garbage collection details and the number of goroutines.
You can learn about these default metrics in this [post](https://povilasv.me/prometheus-go-metrics/){: target="_blank" }.

In addition, lakeFS exposes the following metrics to help monitor your deployment:

| Name in Prometheus               | Description                                                 | Labels
|----------------------------------|-------------------------------------------------------------|-------
| api_requests_total               | [lakeFS API](api.html) requests (counter)                   | **code**: HTTP status<br/>**method**: HTTP method
| api_request_duration_seconds     | Durations of lakeFS API requests (histogram)                | **operation**: name of API operation<br/>**code**: HTTP status
| gateway_request_duration_seconds | Durations of lakeFS [S3-compatible endpoint](s3.md) requests (histogram) | **operation**: name of gateway operation<br/>**code**: HTTP status
| s3_operation_duration_seconds    | Durations of outgoing S3 operations (histogram)             | **operation**: operation name<br/>**error**: "true" if error, "false" otherwise
| gs_operation_duration_seconds    | Durations of outgoing Google Storage operations (histogram) | **operation**: operation name<br/>**error**: "true" if error, "false" otherwise
| azure_operation_duration_seconds | Durations of outgoing Azure storage operations (histogram)  | **operation**: operation name<br/>**error**: "true" if error, "false" otherwise
| kv_request_duration_seconds      | Durations of KV requests (histogram)                        | **operation**: name of KV operation<br/>**type**: KV type (dynamodb, postgres, etc.)
| dynamo_request_duration_seconds  | Time spent doing DynamoDB requests                          | **operation**: DynamoDB operation name
| dynamo_consumed_capacity_total   | Capacity units consumed, per operation                      | **operation**: DynamoDB operation name
| dynamo_failures_total            | Total number of errors encountered while working with the KV store | **operation**: DynamoDB operation name
| pgxpool_acquire_count            | PostgreSQL cumulative count of successful acquires from the pool | **db_name**: defaults to the KV table name (kv)
| pgxpool_acquire_duration_ns      | PostgreSQL total duration of all successful acquires from the pool, in nanoseconds | **db_name**: defaults to the KV table name (kv)
| pgxpool_acquired_conns           | PostgreSQL number of currently acquired connections in the pool | **db_name**: defaults to the KV table name (kv)
| pgxpool_canceled_acquire_count   | PostgreSQL cumulative count of acquires from the pool that were canceled by a context | **db_name**: defaults to the KV table name (kv)
| pgxpool_constructing_conns       | PostgreSQL number of connections with construction in progress in the pool | **db_name**: defaults to the KV table name (kv)
| pgxpool_empty_acquire            | PostgreSQL cumulative count of successful acquires from the pool that waited for a resource to be released or constructed because the pool was empty | **db_name**: defaults to the KV table name (kv)
| pgxpool_idle_conns               | PostgreSQL number of currently idle connections in the pool | **db_name**: defaults to the KV table name (kv)
| pgxpool_max_conns                | PostgreSQL maximum size of the pool                         | **db_name**: defaults to the KV table name (kv)
| pgxpool_total_conns              | PostgreSQL total number of resources currently in the pool  | **db_name**: defaults to the KV table name (kv)

## Example queries

**Note:** when using Prometheus functions like [rate](https://prometheus.io/docs/prometheus/latest/querying/functions/#rate){: target="_blank"}
or [increase](https://prometheus.io/docs/prometheus/latest/querying/functions/#increase){: target="_blank"}, results are extrapolated and may not be exact.
{: .note}

### 99th percentile of API request latencies

```
histogram_quantile(0.99, sum by (le, operation) (rate(api_request_duration_seconds_bucket[1m])))
```
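
To help interpret the result: `histogram_quantile` never sees individual request durations, only cumulative bucket counts, and it interpolates linearly inside the bucket that contains the requested rank. The sketch below is a simplified Python illustration of that estimate, not Prometheus's actual implementation. The function name `estimate_quantile` and the bucket values in the usage note are made up for the example.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) pairs sorted by
    upper_bound, ending with the (math.inf, total_count) bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                # Empty bucket (only reachable when rank is 0): nothing to interpolate.
                return prev_bound
            # Interpolate linearly between the bucket's lower and upper bounds.
            return prev_bound + (upper_bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = upper_bound, count
    return prev_bound
```

For example, with the hypothetical buckets `[(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]`, `estimate_quantile(0.75, buckets)` lands at about 0.35 seconds, inside the 0.1 to 0.5 bucket. The estimate can only be as precise as the bucket boundaries allow, so treat reported percentiles as approximations.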

### 50th percentile of S3-compatible API latencies

```
histogram_quantile(0.5, sum by (le, operation) (rate(gateway_request_duration_seconds_bucket[1m])))
```

### Number of errors in outgoing S3 requests

```
sum by (operation) (increase(s3_operation_duration_seconds_count{error="true"}[1m]))
```

### Number of open connections to the database

```
go_sql_stats_connections_open
```

### Example Grafana dashboard

[![Grafana dashboard example]({{ site.baseurl }}/assets/img/grafana.png)]({{ site.baseurl }}/assets/img/grafana.png){: target="_blank" }