---
title: Monitoring using Prometheus
description: A guide to monitoring your lakeFS installation with Prometheus.
parent: Reference
redirect_from: /deploying-aws/monitor.md
---

# Monitoring using Prometheus

{: .pb-3 }

{% include toc.html %}

## Example prometheus.yml

lakeFS exposes metrics through the same port used by the lakeFS service, using the standard `/metrics` path.
An example `prometheus.yml` could look like this:

```yaml
scrape_configs:
- job_name: lakeFS
  scrape_interval: 10s
  metrics_path: /metrics
  static_configs:
  - targets:
    - lakefs.example.com:8000
```

## Metrics exposed by lakeFS

By default, the Prometheus client exports metrics with OS process information such as memory and CPU usage.
It also includes Go-specific metrics, such as details about garbage collection and the number of goroutines.
You can learn about these default metrics in this [post](https://povilasv.me/prometheus-go-metrics/){: target="_blank" }.
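
To make the split between default and lakeFS-specific metrics concrete, here is a minimal Python sketch that groups metric names found in a `/metrics` response. The sample text, the `group_metric_names` helper, and the prefix list are assumptions made for illustration and are not part of lakeFS. The prefix check is a rough heuristic: any metric whose name starts with `go_` (for example `go_sql_stats_connections_open`) is grouped with the runtime metrics.

```python
from collections import defaultdict

# Made-up sample in the Prometheus text exposition format.
# The values are illustrative only and were not captured from a real server.
SAMPLE = """\
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 42
process_resident_memory_bytes 1.048576e+08
api_requests_total{code="200",method="GET"} 17
"""

# Assumption: default runtime/process metrics use these name prefixes.
RUNTIME_PREFIXES = ("go_", "process_")


def group_metric_names(text):
    """Group metric names into 'runtime' and 'application' by name prefix."""
    groups = defaultdict(set)
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and '# HELP' / '# TYPE' comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first '{' (labels) or whitespace (value).
        name = line.split("{", 1)[0].split()[0]
        kind = "runtime" if name.startswith(RUNTIME_PREFIXES) else "application"
        groups[kind].add(name)
    return {kind: sorted(names) for kind, names in groups.items()}


if __name__ == "__main__":
    for kind, names in group_metric_names(SAMPLE).items():
        print(kind, names)
```

To try it against a live server, replace `SAMPLE` with the body fetched from your own `/metrics` endpoint.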

In addition, lakeFS exposes the following metrics to help monitor your deployment:

| Name in Prometheus | Description | Labels |
|---|---|---|
| api_requests_total | [lakeFS API](api.html) requests (counter) | **code**: HTTP status<br/>**method**: HTTP method |
| api_request_duration_seconds | Durations of lakeFS API requests (histogram) | **operation**: name of API operation<br/>**code**: HTTP status |
| gateway_request_duration_seconds | Durations of lakeFS [S3-compatible endpoint](s3.md) requests (histogram) | **operation**: name of gateway operation<br/>**code**: HTTP status |
| s3_operation_duration_seconds | Durations of outgoing S3 operations (histogram) | **operation**: operation name<br/>**error**: "true" if error, "false" otherwise |
| gs_operation_duration_seconds | Durations of outgoing Google Storage operations (histogram) | **operation**: operation name<br/>**error**: "true" if error, "false" otherwise |
| azure_operation_duration_seconds | Durations of outgoing Azure storage operations (histogram) | **operation**: operation name<br/>**error**: "true" if error, "false" otherwise |
| kv_request_duration_seconds | Durations of KV requests (histogram) | **operation**: name of KV operation<br/>**type**: KV type (dynamodb, postgres, etc.) |
| dynamo_request_duration_seconds | Time spent doing DynamoDB requests | **operation**: DynamoDB operation name |
| dynamo_consumed_capacity_total | The capacity units consumed by operation | **operation**: DynamoDB operation name |
| dynamo_failures_total | The total number of errors while working with the KV store | **operation**: DynamoDB operation name |
| pgxpool_acquire_count | PostgreSQL cumulative count of successful acquires from the pool | **db_name**: defaults to the KV table name (`kv`) |
| pgxpool_acquire_duration_ns | PostgreSQL total duration of all successful acquires from the pool, in nanoseconds | **db_name**: defaults to the KV table name (`kv`) |
| pgxpool_acquired_conns | PostgreSQL number of currently acquired connections in the pool | **db_name**: defaults to the KV table name (`kv`) |
| pgxpool_canceled_acquire_count | PostgreSQL cumulative count of acquires from the pool that were canceled by a context | **db_name**: defaults to the KV table name (`kv`) |
| pgxpool_constructing_conns | PostgreSQL number of connections with construction in progress in the pool | **db_name**: defaults to the KV table name (`kv`) |
| pgxpool_empty_acquire | PostgreSQL cumulative count of successful acquires from the pool that waited for a resource to be released or constructed because the pool was empty | **db_name**: defaults to the KV table name (`kv`) |
| pgxpool_idle_conns | PostgreSQL number of currently idle connections in the pool | **db_name**: defaults to the KV table name (`kv`) |
| pgxpool_max_conns | PostgreSQL maximum size of the pool | **db_name**: defaults to the KV table name (`kv`) |
| pgxpool_total_conns | PostgreSQL total number of resources currently in the pool | **db_name**: defaults to the KV table name (`kv`) |


## Example queries

**Note:** when using Prometheus functions like [rate](https://prometheus.io/docs/prometheus/latest/querying/functions/#rate){: target="_blank"}
or [increase](https://prometheus.io/docs/prometheus/latest/querying/functions/#increase){: target="_blank"}, results are extrapolated and may not be exact.
{: .note}


### 99th percentile of API request latencies

```
histogram_quantile(0.99, sum by (le, operation) (rate(api_request_duration_seconds_bucket[1m])))
```

### 50th percentile of S3-compatible API latencies

```
histogram_quantile(0.5, sum by (le, operation) (rate(gateway_request_duration_seconds_bucket[1m])))
```

### Number of errors in outgoing S3 requests

```
sum by (operation) (increase(s3_operation_duration_seconds_count{error="true"}[1m]))
```

### Number of open connections to the database

```
go_sql_stats_connections_open
```

### Example Grafana dashboard

[Example Grafana dashboard]({{ site.baseurl }}/assets/img/grafana.png){: target="_blank" }
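
The percentile queries above rely on `histogram_quantile`, which estimates a quantile from cumulative bucket counts rather than from raw observations. The following Python sketch illustrates the underlying idea: linear interpolation inside the bucket that contains the requested rank. It is a simplified illustration, not Prometheus's actual implementation. The `estimate_quantile` helper and the bucket bounds and counts are made up for this example, and edge cases that Prometheus handles (such as `q = 0` or negative bucket bounds) are left out.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) tuples sorted by
    upper_bound, where the last upper bound is +Inf (as in the `le` label).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in the range (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations recorded
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Interpolate linearly between the bucket's lower and upper bounds.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Made-up latency buckets in seconds: (upper bound, cumulative count).
EXAMPLE_BUCKETS = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]

if __name__ == "__main__":
    print(estimate_quantile(0.75, EXAMPLE_BUCKETS))
```

Because the estimate depends on the bucket boundaries, its accuracy is limited by how finely the buckets cover the latency range of interest.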