# How to configure Prometheus AlertManager

Alerting with Prometheus is a two-step process. First, set up alerting rules in the Prometheus server; then configure Prometheus to send the resulting alerts to the AlertManager.
Prometheus AlertManager is the component that manages sending, inhibition, and silencing of the alerts generated by Prometheus. The AlertManager can be configured to send alerts to a variety of receivers. Refer to [Prometheus AlertManager receivers](https://prometheus.io/docs/alerting/latest/configuration/#receiver) for more details.

Follow the steps below to enable and use AlertManager.

## Deploy and start AlertManager

Install Prometheus AlertManager from https://prometheus.io/download/ and create a configuration as below:

```yaml
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://127.0.0.1:8010/webhook'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
```

This sample configuration posts alerts to a `webhook` at http://127.0.0.1:8010/webhook.
Start the AlertManager; it listens on port `9093` by default. Make sure your webhook is up and listening for the alerts.

## Configure Prometheus to use AlertManager

Add the section below to your `prometheus.yml`:

```yaml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
rule_files:
  - rules.yml
```

Here `rules.yml` is the file that contains the alerting rule definitions.

## Add rules for your deployment

Below is a sample alerting rules configuration for MinIO. Refer to https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/ for more instructions on writing alerting rules for Prometheus.

```yaml
groups:
  - name: example
    rules:
      - alert: MinIOClusterTolerance
        expr: minio_cluster_health_erasure_set_status < 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.server }} has lost quorum on pool {{ $labels.pool }} on set {{ $labels.set }}"
          description: "MinIO instance {{ $labels.server }} of job {{ $labels.job }} has lost quorum on pool {{ $labels.pool }} on set {{ $labels.set }} for more than 5 minutes."
```

## Verify the configuration and alerts

To verify the sample alert above, follow these steps:

1. Start a distributed MinIO instance (4-node setup).
2. Start the Prometheus server and AlertManager.
3. Bring down a couple of MinIO instances to lower the erasure set tolerance to -1, and verify this with `mc admin prometheus metrics ALIAS | grep minio_cluster_health_erasure_set_status`.
4. Wait for 5 minutes (the alert is configured to fire after 5 minutes), then verify that an entry for the alert appears in the webhook as well as in the Prometheus console, as shown below.

```json
{
  "receiver": "web\\.hook",
  "status": "firing",
  "alerts": [
    {
      "status": "firing",
      "labels": {
        "alertname": "MinIOClusterTolerance",
        "instance": "localhost:9000",
        "job": "minio-job-node",
        "pool": "0",
        "server": "127.0.0.1:9000",
        "set": "0",
        "severity": "critical"
      },
      "annotations": {
        "description": "MinIO instance 127.0.0.1:9000 of job minio-job-node has lost quorum on pool 0 on set 0 for more than 5 minutes.",
        "summary": "Instance 127.0.0.1:9000 has lost quorum on pool 0 on set 0"
      },
      "startsAt": "2023-11-18T06:20:09.456Z",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": "http://fedora-minio:9090/graph?g0.expr=minio_cluster_health_erasure_set_status+%3C+1&g0.tab=1",
      "fingerprint": "2255608b0da28ca3"
    }
  ],
  "groupLabels": {
    "alertname": "MinIOClusterTolerance"
  },
  "commonLabels": {
    "alertname": "MinIOClusterTolerance",
    "instance": "localhost:9000",
    "job": "minio-job-node",
    "pool": "0",
    "server": "127.0.0.1:9000",
    "set": "0",
    "severity": "critical"
  },
  "commonAnnotations": {
    "description": "MinIO instance 127.0.0.1:9000 of job minio-job-node has lost quorum on pool 0 on set 0 for more than 5 minutes.",
    "summary": "Instance 127.0.0.1:9000 has lost quorum on pool 0 on set 0"
  },
  "externalURL": "http://fedora-minio:9093",
  "version": "4",
  "groupKey": "{}:{alertname=\"MinIOClusterTolerance\"}",
  "truncatedAlerts": 0
}
```

![Prometheus](https://raw.githubusercontent.com/minio/minio/master/docs/metrics/prometheus/minio-es-tolerance-alert.png)