# How to configure Prometheus AlertManager

Alerting with Prometheus is a two-step process: first we set up alerting rules in the Prometheus server, and then we send the resulting alerts to the AlertManager.
Prometheus AlertManager is the component that manages the sending, inhibition, and silencing of alerts generated by Prometheus. The AlertManager can be configured to send alerts to a variety of receivers. Refer to [Prometheus AlertManager receivers](https://prometheus.io/docs/alerting/latest/configuration/#receiver) for more details.

Follow the steps below to enable and use AlertManager.

## Deploy and start AlertManager

Install Prometheus AlertManager from https://prometheus.io/download/ and create a configuration as below:

```yaml
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://127.0.0.1:8010/webhook'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
```

This sample configuration uses a `webhook` at http://127.0.0.1:8010/webhook to post the alerts.
Start the AlertManager; it listens on port `9093` by default. Make sure your webhook is up and listening for the alerts; a minimal receiver is sketched below.
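
For testing, any HTTP endpoint that accepts a POST will do. Below is an illustrative stand-in for your own receiver, sketched in Go: it listens on the `http://127.0.0.1:8010/webhook` address configured above (an assumption carried over from the sample configuration) and simply logs every alert payload that AlertManager delivers:

```go
package main

import (
	"io"
	"log"
	"net/http"
)

func main() {
	// Log every alert payload that AlertManager POSTs to this endpoint.
	http.HandleFunc("/webhook", func(w http.ResponseWriter, r *http.Request) {
		defer r.Body.Close()
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		log.Printf("received alert payload:\n%s", body)
	})
	// Matches the url configured under webhook_configs above.
	log.Fatal(http.ListenAndServe("127.0.0.1:8010", nil))
}
```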

## Configure Prometheus to use AlertManager

Add the below section to your `prometheus.yml`:

```yaml
alerting:
  alertmanagers:
  - static_configs:
    - targets: ['localhost:9093']
rule_files:
  - rules.yml
```

Here `rules.yml` is the file that contains your alerting rule definitions.

## Add rules for your deployment

Below is a sample alerting rules configuration for MinIO. Refer to https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/ for more instructions on writing alerting rules for Prometheus.

```yaml
groups:
- name: example
  rules:
  - alert: MinIOClusterTolerance
    expr: minio_cluster_health_erasure_set_status < 1
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.server }} has lost quorum on pool {{ $labels.pool }} on set {{ $labels.set }}"
      description: "MinIO instance {{ $labels.server }} of job {{ $labels.job }} has lost quorum on pool {{ $labels.pool }} on set {{ $labels.set }} for more than 5 minutes."
```

## Verify the configuration and alerts

To verify the above sample alert, follow the below steps:

1. Start a distributed MinIO instance (4-node setup)
2. Start the Prometheus server and AlertManager
3. Bring down a couple of MinIO instances to bring the erasure set tolerance down to -1, and verify the same with `mc admin prometheus metrics ALIAS | grep minio_cluster_health_erasure_set_status`
4. Wait for 5 minutes (as the alert is configured to fire after 5 minutes), and verify that you see an entry in the webhook for the alert as well as in the Prometheus console, as shown below
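
Besides watching the webhook, you can also ask Prometheus itself for active alerts through its `/api/v1/alerts` HTTP endpoint. The sketch below (in Go, assuming Prometheus listens on `localhost:9090`) fetches that endpoint and prints the raw response; a firing `MinIOClusterTolerance` alert will appear in it:

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	// Ask the Prometheus HTTP API for the currently active alerts.
	resp, err := http.Get("http://localhost:9090/api/v1/alerts")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(body))
}
```

The webhook itself receives a JSON payload like the following: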

```json
{
  "receiver": "web\\.hook",
  "status": "firing",
  "alerts": [
    {
      "status": "firing",
      "labels": {
        "alertname": "MinIOClusterTolerance",
        "instance": "localhost:9000",
        "job": "minio-job-node",
        "pool": "0",
        "server": "127.0.0.1:9000",
        "set": "0",
        "severity": "critical"
      },
      "annotations": {
        "description": "MinIO instance 127.0.0.1:9000 of job minio-job-node has lost quorum on pool 0 on set 0 for more than 5 minutes.",
        "summary": "Instance 127.0.0.1:9000 has lost quorum on pool 0 on set 0"
      },
      "startsAt": "2023-11-18T06:20:09.456Z",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": "http://fedora-minio:9090/graph?g0.expr=minio_cluster_health_erasure_set_status+%3C+1&g0.tab=1",
      "fingerprint": "2255608b0da28ca3"
    }
  ],
  "groupLabels": {
    "alertname": "MinIOClusterTolerance"
  },
  "commonLabels": {
    "alertname": "MinIOClusterTolerance",
    "instance": "localhost:9000",
    "job": "minio-job-node",
    "pool": "0",
    "server": "127.0.0.1:9000",
    "set": "0",
    "severity": "critical"
  },
  "commonAnnotations": {
    "description": "MinIO instance 127.0.0.1:9000 of job minio-job-node has lost quorum on pool 0 on set 0 for more than 5 minutes.",
    "summary": "Instance 127.0.0.1:9000 has lost quorum on pool 0 on set 0"
  },
  "externalURL": "http://fedora-minio:9093",
  "version": "4",
  "groupKey": "{}:{alertname=\"MinIOClusterTolerance\"}",
  "truncatedAlerts": 0
}
```
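
If your webhook handler needs typed access to this payload instead of the raw bytes, the Go structs below match the fields shown in the sample above. The field set is derived from that sample payload rather than from an official AlertManager API type, so treat it as an illustrative sketch:

```go
package webhook

import "time"

// Payload mirrors the top-level webhook JSON shown above.
type Payload struct {
	Receiver          string            `json:"receiver"`
	Status            string            `json:"status"`
	Alerts            []Alert           `json:"alerts"`
	GroupLabels       map[string]string `json:"groupLabels"`
	CommonLabels      map[string]string `json:"commonLabels"`
	CommonAnnotations map[string]string `json:"commonAnnotations"`
	ExternalURL       string            `json:"externalURL"`
	Version           string            `json:"version"`
	GroupKey          string            `json:"groupKey"`
	TruncatedAlerts   int               `json:"truncatedAlerts"`
}

// Alert is a single alert entry within the payload.
type Alert struct {
	Status       string            `json:"status"`
	Labels       map[string]string `json:"labels"`
	Annotations  map[string]string `json:"annotations"`
	StartsAt     time.Time         `json:"startsAt"`
	EndsAt       time.Time         `json:"endsAt"`
	GeneratorURL string            `json:"generatorURL"`
	Fingerprint  string            `json:"fingerprint"`
}
```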

![Prometheus](https://raw.githubusercontent.com/minio/minio/master/docs/metrics/prometheus/minio-es-tolerance-alert.png)