# Alerts

Here are some example alerts configured for a Kubernetes environment.

## Compaction

```yaml mdox-exec="cat examples/tmp/thanos-compact.yaml"
name: thanos-compact
rules:
- alert: ThanosCompactMultipleRunning
  annotations:
    description: No more than one Thanos Compact instance should be running at once. There are {{$value}} instances running.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanoscompactmultiplerunning
    summary: Thanos Compact has multiple instances running.
  expr: sum by (job) (up{job=~".*thanos-compact.*"}) > 1
  for: 5m
  labels:
    severity: warning
- alert: ThanosCompactHalted
  annotations:
    description: Thanos Compact {{$labels.job}} has failed to run and is now halted.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanoscompacthalted
    summary: Thanos Compact has failed to run and is now halted.
  expr: thanos_compact_halted{job=~".*thanos-compact.*"} == 1
  for: 5m
  labels:
    severity: warning
- alert: ThanosCompactHighCompactionFailures
  annotations:
    description: Thanos Compact {{$labels.job}} is failing to execute {{$value | humanize}}% of compactions.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanoscompacthighcompactionfailures
    summary: Thanos Compact is failing to execute compactions.
  expr: |
    (
      sum by (job) (rate(thanos_compact_group_compactions_failures_total{job=~".*thanos-compact.*"}[5m]))
    /
      sum by (job) (rate(thanos_compact_group_compactions_total{job=~".*thanos-compact.*"}[5m]))
    * 100 > 5
    )
  for: 15m
  labels:
    severity: warning
- alert: ThanosCompactBucketHighOperationFailures
  annotations:
    description: Thanos Compact {{$labels.job}} Bucket is failing to execute {{$value | humanize}}% of operations.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanoscompactbuckethighoperationfailures
    summary: Thanos Compact Bucket is having a high number of operation failures.
  expr: |
    (
      sum by (job) (rate(thanos_objstore_bucket_operation_failures_total{job=~".*thanos-compact.*"}[5m]))
    /
      sum by (job) (rate(thanos_objstore_bucket_operations_total{job=~".*thanos-compact.*"}[5m]))
    * 100 > 5
    )
  for: 15m
  labels:
    severity: warning
- alert: ThanosCompactHasNotRun
  annotations:
    description: Thanos Compact {{$labels.job}} has not uploaded anything for 24 hours.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanoscompacthasnotrun
    summary: Thanos Compact has not uploaded anything for the last 24 hours.
  expr: (time() - max by (job) (max_over_time(thanos_objstore_bucket_last_successful_upload_time{job=~".*thanos-compact.*"}[24h]))) / 60 / 60 > 24
  labels:
    severity: warning
```
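
Each block in this document is a single rule group (a `name` plus its `rules`). To load one into Prometheus directly, the group has to sit under a top-level `groups:` key in a rule file. The following is a minimal sketch of that wrapping, using the first Compact alert as the only rule; the file name in the comment is hypothetical.

```yaml
# thanos-compact.rules.yaml (hypothetical file name)
groups:
- name: thanos-compact
  rules:
  - alert: ThanosCompactMultipleRunning
    annotations:
      description: No more than one Thanos Compact instance should be running at once. There are {{$value}} instances running.
      runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanoscompactmultiplerunning
      summary: Thanos Compact has multiple instances running.
    expr: sum by (job) (up{job=~".*thanos-compact.*"}) > 1
    for: 5m
    labels:
      severity: warning
```

A file shaped like this can be syntax-checked with `promtool check rules <file>` before it is referenced from the `rule_files` list of a Prometheus configuration.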

## Ruler

For Thanos Ruler, we run some alerts in a local Prometheus to make sure that Thanos Ruler is working:

```yaml mdox-exec="cat examples/tmp/thanos-rule.yaml"
name: thanos-rule
rules:
- alert: ThanosRuleQueueIsDroppingAlerts
  annotations:
    description: Thanos Rule {{$labels.instance}} is failing to queue alerts.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosrulequeueisdroppingalerts
    summary: Thanos Rule is failing to queue alerts.
  expr: |
    sum by (job, instance) (rate(thanos_alert_queue_alerts_dropped_total{job=~".*thanos-rule.*"}[5m])) > 0
  for: 5m
  labels:
    severity: critical
- alert: ThanosRuleSenderIsFailingAlerts
  annotations:
    description: Thanos Rule {{$labels.instance}} is failing to send alerts to Alertmanager.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosrulesenderisfailingalerts
    summary: Thanos Rule is failing to send alerts to Alertmanager.
  expr: |
    sum by (job, instance) (rate(thanos_alert_sender_alerts_dropped_total{job=~".*thanos-rule.*"}[5m])) > 0
  for: 5m
  labels:
    severity: critical
- alert: ThanosRuleHighRuleEvaluationFailures
  annotations:
    description: Thanos Rule {{$labels.instance}} is failing to evaluate rules.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosrulehighruleevaluationfailures
    summary: Thanos Rule is failing to evaluate rules.
  expr: |
    (
      sum by (job, instance) (rate(prometheus_rule_evaluation_failures_total{job=~".*thanos-rule.*"}[5m]))
    /
      sum by (job, instance) (rate(prometheus_rule_evaluations_total{job=~".*thanos-rule.*"}[5m]))
    * 100 > 5
    )
  for: 5m
  labels:
    severity: critical
- alert: ThanosRuleHighRuleEvaluationWarnings
  annotations:
    description: Thanos Rule {{$labels.instance}} has a high number of evaluation warnings.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosrulehighruleevaluationwarnings
    summary: Thanos Rule has a high number of evaluation warnings.
  expr: |
    sum by (job, instance) (rate(thanos_rule_evaluation_with_warnings_total{job=~".*thanos-rule.*"}[5m])) > 0
  for: 15m
  labels:
    severity: info
- alert: ThanosRuleRuleEvaluationLatencyHigh
  annotations:
    description: Thanos Rule {{$labels.instance}} has higher evaluation latency than interval for {{$labels.rule_group}}.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosruleruleevaluationlatencyhigh
    summary: Thanos Rule has high rule evaluation latency.
  expr: |
    (
      sum by (job, instance, rule_group) (prometheus_rule_group_last_duration_seconds{job=~".*thanos-rule.*"})
    >
      sum by (job, instance, rule_group) (prometheus_rule_group_interval_seconds{job=~".*thanos-rule.*"})
    )
  for: 5m
  labels:
    severity: warning
- alert: ThanosRuleGrpcErrorRate
  annotations:
    description: Thanos Rule {{$labels.job}} is failing to handle {{$value | humanize}}% of requests.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosrulegrpcerrorrate
    summary: Thanos Rule is failing to handle gRPC requests.
  expr: |
    (
      sum by (job, instance) (rate(grpc_server_handled_total{grpc_code=~"Unknown|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded", job=~".*thanos-rule.*"}[5m]))
    /
      sum by (job, instance) (rate(grpc_server_started_total{job=~".*thanos-rule.*"}[5m]))
    * 100 > 5
    )
  for: 5m
  labels:
    severity: warning
- alert: ThanosRuleConfigReloadFailure
  annotations:
    description: Thanos Rule {{$labels.job}} has not been able to reload its configuration.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosruleconfigreloadfailure
    summary: Thanos Rule has not been able to reload configuration.
  expr: avg by (job, instance) (thanos_rule_config_last_reload_successful{job=~".*thanos-rule.*"}) != 1
  for: 5m
  labels:
    severity: info
- alert: ThanosRuleQueryHighDNSFailures
  annotations:
    description: Thanos Rule {{$labels.job}} has {{$value | humanize}}% of failing DNS queries for query endpoints.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosrulequeryhighdnsfailures
    summary: Thanos Rule is having a high number of DNS failures.
  expr: |
    (
      sum by (job, instance) (rate(thanos_rule_query_apis_dns_failures_total{job=~".*thanos-rule.*"}[5m]))
    /
      sum by (job, instance) (rate(thanos_rule_query_apis_dns_lookups_total{job=~".*thanos-rule.*"}[5m]))
    * 100 > 1
    )
  for: 15m
  labels:
    severity: warning
- alert: ThanosRuleAlertmanagerHighDNSFailures
  annotations:
    description: Thanos Rule {{$labels.instance}} has {{$value | humanize}}% of failing DNS queries for Alertmanager endpoints.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosrulealertmanagerhighdnsfailures
    summary: Thanos Rule is having a high number of DNS failures.
  expr: |
    (
      sum by (job, instance) (rate(thanos_rule_alertmanagers_dns_failures_total{job=~".*thanos-rule.*"}[5m]))
    /
      sum by (job, instance) (rate(thanos_rule_alertmanagers_dns_lookups_total{job=~".*thanos-rule.*"}[5m]))
    * 100 > 1
    )
  for: 15m
  labels:
    severity: warning
- alert: ThanosRuleNoEvaluationFor10Intervals
  annotations:
    description: Thanos Rule {{$labels.job}} has rule groups that did not evaluate for at least 10x their expected interval.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosrulenoevaluationfor10intervals
    summary: Thanos Rule has rule groups that did not evaluate for 10 intervals.
  expr: |
    time() -  max by (job, instance, group) (prometheus_rule_group_last_evaluation_timestamp_seconds{job=~".*thanos-rule.*"})
    >
    10 * max by (job, instance, group) (prometheus_rule_group_interval_seconds{job=~".*thanos-rule.*"})
  for: 5m
  labels:
    severity: info
- alert: ThanosNoRuleEvaluations
  annotations:
    description: Thanos Rule {{$labels.instance}} did not perform any rule evaluations in the past 10 minutes.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosnoruleevaluations
    summary: Thanos Rule did not perform any rule evaluations.
  expr: |
    sum by (job, instance) (rate(prometheus_rule_evaluations_total{job=~".*thanos-rule.*"}[5m])) <= 0
      and
    sum by (job, instance) (thanos_rule_loaded_rules{job=~".*thanos-rule.*"}) > 0
  for: 5m
  labels:
    severity: critical
```
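
Since these examples target Kubernetes, one common way to get such a group into a local Prometheus is a `PrometheusRule` custom resource. The sketch below assumes the Prometheus Operator CRDs are installed, which this document does not state; the resource name, namespace, and selector label are hypothetical and must be adapted to the cluster. Only the first Ruler alert is shown.

```yaml
# Sketch only: assumes the Prometheus Operator is managing Prometheus.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: thanos-rule-alerts      # hypothetical name
  namespace: monitoring         # hypothetical namespace
  labels:
    role: alert-rules           # hypothetical; must match the ruleSelector of your Prometheus resource
spec:
  groups:
  - name: thanos-rule
    rules:
    - alert: ThanosRuleQueueIsDroppingAlerts
      annotations:
        description: Thanos Rule {{$labels.instance}} is failing to queue alerts.
        runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosrulequeueisdroppingalerts
        summary: Thanos Rule is failing to queue alerts.
      expr: |
        sum by (job, instance) (rate(thanos_alert_queue_alerts_dropped_total{job=~".*thanos-rule.*"}[5m])) > 0
      for: 5m
      labels:
        severity: critical
```

The content under `spec.groups` has the same shape as a plain Prometheus rule file, so the remaining rules of the group can be appended to the `rules` list unchanged.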

## Store Gateway

```yaml mdox-exec="cat examples/tmp/thanos-store.yaml"
name: thanos-store
rules:
- alert: ThanosStoreGrpcErrorRate
  annotations:
    description: Thanos Store {{$labels.job}} is failing to handle {{$value | humanize}}% of requests.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosstoregrpcerrorrate
    summary: Thanos Store is failing to handle gRPC requests.
  expr: |
    (
      sum by (job) (rate(grpc_server_handled_total{grpc_code=~"Unknown|Internal|Unavailable|DataLoss|DeadlineExceeded", job=~".*thanos-store.*"}[5m]))
    /
      sum by (job) (rate(grpc_server_started_total{job=~".*thanos-store.*"}[5m]))
    * 100 > 5
    )
  for: 5m
  labels:
    severity: warning
- alert: ThanosStoreSeriesGateLatencyHigh
  annotations:
    description: Thanos Store {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for store series gate requests.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosstoreseriesgatelatencyhigh
    summary: Thanos Store has high latency for store series gate requests.
  expr: |
    (
      histogram_quantile(0.99, sum by (job, le) (rate(thanos_bucket_store_series_gate_duration_seconds_bucket{job=~".*thanos-store.*"}[5m]))) > 2
    and
      sum by (job) (rate(thanos_bucket_store_series_gate_duration_seconds_count{job=~".*thanos-store.*"}[5m])) > 0
    )
  for: 10m
  labels:
    severity: warning
- alert: ThanosStoreBucketHighOperationFailures
  annotations:
    description: Thanos Store {{$labels.job}} Bucket is failing to execute {{$value | humanize}}% of operations.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosstorebuckethighoperationfailures
    summary: Thanos Store Bucket is failing to execute operations.
  expr: |
    (
      sum by (job) (rate(thanos_objstore_bucket_operation_failures_total{job=~".*thanos-store.*"}[5m]))
    /
      sum by (job) (rate(thanos_objstore_bucket_operations_total{job=~".*thanos-store.*"}[5m]))
    * 100 > 5
    )
  for: 15m
  labels:
    severity: warning
- alert: ThanosStoreObjstoreOperationLatencyHigh
  annotations:
    description: Thanos Store {{$labels.job}} Bucket has a 99th percentile latency of {{$value}} seconds for the bucket operations.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosstoreobjstoreoperationlatencyhigh
    summary: Thanos Store is having high latency for bucket operations.
  expr: |
    (
      histogram_quantile(0.99, sum by (job, le) (rate(thanos_objstore_bucket_operation_duration_seconds_bucket{job=~".*thanos-store.*"}[5m]))) > 2
    and
      sum by (job) (rate(thanos_objstore_bucket_operation_duration_seconds_count{job=~".*thanos-store.*"}[5m])) > 0
    )
  for: 10m
  labels:
    severity: warning
```

## Sidecar

```yaml mdox-exec="cat examples/tmp/thanos-sidecar.yaml"
name: thanos-sidecar
rules:
- alert: ThanosSidecarBucketOperationsFailed
  annotations:
    description: Thanos Sidecar {{$labels.instance}} bucket operations are failing.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarbucketoperationsfailed
    summary: Thanos Sidecar bucket operations are failing.
  expr: |
    sum by (job, instance) (rate(thanos_objstore_bucket_operation_failures_total{job=~".*thanos-sidecar.*"}[5m])) > 0
  for: 5m
  labels:
    severity: critical
- alert: ThanosSidecarNoConnectionToStartedPrometheus
  annotations:
    description: Thanos Sidecar {{$labels.instance}} is unhealthy.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarnoconnectiontostartedprometheus
    summary: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL.
  expr: |
    thanos_sidecar_prometheus_up{job=~".*thanos-sidecar.*"} == 0
    AND on (namespace, pod)
    prometheus_tsdb_data_replay_duration_seconds != 0
  for: 5m
  labels:
    severity: critical
```

## Query

```yaml mdox-exec="cat examples/tmp/thanos-query.yaml"
name: thanos-query
rules:
- alert: ThanosQueryHttpRequestQueryErrorRateHigh
  annotations:
    description: Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}% of "query" requests.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosqueryhttprequestqueryerrorratehigh
    summary: Thanos Query is failing to handle requests.
  expr: |
    (
      sum by (job) (rate(http_requests_total{code=~"5..", job=~".*thanos-query.*", handler="query"}[5m]))
    /
      sum by (job) (rate(http_requests_total{job=~".*thanos-query.*", handler="query"}[5m]))
    ) * 100 > 5
  for: 5m
  labels:
    severity: critical
- alert: ThanosQueryHttpRequestQueryRangeErrorRateHigh
  annotations:
    description: Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}% of "query_range" requests.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosqueryhttprequestqueryrangeerrorratehigh
    summary: Thanos Query is failing to handle requests.
  expr: |
    (
      sum by (job) (rate(http_requests_total{code=~"5..", job=~".*thanos-query.*", handler="query_range"}[5m]))
    /
      sum by (job) (rate(http_requests_total{job=~".*thanos-query.*", handler="query_range"}[5m]))
    ) * 100 > 5
  for: 5m
  labels:
    severity: critical
- alert: ThanosQueryGrpcServerErrorRate
  annotations:
    description: Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}% of requests.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosquerygrpcservererrorrate
    summary: Thanos Query is failing to handle requests.
  expr: |
    (
      sum by (job) (rate(grpc_server_handled_total{grpc_code=~"Unknown|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded", job=~".*thanos-query.*"}[5m]))
    /
      sum by (job) (rate(grpc_server_started_total{job=~".*thanos-query.*"}[5m]))
    * 100 > 5
    )
  for: 5m
  labels:
    severity: warning
- alert: ThanosQueryGrpcClientErrorRate
  annotations:
    description: Thanos Query {{$labels.job}} is failing to send {{$value | humanize}}% of requests.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosquerygrpcclienterrorrate
    summary: Thanos Query is failing to send requests.
  expr: |
    (
      sum by (job) (rate(grpc_client_handled_total{grpc_code!="OK", job=~".*thanos-query.*"}[5m]))
    /
      sum by (job) (rate(grpc_client_started_total{job=~".*thanos-query.*"}[5m]))
    ) * 100 > 5
  for: 5m
  labels:
    severity: warning
- alert: ThanosQueryHighDNSFailures
  annotations:
    description: Thanos Query {{$labels.job}} has {{$value | humanize}}% of failing DNS queries for store endpoints.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosqueryhighdnsfailures
    summary: Thanos Query is having a high number of DNS failures.
  expr: |
    (
      sum by (job) (rate(thanos_query_store_apis_dns_failures_total{job=~".*thanos-query.*"}[5m]))
    /
      sum by (job) (rate(thanos_query_store_apis_dns_lookups_total{job=~".*thanos-query.*"}[5m]))
    ) * 100 > 1
  for: 15m
  labels:
    severity: warning
- alert: ThanosQueryInstantLatencyHigh
  annotations:
    description: Thanos Query {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for instant queries.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosqueryinstantlatencyhigh
    summary: Thanos Query has high latency for queries.
  expr: |
    (
      histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket{job=~".*thanos-query.*", handler="query"}[5m]))) > 40
    and
      sum by (job) (rate(http_request_duration_seconds_count{job=~".*thanos-query.*", handler="query"}[5m])) > 0
    )
  for: 10m
  labels:
    severity: critical
- alert: ThanosQueryRangeLatencyHigh
  annotations:
    description: Thanos Query {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for range queries.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosqueryrangelatencyhigh
    summary: Thanos Query has high latency for queries.
  expr: |
    (
      histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket{job=~".*thanos-query.*", handler="query_range"}[5m]))) > 90
    and
      sum by (job) (rate(http_request_duration_seconds_count{job=~".*thanos-query.*", handler="query_range"}[5m])) > 0
    )
  for: 10m
  labels:
    severity: critical
- alert: ThanosQueryOverload
  annotations:
    description: Thanos Query {{$labels.job}} has been overloaded for more than 15 minutes. This may be a symptom of excessive simultaneous complex requests, low performance of the Prometheus API, or failures within these components. Assess the health of the Thanos Query instances and the connected Prometheus instances, look for potential senders of these requests, and then contact support.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosqueryoverload
    summary: Thanos Query reaches its maximum capacity serving concurrent requests.
  expr: |
    (
      max_over_time(thanos_query_concurrent_gate_queries_max[5m]) - avg_over_time(thanos_query_concurrent_gate_queries_in_flight[5m]) < 1
    )
  for: 15m
  labels:
    severity: warning
```

## Receive

```yaml mdox-exec="cat examples/tmp/thanos-receive.yaml"
name: thanos-receive
rules:
- alert: ThanosReceiveHttpRequestErrorRateHigh
  annotations:
    description: Thanos Receive {{$labels.job}} is failing to handle {{$value | humanize}}% of requests.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivehttprequesterrorratehigh
    summary: Thanos Receive is failing to handle requests.
  expr: |
    (
      sum by (job) (rate(http_requests_total{code=~"5..", job=~".*thanos-receive.*", handler="receive"}[5m]))
    /
      sum by (job) (rate(http_requests_total{job=~".*thanos-receive.*", handler="receive"}[5m]))
    ) * 100 > 5
  for: 5m
  labels:
    severity: critical
- alert: ThanosReceiveHttpRequestLatencyHigh
  annotations:
    description: Thanos Receive {{$labels.job}} has a 99th percentile latency of {{ $value }} seconds for requests.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivehttprequestlatencyhigh
    summary: Thanos Receive has high HTTP request latency.
  expr: |
    (
      histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket{job=~".*thanos-receive.*", handler="receive"}[5m]))) > 10
    and
      sum by (job) (rate(http_request_duration_seconds_count{job=~".*thanos-receive.*", handler="receive"}[5m])) > 0
    )
  for: 10m
  labels:
    severity: critical
- alert: ThanosReceiveHighReplicationFailures
  annotations:
    description: Thanos Receive {{$labels.job}} is failing to replicate {{$value | humanize}}% of requests.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivehighreplicationfailures
    summary: Thanos Receive is having a high number of replication failures.
  expr: |
    thanos_receive_replication_factor > 1
      and
    (
      (
        sum by (job) (rate(thanos_receive_replications_total{result="error", job=~".*thanos-receive.*"}[5m]))
      /
        sum by (job) (rate(thanos_receive_replications_total{job=~".*thanos-receive.*"}[5m]))
      )
      >
      (
        max by (job) (floor((thanos_receive_replication_factor{job=~".*thanos-receive.*"}+1) / 2))
      /
        max by (job) (thanos_receive_hashring_nodes{job=~".*thanos-receive.*"})
      )
    ) * 100
  for: 5m
  labels:
    severity: warning
- alert: ThanosReceiveHighForwardRequestFailures
  annotations:
    description: Thanos Receive {{$labels.job}} is failing to forward {{$value | humanize}}% of requests.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivehighforwardrequestfailures
    summary: Thanos Receive is failing to forward requests.
  expr: |
    (
      sum by (job) (rate(thanos_receive_forward_requests_total{result="error", job=~".*thanos-receive.*"}[5m]))
    /
      sum by (job) (rate(thanos_receive_forward_requests_total{job=~".*thanos-receive.*"}[5m]))
    ) * 100 > 20
  for: 5m
  labels:
    severity: info
- alert: ThanosReceiveHighHashringFileRefreshFailures
  annotations:
    description: Thanos Receive {{$labels.job}} is failing to refresh hashring file, {{$value | humanize}} of attempts failed.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivehighhashringfilerefreshfailures
    summary: Thanos Receive is failing to refresh hashring file.
  expr: |
    (
      sum by (job) (rate(thanos_receive_hashrings_file_errors_total{job=~".*thanos-receive.*"}[5m]))
    /
      sum by (job) (rate(thanos_receive_hashrings_file_refreshes_total{job=~".*thanos-receive.*"}[5m]))
    > 0
    )
  for: 15m
  labels:
    severity: warning
- alert: ThanosReceiveConfigReloadFailure
  annotations:
    description: Thanos Receive {{$labels.job}} has not been able to reload hashring configurations.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceiveconfigreloadfailure
    summary: Thanos Receive has not been able to reload configuration.
  expr: avg by (job) (thanos_receive_config_last_reload_successful{job=~".*thanos-receive.*"}) != 1
  for: 5m
  labels:
    severity: warning
- alert: ThanosReceiveNoUpload
  annotations:
    description: Thanos Receive {{$labels.instance}} has not uploaded latest data to object storage.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivenoupload
    summary: Thanos Receive has not uploaded latest data to object storage.
  expr: |
    (up{job=~".*thanos-receive.*"} - 1)
    + on (job, instance) # filters to only alert on current instance last 3h
    (sum by (job, instance) (increase(thanos_shipper_uploads_total{job=~".*thanos-receive.*"}[3h])) == 0)
  for: 3h
  labels:
    severity: critical
- alert: ThanosReceiveLimitsConfigReloadFailure
  annotations:
    description: Thanos Receive {{$labels.job}} has not been able to reload the limits configuration.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivelimitsconfigreloadfailure
    summary: Thanos Receive has not been able to reload the limits configuration.
  expr: sum by(job) (increase(thanos_receive_limits_config_reload_err_total{job=~".*thanos-receive.*"}[5m])) > 0
  for: 5m
  labels:
    severity: warning
- alert: ThanosReceiveLimitsHighMetaMonitoringQueriesFailureRate
  annotations:
    description: Thanos Receive {{$labels.job}} is failing for {{$value | humanize}}% of meta monitoring queries.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivelimitshighmetamonitoringqueriesfailurerate
    summary: Thanos Receive has not been able to update the number of head series.
  expr: (sum by(job) (increase(thanos_receive_metamonitoring_failed_queries_total{job=~".*thanos-receive.*"}[5m])) / 20) * 100 > 20
  for: 5m
  labels:
    severity: warning
- alert: ThanosReceiveTenantLimitedByHeadSeries
  annotations:
    description: Thanos Receive tenant {{$labels.tenant}} is limited by head series.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivetenantlimitedbyheadseries
    summary: A Thanos Receive tenant is limited by head series.
  expr: sum by(job, tenant) (increase(thanos_receive_head_series_limited_requests_total{job=~".*thanos-receive.*"}[5m])) > 0
  for: 5m
  labels:
    severity: warning
```

## Replicate

```yaml mdox-exec="cat examples/tmp/thanos-bucket-replicate.yaml"
name: thanos-bucket-replicate
rules:
- alert: ThanosBucketReplicateErrorRate
  annotations:
    description: Thanos Replicate is failing to run, {{$value | humanize}}% of attempts failed.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosbucketreplicateerrorrate
    summary: Thanos Replicate is failing to run.
  expr: |
    (
      sum by (job) (rate(thanos_replicate_replication_runs_total{result="error", job=~".*thanos-bucket-replicate.*"}[5m]))
    / on (job) group_left
      sum by (job) (rate(thanos_replicate_replication_runs_total{job=~".*thanos-bucket-replicate.*"}[5m]))
    ) * 100 >= 10
  for: 5m
  labels:
    severity: critical
- alert: ThanosBucketReplicateRunLatency
  annotations:
    description: Thanos Replicate {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for the replicate operations.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosbucketreplicaterunlatency
    summary: Thanos Replicate has a high latency for replicate operations.
  expr: |
    (
      histogram_quantile(0.99, sum by (job, le) (rate(thanos_replicate_replication_run_duration_seconds_bucket{job=~".*thanos-bucket-replicate.*"}[5m]))) > 20
    and
      sum by (job) (rate(thanos_replicate_replication_run_duration_seconds_count{job=~".*thanos-bucket-replicate.*"}[5m])) > 0
    )
  for: 5m
  labels:
    severity: critical
```

## Extras

### Absent Rules

```yaml mdox-exec="cat examples/tmp/thanos-component-absent.yaml"
name: thanos-component-absent
rules:
- alert: ThanosCompactIsDown
  annotations:
    description: ThanosCompact has disappeared. Prometheus target for the component cannot be discovered.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanoscompactisdown
    summary: Thanos component has disappeared.
  expr: |
    absent(up{job=~".*thanos-compact.*"} == 1)
  for: 5m
  labels:
    severity: critical
- alert: ThanosQueryIsDown
  annotations:
    description: ThanosQuery has disappeared. Prometheus target for the component cannot be discovered.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosqueryisdown
    summary: Thanos component has disappeared.
  expr: |
    absent(up{job=~".*thanos-query.*"} == 1)
  for: 5m
  labels:
    severity: critical
- alert: ThanosReceiveIsDown
  annotations:
    description: ThanosReceive has disappeared. Prometheus target for the component cannot be discovered.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceiveisdown
    summary: Thanos component has disappeared.
  expr: |
    absent(up{job=~".*thanos-receive.*"} == 1)
  for: 5m
  labels:
    severity: critical
- alert: ThanosRuleIsDown
  annotations:
    description: ThanosRule has disappeared. Prometheus target for the component cannot be discovered.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosruleisdown
    summary: Thanos component has disappeared.
  expr: |
    absent(up{job=~".*thanos-rule.*"} == 1)
  for: 5m
  labels:
    severity: critical
- alert: ThanosSidecarIsDown
  annotations:
    description: ThanosSidecar has disappeared. Prometheus target for the component cannot be discovered.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarisdown
    summary: Thanos component has disappeared.
  expr: |
    absent(up{job=~".*thanos-sidecar.*"} == 1)
  for: 5m
  labels:
    severity: critical
- alert: ThanosStoreIsDown
  annotations:
    description: ThanosStore has disappeared. Prometheus target for the component cannot be discovered.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosstoreisdown
    summary: Thanos component has disappeared.
  expr: |
    absent(up{job=~".*thanos-store.*"} == 1)
  for: 5m
  labels:
    severity: critical
```
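
The absent rules can be exercised offline with `promtool test rules`. The sketch below is a possible unit test for `ThanosStoreIsDown`. It assumes the `thanos-component-absent` group has been saved under a top-level `groups:` key in a rule file; both file names and the `instance` value are hypothetical. Because `absent()` is applied to a comparison using a regex matcher, the resulting alert should carry only the `severity` label.

```yaml
# thanos-component-absent_test.yaml (hypothetical file name)
rule_files:
  - thanos-component-absent.rules.yaml   # hypothetical; wraps the group above under `groups:`
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # The Store Gateway target exists but reports up == 0 for 10 minutes,
      # so `up{...} == 1` returns nothing and absent() yields 1.
      - series: 'up{job="thanos-store", instance="store-0"}'
        values: '0+0x10'
    alert_rule_test:
      # The rule has `for: 5m`, so the alert is expected to be firing by 10m.
      - eval_time: 10m
        alertname: ThanosStoreIsDown
        exp_alerts:
          - exp_labels:
              severity: critical
            exp_annotations:
              description: ThanosStore has disappeared. Prometheus target for the component cannot be discovered.
              runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosstoreisdown
              summary: Thanos component has disappeared.
```

Running `promtool test rules thanos-component-absent_test.yaml` evaluates the rules against the synthetic series. Only the alert named in `alertname` is checked, so the other absent alerts in the group, which also fire here because no `up` series exist for them, do not affect the result.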