# Alerts

Here are some example alerts configured for a Kubernetes environment.

## Compaction

```yaml mdox-exec="cat examples/tmp/thanos-compact.yaml"
name: thanos-compact
rules:
- alert: ThanosCompactMultipleRunning
  annotations:
    description: No more than one Thanos Compact instance should be running at once. There are {{$value}} instances running.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanoscompactmultiplerunning
    summary: Thanos Compact has multiple instances running.
  expr: sum by (job) (up{job=~".*thanos-compact.*"}) > 1
  for: 5m
  labels:
    severity: warning
- alert: ThanosCompactHalted
  annotations:
    description: Thanos Compact {{$labels.job}} has failed to run and now is halted.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanoscompacthalted
    summary: Thanos Compact has failed to run and is now halted.
  expr: thanos_compact_halted{job=~".*thanos-compact.*"} == 1
  for: 5m
  labels:
    severity: warning
- alert: ThanosCompactHighCompactionFailures
  annotations:
    description: Thanos Compact {{$labels.job}} is failing to execute {{$value | humanize}}% of compactions.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanoscompacthighcompactionfailures
    summary: Thanos Compact is failing to execute compactions.
  expr: |
    (
      sum by (job) (rate(thanos_compact_group_compactions_failures_total{job=~".*thanos-compact.*"}[5m]))
    /
      sum by (job) (rate(thanos_compact_group_compactions_total{job=~".*thanos-compact.*"}[5m]))
    * 100 > 5
    )
  for: 15m
  labels:
    severity: warning
- alert: ThanosCompactBucketHighOperationFailures
  annotations:
    description: Thanos Compact {{$labels.job}} Bucket is failing to execute {{$value | humanize}}% of operations.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanoscompactbuckethighoperationfailures
    summary: Thanos Compact Bucket is having a high number of operation failures.
  expr: |
    (
      sum by (job) (rate(thanos_objstore_bucket_operation_failures_total{job=~".*thanos-compact.*"}[5m]))
    /
      sum by (job) (rate(thanos_objstore_bucket_operations_total{job=~".*thanos-compact.*"}[5m]))
    * 100 > 5
    )
  for: 15m
  labels:
    severity: warning
- alert: ThanosCompactHasNotRun
  annotations:
    description: Thanos Compact {{$labels.job}} has not uploaded anything for 24 hours.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanoscompacthasnotrun
    summary: Thanos Compact has not uploaded anything for last 24 hours.
  expr: (time() - max by (job) (max_over_time(thanos_objstore_bucket_last_successful_upload_time{job=~".*thanos-compact.*"}[24h]))) / 60 / 60 > 24
  labels:
    severity: warning
```

## Ruler

For Thanos Ruler we run some alerts in local Prometheus, to make sure that Thanos Ruler is working:

```yaml mdox-exec="cat examples/tmp/thanos-rule.yaml"
name: thanos-rule
rules:
- alert: ThanosRuleQueueIsDroppingAlerts
  annotations:
    description: Thanos Rule {{$labels.instance}} is failing to queue alerts.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosrulequeueisdroppingalerts
    summary: Thanos Rule is failing to queue alerts.
  expr: |
    sum by (job, instance) (rate(thanos_alert_queue_alerts_dropped_total{job=~".*thanos-rule.*"}[5m])) > 0
  for: 5m
  labels:
    severity: critical
- alert: ThanosRuleSenderIsFailingAlerts
  annotations:
    description: Thanos Rule {{$labels.instance}} is failing to send alerts to alertmanager.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosrulesenderisfailingalerts
    summary: Thanos Rule is failing to send alerts to alertmanager.
  expr: |
    sum by (job, instance) (rate(thanos_alert_sender_alerts_dropped_total{job=~".*thanos-rule.*"}[5m])) > 0
  for: 5m
  labels:
    severity: critical
- alert: ThanosRuleHighRuleEvaluationFailures
  annotations:
    description: Thanos Rule {{$labels.instance}} is failing to evaluate rules.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosrulehighruleevaluationfailures
    summary: Thanos Rule is failing to evaluate rules.
  expr: |
    (
      sum by (job, instance) (rate(prometheus_rule_evaluation_failures_total{job=~".*thanos-rule.*"}[5m]))
    /
      sum by (job, instance) (rate(prometheus_rule_evaluations_total{job=~".*thanos-rule.*"}[5m]))
    * 100 > 5
    )
  for: 5m
  labels:
    severity: critical
- alert: ThanosRuleHighRuleEvaluationWarnings
  annotations:
    description: Thanos Rule {{$labels.instance}} has high number of evaluation warnings.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosrulehighruleevaluationwarnings
    summary: Thanos Rule has high number of evaluation warnings.
  expr: |
    sum by (job, instance) (rate(thanos_rule_evaluation_with_warnings_total{job=~".*thanos-rule.*"}[5m])) > 0
  for: 15m
  labels:
    severity: info
- alert: ThanosRuleRuleEvaluationLatencyHigh
  annotations:
    description: Thanos Rule {{$labels.instance}} has higher evaluation latency than interval for {{$labels.rule_group}}.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosruleruleevaluationlatencyhigh
    summary: Thanos Rule has high rule evaluation latency.
  expr: |
    (
      sum by (job, instance, rule_group) (prometheus_rule_group_last_duration_seconds{job=~".*thanos-rule.*"})
    >
      sum by (job, instance, rule_group) (prometheus_rule_group_interval_seconds{job=~".*thanos-rule.*"})
    )
  for: 5m
  labels:
    severity: warning
- alert: ThanosRuleGrpcErrorRate
  annotations:
    description: Thanos Rule {{$labels.job}} is failing to handle {{$value | humanize}}% of requests.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosrulegrpcerrorrate
    summary: Thanos Rule is failing to handle grpc requests.
  expr: |
    (
      sum by (job, instance) (rate(grpc_server_handled_total{grpc_code=~"Unknown|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded", job=~".*thanos-rule.*"}[5m]))
    /
      sum by (job, instance) (rate(grpc_server_started_total{job=~".*thanos-rule.*"}[5m]))
    * 100 > 5
    )
  for: 5m
  labels:
    severity: warning
- alert: ThanosRuleConfigReloadFailure
  annotations:
    description: Thanos Rule {{$labels.job}} has not been able to reload its configuration.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosruleconfigreloadfailure
    summary: Thanos Rule has not been able to reload configuration.
  expr: avg by (job, instance) (thanos_rule_config_last_reload_successful{job=~".*thanos-rule.*"}) != 1
  for: 5m
  labels:
    severity: info
- alert: ThanosRuleQueryHighDNSFailures
  annotations:
    description: Thanos Rule {{$labels.job}} has {{$value | humanize}}% of failing DNS queries for query endpoints.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosrulequeryhighdnsfailures
    summary: Thanos Rule is having high number of DNS failures.
  expr: |
    (
      sum by (job, instance) (rate(thanos_rule_query_apis_dns_failures_total{job=~".*thanos-rule.*"}[5m]))
    /
      sum by (job, instance) (rate(thanos_rule_query_apis_dns_lookups_total{job=~".*thanos-rule.*"}[5m]))
    * 100 > 1
    )
  for: 15m
  labels:
    severity: warning
- alert: ThanosRuleAlertmanagerHighDNSFailures
  annotations:
    description: Thanos Rule {{$labels.instance}} has {{$value | humanize}}% of failing DNS queries for Alertmanager endpoints.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosrulealertmanagerhighdnsfailures
    summary: Thanos Rule is having high number of DNS failures.
  expr: |
    (
      sum by (job, instance) (rate(thanos_rule_alertmanagers_dns_failures_total{job=~".*thanos-rule.*"}[5m]))
    /
      sum by (job, instance) (rate(thanos_rule_alertmanagers_dns_lookups_total{job=~".*thanos-rule.*"}[5m]))
    * 100 > 1
    )
  for: 15m
  labels:
    severity: warning
- alert: ThanosRuleNoEvaluationFor10Intervals
  annotations:
    description: Thanos Rule {{$labels.job}} has rule groups that did not evaluate for at least 10x of their expected interval.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosrulenoevaluationfor10intervals
    summary: Thanos Rule has rule groups that did not evaluate for 10 intervals.
  expr: |
    time() - max by (job, instance, group) (prometheus_rule_group_last_evaluation_timestamp_seconds{job=~".*thanos-rule.*"})
    >
    10 * max by (job, instance, group) (prometheus_rule_group_interval_seconds{job=~".*thanos-rule.*"})
  for: 5m
  labels:
    severity: info
- alert: ThanosNoRuleEvaluations
  annotations:
    description: Thanos Rule {{$labels.instance}} did not perform any rule evaluations in the past 10 minutes.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosnoruleevaluations
    summary: Thanos Rule did not perform any rule evaluations.
  expr: |
    sum by (job, instance) (rate(prometheus_rule_evaluations_total{job=~".*thanos-rule.*"}[5m])) <= 0
      and
    sum by (job, instance) (thanos_rule_loaded_rules{job=~".*thanos-rule.*"}) > 0
  for: 5m
  labels:
    severity: critical
```

## Store Gateway

```yaml mdox-exec="cat examples/tmp/thanos-store.yaml"
name: thanos-store
rules:
- alert: ThanosStoreGrpcErrorRate
  annotations:
    description: Thanos Store {{$labels.job}} is failing to handle {{$value | humanize}}% of requests.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosstoregrpcerrorrate
    summary: Thanos Store is failing to handle gRPC requests.
  expr: |
    (
      sum by (job) (rate(grpc_server_handled_total{grpc_code=~"Unknown|Internal|Unavailable|DataLoss|DeadlineExceeded", job=~".*thanos-store.*"}[5m]))
    /
      sum by (job) (rate(grpc_server_started_total{job=~".*thanos-store.*"}[5m]))
    * 100 > 5
    )
  for: 5m
  labels:
    severity: warning
- alert: ThanosStoreSeriesGateLatencyHigh
  annotations:
    description: Thanos Store {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for store series gate requests.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosstoreseriesgatelatencyhigh
    summary: Thanos Store has high latency for store series gate requests.
  expr: |
    (
      histogram_quantile(0.99, sum by (job, le) (rate(thanos_bucket_store_series_gate_duration_seconds_bucket{job=~".*thanos-store.*"}[5m]))) > 2
    and
      sum by (job) (rate(thanos_bucket_store_series_gate_duration_seconds_count{job=~".*thanos-store.*"}[5m])) > 0
    )
  for: 10m
  labels:
    severity: warning
- alert: ThanosStoreBucketHighOperationFailures
  annotations:
    description: Thanos Store {{$labels.job}} Bucket is failing to execute {{$value | humanize}}% of operations.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosstorebuckethighoperationfailures
    summary: Thanos Store Bucket is failing to execute operations.
  expr: |
    (
      sum by (job) (rate(thanos_objstore_bucket_operation_failures_total{job=~".*thanos-store.*"}[5m]))
    /
      sum by (job) (rate(thanos_objstore_bucket_operations_total{job=~".*thanos-store.*"}[5m]))
    * 100 > 5
    )
  for: 15m
  labels:
    severity: warning
- alert: ThanosStoreObjstoreOperationLatencyHigh
  annotations:
    description: Thanos Store {{$labels.job}} Bucket has a 99th percentile latency of {{$value}} seconds for the bucket operations.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosstoreobjstoreoperationlatencyhigh
    summary: Thanos Store is having high latency for bucket operations.
  expr: |
    (
      histogram_quantile(0.99, sum by (job, le) (rate(thanos_objstore_bucket_operation_duration_seconds_bucket{job=~".*thanos-store.*"}[5m]))) > 2
    and
      sum by (job) (rate(thanos_objstore_bucket_operation_duration_seconds_count{job=~".*thanos-store.*"}[5m])) > 0
    )
  for: 10m
  labels:
    severity: warning
```

## Sidecar

```yaml mdox-exec="cat examples/tmp/thanos-sidecar.yaml"
name: thanos-sidecar
rules:
- alert: ThanosSidecarBucketOperationsFailed
  annotations:
    description: Thanos Sidecar {{$labels.instance}} bucket operations are failing
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarbucketoperationsfailed
    summary: Thanos Sidecar bucket operations are failing
  expr: |
    sum by (job, instance) (rate(thanos_objstore_bucket_operation_failures_total{job=~".*thanos-sidecar.*"}[5m])) > 0
  for: 5m
  labels:
    severity: critical
- alert: ThanosSidecarNoConnectionToStartedPrometheus
  annotations:
    description: Thanos Sidecar {{$labels.instance}} is unhealthy.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarnoconnectiontostartedprometheus
    summary: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL.
  expr: |
    thanos_sidecar_prometheus_up{job=~".*thanos-sidecar.*"} == 0
    AND on (namespace, pod)
    prometheus_tsdb_data_replay_duration_seconds != 0
  for: 5m
  labels:
    severity: critical
```

## Query

```yaml mdox-exec="cat examples/tmp/thanos-query.yaml"
name: thanos-query
rules:
- alert: ThanosQueryHttpRequestQueryErrorRateHigh
  annotations:
    description: Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}% of "query" requests.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosqueryhttprequestqueryerrorratehigh
    summary: Thanos Query is failing to handle requests.
  expr: |
    (
      sum by (job) (rate(http_requests_total{code=~"5..", job=~".*thanos-query.*", handler="query"}[5m]))
    /
      sum by (job) (rate(http_requests_total{job=~".*thanos-query.*", handler="query"}[5m]))
    ) * 100 > 5
  for: 5m
  labels:
    severity: critical
- alert: ThanosQueryHttpRequestQueryRangeErrorRateHigh
  annotations:
    description: Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}% of "query_range" requests.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosqueryhttprequestqueryrangeerrorratehigh
    summary: Thanos Query is failing to handle requests.
  expr: |
    (
      sum by (job) (rate(http_requests_total{code=~"5..", job=~".*thanos-query.*", handler="query_range"}[5m]))
    /
      sum by (job) (rate(http_requests_total{job=~".*thanos-query.*", handler="query_range"}[5m]))
    ) * 100 > 5
  for: 5m
  labels:
    severity: critical
- alert: ThanosQueryGrpcServerErrorRate
  annotations:
    description: Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}% of requests.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosquerygrpcservererrorrate
    summary: Thanos Query is failing to handle requests.
  expr: |
    (
      sum by (job) (rate(grpc_server_handled_total{grpc_code=~"Unknown|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded", job=~".*thanos-query.*"}[5m]))
    /
      sum by (job) (rate(grpc_server_started_total{job=~".*thanos-query.*"}[5m]))
    * 100 > 5
    )
  for: 5m
  labels:
    severity: warning
- alert: ThanosQueryGrpcClientErrorRate
  annotations:
    description: Thanos Query {{$labels.job}} is failing to send {{$value | humanize}}% of requests.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosquerygrpcclienterrorrate
    summary: Thanos Query is failing to send requests.
  expr: |
    (
      sum by (job) (rate(grpc_client_handled_total{grpc_code!="OK", job=~".*thanos-query.*"}[5m]))
    /
      sum by (job) (rate(grpc_client_started_total{job=~".*thanos-query.*"}[5m]))
    ) * 100 > 5
  for: 5m
  labels:
    severity: warning
- alert: ThanosQueryHighDNSFailures
  annotations:
    description: Thanos Query {{$labels.job}} has {{$value | humanize}}% of failing DNS queries for store endpoints.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosqueryhighdnsfailures
    summary: Thanos Query is having high number of DNS failures.
  expr: |
    (
      sum by (job) (rate(thanos_query_store_apis_dns_failures_total{job=~".*thanos-query.*"}[5m]))
    /
      sum by (job) (rate(thanos_query_store_apis_dns_lookups_total{job=~".*thanos-query.*"}[5m]))
    ) * 100 > 1
  for: 15m
  labels:
    severity: warning
- alert: ThanosQueryInstantLatencyHigh
  annotations:
    description: Thanos Query {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for instant queries.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosqueryinstantlatencyhigh
    summary: Thanos Query has high latency for queries.
  expr: |
    (
      histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket{job=~".*thanos-query.*", handler="query"}[5m]))) > 40
    and
      sum by (job) (rate(http_request_duration_seconds_bucket{job=~".*thanos-query.*", handler="query"}[5m])) > 0
    )
  for: 10m
  labels:
    severity: critical
- alert: ThanosQueryRangeLatencyHigh
  annotations:
    description: Thanos Query {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for range queries.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosqueryrangelatencyhigh
    summary: Thanos Query has high latency for queries.
  expr: |
    (
      histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket{job=~".*thanos-query.*", handler="query_range"}[5m]))) > 90
    and
      sum by (job) (rate(http_request_duration_seconds_count{job=~".*thanos-query.*", handler="query_range"}[5m])) > 0
    )
  for: 10m
  labels:
    severity: critical
- alert: ThanosQueryOverload
  annotations:
    description: Thanos Query {{$labels.job}} has been overloaded for more than 15 minutes. This may be a symptom of excessive simultaneous complex requests, low performance of the Prometheus API, or failures within these components. Assess the health of the Thanos query instances, the connected Prometheus instances, look for potential senders of these requests and then contact support.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosqueryoverload
    summary: Thanos query reaches its maximum capacity serving concurrent requests.
  expr: |
    (
      max_over_time(thanos_query_concurrent_gate_queries_max[5m]) - avg_over_time(thanos_query_concurrent_gate_queries_in_flight[5m]) < 1
    )
  for: 15m
  labels:
    severity: warning
```

## Receive

```yaml mdox-exec="cat examples/tmp/thanos-receive.yaml"
name: thanos-receive
rules:
- alert: ThanosReceiveHttpRequestErrorRateHigh
  annotations:
    description: Thanos Receive {{$labels.job}} is failing to handle {{$value | humanize}}% of requests.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivehttprequesterrorratehigh
    summary: Thanos Receive is failing to handle requests.
  expr: |
    (
      sum by (job) (rate(http_requests_total{code=~"5..", job=~".*thanos-receive.*", handler="receive"}[5m]))
    /
      sum by (job) (rate(http_requests_total{job=~".*thanos-receive.*", handler="receive"}[5m]))
    ) * 100 > 5
  for: 5m
  labels:
    severity: critical
- alert: ThanosReceiveHttpRequestLatencyHigh
  annotations:
    description: Thanos Receive {{$labels.job}} has a 99th percentile latency of {{ $value }} seconds for requests.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivehttprequestlatencyhigh
    summary: Thanos Receive has high HTTP requests latency.
  expr: |
    (
      histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket{job=~".*thanos-receive.*", handler="receive"}[5m]))) > 10
    and
      sum by (job) (rate(http_request_duration_seconds_count{job=~".*thanos-receive.*", handler="receive"}[5m])) > 0
    )
  for: 10m
  labels:
    severity: critical
- alert: ThanosReceiveHighReplicationFailures
  annotations:
    description: Thanos Receive {{$labels.job}} is failing to replicate {{$value | humanize}}% of requests.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivehighreplicationfailures
    summary: Thanos Receive is having high number of replication failures.
  expr: |
    thanos_receive_replication_factor > 1
      and
    (
      (
        sum by (job) (rate(thanos_receive_replications_total{result="error", job=~".*thanos-receive.*"}[5m]))
      /
        sum by (job) (rate(thanos_receive_replications_total{job=~".*thanos-receive.*"}[5m]))
      )
      >
      (
        max by (job) (floor((thanos_receive_replication_factor{job=~".*thanos-receive.*"}+1) / 2))
      /
        max by (job) (thanos_receive_hashring_nodes{job=~".*thanos-receive.*"})
      )
    ) * 100
  for: 5m
  labels:
    severity: warning
- alert: ThanosReceiveHighForwardRequestFailures
  annotations:
    description: Thanos Receive {{$labels.job}} is failing to forward {{$value | humanize}}% of requests.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivehighforwardrequestfailures
    summary: Thanos Receive is failing to forward requests.
  expr: |
    (
      sum by (job) (rate(thanos_receive_forward_requests_total{result="error", job=~".*thanos-receive.*"}[5m]))
    /
      sum by (job) (rate(thanos_receive_forward_requests_total{job=~".*thanos-receive.*"}[5m]))
    ) * 100 > 20
  for: 5m
  labels:
    severity: info
- alert: ThanosReceiveHighHashringFileRefreshFailures
  annotations:
    description: Thanos Receive {{$labels.job}} is failing to refresh hashring file, {{$value | humanize}} of attempts failed.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivehighhashringfilerefreshfailures
    summary: Thanos Receive is failing to refresh hashring file.
  expr: |
    (
      sum by (job) (rate(thanos_receive_hashrings_file_errors_total{job=~".*thanos-receive.*"}[5m]))
    /
      sum by (job) (rate(thanos_receive_hashrings_file_refreshes_total{job=~".*thanos-receive.*"}[5m]))
    > 0
    )
  for: 15m
  labels:
    severity: warning
- alert: ThanosReceiveConfigReloadFailure
  annotations:
    description: Thanos Receive {{$labels.job}} has not been able to reload hashring configurations.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceiveconfigreloadfailure
    summary: Thanos Receive has not been able to reload configuration.
  expr: avg by (job) (thanos_receive_config_last_reload_successful{job=~".*thanos-receive.*"}) != 1
  for: 5m
  labels:
    severity: warning
- alert: ThanosReceiveNoUpload
  annotations:
    description: Thanos Receive {{$labels.instance}} has not uploaded latest data to object storage.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivenoupload
    summary: Thanos Receive has not uploaded latest data to object storage.
  expr: |
    (up{job=~".*thanos-receive.*"} - 1)
    + on (job, instance) # filters to only alert on current instance last 3h
    (sum by (job, instance) (increase(thanos_shipper_uploads_total{job=~".*thanos-receive.*"}[3h])) == 0)
  for: 3h
  labels:
    severity: critical
- alert: ThanosReceiveLimitsConfigReloadFailure
  annotations:
    description: Thanos Receive {{$labels.job}} has not been able to reload the limits configuration.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivelimitsconfigreloadfailure
    summary: Thanos Receive has not been able to reload the limits configuration.
  expr: sum by(job) (increase(thanos_receive_limits_config_reload_err_total{job=~".*thanos-receive.*"}[5m])) > 0
  for: 5m
  labels:
    severity: warning
- alert: ThanosReceiveLimitsHighMetaMonitoringQueriesFailureRate
  annotations:
    description: Thanos Receive {{$labels.job}} is failing for {{$value | humanize}}% of meta monitoring queries.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivelimitshighmetamonitoringqueriesfailurerate
    summary: Thanos Receive has not been able to update the number of head series.
  expr: (sum by(job) (increase(thanos_receive_metamonitoring_failed_queries_total{job=~".*thanos-receive.*"}[5m])) / 20) * 100 > 20
  for: 5m
  labels:
    severity: warning
- alert: ThanosReceiveTenantLimitedByHeadSeries
  annotations:
    description: Thanos Receive tenant {{$labels.tenant}} is limited by head series.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivetenantlimitedbyheadseries
    summary: A Thanos Receive tenant is limited by head series.
  expr: sum by(job, tenant) (increase(thanos_receive_head_series_limited_requests_total{job=~".*thanos-receive.*"}[5m])) > 0
  for: 5m
  labels:
    severity: warning
```

## Replicate

```yaml mdox-exec="cat examples/tmp/thanos-bucket-replicate.yaml"
name: thanos-bucket-replicate
rules:
- alert: ThanosBucketReplicateErrorRate
  annotations:
    description: Thanos Replicate is failing to run, {{$value | humanize}}% of attempts failed.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosbucketreplicateerrorrate
    summary: Thanos Replicate is failing to run.
  expr: |
    (
      sum by (job) (rate(thanos_replicate_replication_runs_total{result="error", job=~".*thanos-bucket-replicate.*"}[5m]))
    / on (job) group_left
      sum by (job) (rate(thanos_replicate_replication_runs_total{job=~".*thanos-bucket-replicate.*"}[5m]))
    ) * 100 >= 10
  for: 5m
  labels:
    severity: critical
- alert: ThanosBucketReplicateRunLatency
  annotations:
    description: Thanos Replicate {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for the replicate operations.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosbucketreplicaterunlatency
    summary: Thanos Replicate has a high latency for replicate operations.
  expr: |
    (
      histogram_quantile(0.99, sum by (job, le) (rate(thanos_replicate_replication_run_duration_seconds_bucket{job=~".*thanos-bucket-replicate.*"}[5m]))) > 20
    and
      sum by (job) (rate(thanos_replicate_replication_run_duration_seconds_bucket{job=~".*thanos-bucket-replicate.*"}[5m])) > 0
    )
  for: 5m
  labels:
    severity: critical
```

## Extras

### Absent Rules

```yaml mdox-exec="cat examples/tmp/thanos-component-absent.yaml"
name: thanos-component-absent
rules:
- alert: ThanosCompactIsDown
  annotations:
    description: ThanosCompact has disappeared. Prometheus target for the component cannot be discovered.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanoscompactisdown
    summary: Thanos component has disappeared.
  expr: |
    absent(up{job=~".*thanos-compact.*"} == 1)
  for: 5m
  labels:
    severity: critical
- alert: ThanosQueryIsDown
  annotations:
    description: ThanosQuery has disappeared. Prometheus target for the component cannot be discovered.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosqueryisdown
    summary: Thanos component has disappeared.
  expr: |
    absent(up{job=~".*thanos-query.*"} == 1)
  for: 5m
  labels:
    severity: critical
- alert: ThanosReceiveIsDown
  annotations:
    description: ThanosReceive has disappeared. Prometheus target for the component cannot be discovered.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceiveisdown
    summary: Thanos component has disappeared.
  expr: |
    absent(up{job=~".*thanos-receive.*"} == 1)
  for: 5m
  labels:
    severity: critical
- alert: ThanosRuleIsDown
  annotations:
    description: ThanosRule has disappeared. Prometheus target for the component cannot be discovered.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosruleisdown
    summary: Thanos component has disappeared.
  expr: |
    absent(up{job=~".*thanos-rule.*"} == 1)
  for: 5m
  labels:
    severity: critical
- alert: ThanosSidecarIsDown
  annotations:
    description: ThanosSidecar has disappeared. Prometheus target for the component cannot be discovered.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarisdown
    summary: Thanos component has disappeared.
  expr: |
    absent(up{job=~".*thanos-sidecar.*"} == 1)
  for: 5m
  labels:
    severity: critical
- alert: ThanosStoreIsDown
  annotations:
    description: ThanosStore has disappeared. Prometheus target for the component cannot be discovered.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosstoreisdown
    summary: Thanos component has disappeared.
  expr: |
    absent(up{job=~".*thanos-store.*"} == 1)
  for: 5m
  labels:
    severity: critical
```
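
Each group on this page is printed as a bare `name`/`rules` fragment. A Prometheus-style rule file expects such groups to be listed under a top-level `groups:` key, so a fragment has to be wrapped before it can be loaded. The following is a minimal sketch of that wrapping for one alert from the absent rules group. The file name `thanos-component-absent.rules.yaml` is a hypothetical choice and not part of the Thanos repository; the rule body is copied unchanged from the fragment above.

```yaml
# Hypothetical file name: thanos-component-absent.rules.yaml
# Only the top-level `groups:` key and the extra list nesting are added here;
# the alert itself is copied from the thanos-component-absent fragment.
groups:
- name: thanos-component-absent
  rules:
  - alert: ThanosCompactIsDown
    annotations:
      description: ThanosCompact has disappeared. Prometheus target for the component cannot be discovered.
      runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanoscompactisdown
      summary: Thanos component has disappeared.
    expr: |
      absent(up{job=~".*thanos-compact.*"} == 1)
    for: 5m
    labels:
      severity: critical
```

A file shaped like this can be checked with `promtool check rules thanos-component-absent.rules.yaml` before it is loaded. The `job=~".*thanos-compact.*"` selector likely needs adjusting to match the job labels used in your own scrape configuration.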