
# Troubleshooting: Common cases

## Overlaps

**Block overlap**: A set of blocks with exactly the same external labels in their `meta.json` that cover the same or overlapping time ranges.

Thanos is designed to never end up with overlapping blocks. This means that (uncontrolled) block overlap should never happen in a healthy and well-configured Thanos system. That's why there is no automatic repair for this. Since it's an unexpected incident:
* All reader components like Store Gateway will handle this gracefully (overlapping samples will be deduplicated).
* Thanos compactor will stop all activities and halt or crash (with a metric and an error log). This is because it cannot perform compaction and downsampling. In an overlap situation we know something unexpected happened (e.g. a manual block upload, some malformed data, etc.), so it's safer to stop or crash-loop (this behaviour is configurable; see the sketch below).
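
Whether the compactor halts or crash-loops on such a critical error is driven by its flags. A minimal sketch, assuming a compactor pointed at your bucket config (paths and values here are illustrative; check `thanos compact --help` for your version):

```shell
# Keep the default behaviour: halt (stay up, stop compacting, expose the halted
# metric) on critical errors such as overlaps, so the cause can be investigated.
thanos compact \
  --data-dir=/var/thanos/compact \
  --objstore.config-file=bucket.yml \
  --wait \
  --debug.halt-on-error=true
```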

Let's take an example:

- `msg="critical error detected; halting" err="compaction failed: compaction: pre compaction overlap check: overlaps found while gathering blocks. [mint: 1555128000000, maxt: 1555135200000, range: 2h0m0s, blocks: 2]: <ulid: 01D94ZRM050JQK6NDYNVBNR6WQ, mint: 1555128000000, maxt: 1555135200000, range: 2h0m0s>, <ulid: 01D8AQXTF2X914S419TYTD4P5B, mint: 1555128000000, maxt: 1555135200000, range: 2h0m0s>`

In this example, the compactor detected 2 overlapping blocks when it halted. What's interesting is that those two blocks look "similar": they cover exactly the same time period. This suggests that the potential reasons are:

* Duplicated upload with a different ULID (non-persistent storage for Prometheus can cause this)
* 2 Prometheus instances are misconfigured and are uploading data with exactly the same external labels. This is wrong; external labels should be unique.

Checking the producers' logs for these ULIDs and comparing their `meta.json` files (e.g. whether the sample stats are the same or not) helps. Checksum the index and [chunks files](../design.md#chunk-file) as well to reveal whether the data is exactly the same and thus OK to be removed manually. You may find the `scripts/thanos-block.jq` script useful when inspecting `meta.json` files, as it translates timestamps to human-readable form.
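
A minimal sketch of such an investigation, assuming the `thanos` binary, `jq`, and local copies of the two blocks from the example above (the bucket config file is a placeholder for your own):

```shell
# List block metadata (ULID, time range, labels, source) straight from the bucket.
thanos tools bucket inspect --objstore.config-file=bucket.yml

# With local copies of both blocks, compare their contents byte by byte.
sha256sum 01D94ZRM050JQK6NDYNVBNR6WQ/index 01D8AQXTF2X914S419TYTD4P5B/index
sha256sum 01D94ZRM050JQK6NDYNVBNR6WQ/chunks/* 01D8AQXTF2X914S419TYTD4P5B/chunks/*

# Render meta.json with human-readable timestamps.
jq -f scripts/thanos-block.jq 01D94ZRM050JQK6NDYNVBNR6WQ/meta.json
```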

### Reasons

- You are running Thanos (sidecar, ruler or receive) older than 0.13.0. During transient upload errors, overlaps can be caused by the compactor not being aware of all blocks. See [this issue](https://github.com/thanos-io/thanos/issues/2753).
- Misconfiguration of sidecar/ruler: the same external labels, or no external labels, across many block producers.
- Running multiple compactors for a single block "stream", even for a short duration.
- Manually uploading blocks to the bucket.
- Eventually consistent block storage, until we fully implement [RW for bucket](../proposals-done/201901-read-write-operations-bucket.md).

### Solutions

- Upgrade sidecar, ruler and receive to 0.13.0+.
- The compactor can stay blocked for some time, but if it is urgent, mitigate by removing the overlap or, better, by backing the block up somewhere else first (you can rename the block ULID to a non-ULID name); see the sketch after this list.
- Who uploaded the block? Search for logs with this ULID across all sidecars/rulers. Check access logs to object storage. Check `debug/metas` or the `meta.json` of the problematic block to see what the block looks like and what its `source` is.
- Determine what you misconfigured.
- If all looks sane and you have double-checked everything, then post an issue on GitHub. Bugs can happen, but we heavily test against such problems.
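
A minimal sketch of the "back up, then remove" mitigation, assuming an S3 bucket and the AWS CLI (bucket names are placeholders; use the equivalent commands for your object storage):

```shell
# Copy the suspect block out of the Thanos bucket under a non-ULID name first,
# so no Thanos component will ever pick the backup up, then delete the original.
aws s3 cp --recursive \
  s3://my-thanos-bucket/01D8AQXTF2X914S419TYTD4P5B \
  s3://my-thanos-backups/backup-01D8AQXTF2X914S419TYTD4P5B
aws s3 rm --recursive s3://my-thanos-bucket/01D8AQXTF2X914S419TYTD4P5B
```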

# Sidecar

## Connection Refused

### Description

```shell
level=warn ts=2020-04-18T03:07:00.512902927Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="request flags against http://localhost:9090/api/v1/status/config: Get \"http://localhost:9090/api/v1/status/config\": dial tcp 127.0.0.1:9090: connect: connection refused"
```

* This issue might happen when Thanos is not configured properly.

### Possible Solution

* Make sure that Prometheus is running when Thanos is started. The `connection refused` error states that there is no server listening on `localhost:9090`, which is the Prometheus address in this case.
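
For reference, a minimal sketch of running both processes side by side (paths and the listen address are illustrative; the important part is that `--prometheus.url` points at a Prometheus that is already up):

```shell
# Prometheus must be listening on the address the sidecar is told to use.
prometheus --config.file=prometheus.yml --web.listen-address=0.0.0.0:9090 &

# Quick check before starting the sidecar: this must return the loaded config.
curl -s http://localhost:9090/api/v1/status/config

thanos sidecar \
  --prometheus.url=http://localhost:9090 \
  --tsdb.path=/var/prometheus/data \
  --objstore.config-file=bucket.yml
```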

## Thanos not identifying Prometheus

### Description

```shell
level=info ts=2020-04-18T03:16:32.158536285Z caller=grpc.go:137 service=gRPC/server component=sidecar msg="internal server shutdown" err="no external labels configured on Prometheus server, uniquely identifying external labels must be configured"
```

* This issue happens when Thanos cannot uniquely identify the Prometheus instance because no external labels are configured.

### Possible Solution

* Thanos requires **unique** `external_labels` for further processing. Make sure that the `external_labels` are not empty and are globally unique in the Prometheus config file. A possible example:

```yml
global:
  external_labels:
    cluster: eu1
    replica: 0
```
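
Once Prometheus has been reloaded with this config, you can confirm that the labels are actually visible on the endpoint the sidecar reads them from (a sketch, assuming Prometheus on `localhost:9090` and `jq` installed):

```shell
# The sidecar derives its external labels from this endpoint; the output must
# show your non-empty, unique label set.
curl -s http://localhost:9090/api/v1/status/config | jq -r .data.yaml | grep -A3 external_labels
```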

# Receiver

## Out-of-bound Error

### Description

#### Thanos Receiver Log

```shell
level=warn ts=2021-05-01T04:57:12.249429787Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_droppped=47
```

### Root Cause

- An "out-of-bound" error occurs when the timestamp of a sample to be written is lower than the minimum acceptable timestamp of the TSDB head.

### Possible Cause

1. Thanos Receiver was stopped previously and has just resumed; the remote Prometheus starts writing from its oldest pending sample, which is too old to be ingested and is hence rejected.
2. Thanos Receiver does not have enough compute resources to ingest the remote write data (it is too slow). The latest ingested sample gradually falls behind the latest scraped samples.

### Diagnostic and Possible Solution

- Check the pod history of Thanos Receiver to see if it is case #1.
- For case #2, if you installed Prometheus using the [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack) helm chart from the Prometheus Community, you can check the "Prometheus / Remote Write" dashboard. If the Rate\[5m\] is above 0 for a long period, it is case #2 and you should consider adding replicas or resources to Thanos Receiver. The same signal can also be queried directly; see the sketch below the screenshot.

<img src="../img/thanos_receiver_troubleshoot_grafana_remote_write.png" class="img-fluid" alt="Example Grafana dashboard showing the falling-behind remote write case"/>
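
If you do not have that dashboard, the underlying signal can be queried directly from the remote-writing Prometheus (a sketch using the standard Prometheus remote write metrics; adjust the address to your setup):

```shell
# Persistently non-zero failure or retry rates mean the receiver is rejecting
# or not keeping up with the remote write stream (case #2).
curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=rate(prometheus_remote_storage_samples_failed_total[5m])'
curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=rate(prometheus_remote_storage_samples_retried_total[5m])'
```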

## Out-of-order Samples Error

### Description

#### Thanos Receiver Log

```shell
level=warn ts=2021-05-01T05:02:23.596022921Z caller=writer.go:92 component=receive component=receive-writer msg="Error on ingesting out-of-order samples" num_dropped=14
```

### Root Cause

- TSDB expects to write samples in chronological order for each series.
- Once a sample with timestamp t1 has been sent to the Thanos Receiver and accepted, any sample with **timestamp t < t1** and an **identical label set** sent to the receiver afterwards will be treated as an out-of-order sample.

### Possible Cause

- The remote Prometheus is running in high-availability mode (more than one replica is running), but the replica_external_label_name is not correctly configured (e.g. it is empty).

<img src="../img/thanos_receiver_troubleshoot_empty_replica_external_label_name.drawio.png" class="img-fluid" alt="Example topology diagram of out-of-order error case caused by empty replica_external_label_name"/>

- The remote Prometheus is running in a federation, and
  - the remote-writing Prometheus is running in HA,
  - the federation has both honor_labels = true and honor_timestamps = true, and
  - all layers of Prometheus use the same replica_external_label_name (e.g. the default "prometheus_replica").

<img src="../img/thanos_receiver_troubleshoot_federation_idential_replica_name.drawio.png" class="img-fluid" alt="Example topology diagram of out-of-order error case caused by misconfigured Prometheus federation"/>

- There are multiple deployments of remote Prometheus, their external_labels are identical (e.g. all empty), and they have metrics with no unique label (e.g. aggregated cluster CPU usage).

<img src="../img/thanos_receiver_troubleshoot_no_external_labels.drawio.png" class="img-fluid" alt="Example topology diagram of out-of-order error case caused by missing external_labels"/>

### Diagnostic

- Enable debug logging on Thanos Receiver (you may need to update a CLI parameter or helm chart values, depending on how you deployed Thanos Receiver). You can then inspect the label set of the out-of-order samples in the debug log of Thanos Receiver; it may give you some insight.
- Inspect the topology and configuration of your Prometheus deployment and see whether they match the possible causes above; a quick way to compare external labels across instances is sketched below.
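
One quick way to spot identical or empty external labels is to compare what each Prometheus instance reports (a sketch; the hostnames are placeholders for your actual replicas):

```shell
# Every instance should report a distinct, non-empty external label set
# (e.g. a unique prometheus_replica value); identical or empty output here
# matches the causes above.
for host in prometheus-0:9090 prometheus-1:9090; do
  echo "== ${host}"
  curl -s "http://${host}/api/v1/status/config" | jq -r .data.yaml | grep -A5 external_labels
done
```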

### Possible Solution

- Configure a distinct set of external_labels for each remote Prometheus deployment.
- Use a different replica_external_label_name for each layer of Prometheus federation (e.g. layer 1: lesser_prometheus_replica, layer 2: main_prometheus_replica).
- Use static, endpoint-based federation in Prometheus if the lesser Prometheus is running in HA (service-monitor-based federation will pull metrics from all lesser Prometheus instances).