---
type: proposal
title: Thanos Sharding for Long Term Retention Storage
status: complete
owner: bwplotka
menu: proposals-done
---

### Related Tickets

* https://github.com/thanos-io/thanos/issues/1034 (main)
* https://github.com/thanos-io/thanos/pull/1245 (attempt on compactor)
* https://github.com/thanos-io/thanos/pull/1059 (attempt on store)
* https://github.com/thanos-io/thanos/issues/1318 (subpaths, multitenancy)
* Store issues caused by large buckets:
  * https://github.com/thanos-io/thanos/issues/814
  * https://github.com/thanos-io/thanos/issues/1455

## Summary

This document describes the motivation and design of sharding Thanos components in terms of their operations against object storage. Additionally, we touch on the possibility of smarter pre-filtering of shards on the Querier.

## Motivation

Currently, all components that read from object storage assume that all operations and functionality should be performed on **all** blocks present in the bucket's root directory.

In most cases this is totally fine, however with time, and with blocks from multiple `Sources` being stored in the same bucket, the number of objects in a bucket can grow drastically.

This means that over time you might want to scale out certain components, e.g:

* Compactor: A larger number of objects does not matter much on its own, however the Compactor has to scale (CPU, network) with the number of Sources pushing blocks to object storage. If multiple Sources are handled by the same Compactor, with a slower network and CPU you might not compact/downsample quickly enough to cope with incoming blocks.
  * This happens a lot if no Compactor is deployed for a longer period, so it has to quickly catch up with a large number of blocks (e.g. a couple of months' worth).
* Store Gateway: Queries against a Store Gateway which touch a large number of Sources might be expensive, so it has to scale with the number of Sources if we expect such queries.
  * Orthogonally, we did not advertise any labels in the Store Gateway's Info call. This means that the Querier was not able to do any pre-filtering, so all Store Gateways in the system are always touched for each query.

### Reminder: What is a Source

A `Source` is any component that creates new metrics in the form of Thanos TSDB blocks uploaded to object storage. We differentiate Sources by their `external labels`. Having unique Sources has several benefits:

* Sources do not need to inject "global" source labels (like `cluster, env, replica`) into all metrics. Since those are the same for all metrics produced by a Source, we can assume the whole block has them.
* We can track which blocks are "duplicates": e.g. in HA groups, where 2 Prometheus replicas are scraping the same targets.
* We can track which Source produced the metrics in case of problems.

Example Sources are: Sidecar, Rule, Thanos Receive.

## Use Cases

We can then define a couple of use cases (some of them already reported by users):

* Scaling out / sharding the Compactor.
* Scaling out / sharding Store Gateways.
* Allowing pre-filtering of queries inside the Querier - thanks to labels advertised in the Info call for all StoreAPIs (!).
* Filtering out a portion of the data: This is useful if you suddenly want to ignore some blocks in case of an error/investigation/security issue.
* Different priorities for different Sources.
  * Some Sources might be more important than others. This might mean different availability and performance SLOs. Being able to split object storage operations across different components helps with that. NOTE: We mean here a per-process priority (e.g. one Store Gateway being more important than another).

## Goals of this design

Our goal for this design is to find and implement a solution for:

* Sharding the browsing of metrics from object storage:
  * e.g. Selecting which blocks the Store Gateway should expose.
* Minimal pre-filtering of which shards the Querier should touch during a query.
* Sharding compaction/downsampling of metrics in object storage.
  * NOTE: We need to be really careful to never have 2 Compactors working on the same Source. This means careful upgrades/configuration changes. There must be documentation for that at least.

## No Goals

* Time partitioning:
  * The Store Gateway allows that already.
* "Merging" Sources together virtually for downsampling/compaction across a single Source that just changed its external labels.
* [Bloom filters](https://github.com/thanos-io/thanos/issues/1611) (or custom advertised labels) for application metrics within blocks; advertised labels manipulation.
* Allowing all permutations of object storage setups:
  * The user uses multiple object storages for all Sources?
  * The user uses a single object storage for all Sources?
  * The user uses any mix of object storages for Sources, putting them all in different subdirs/subpaths.
* Adding coordination or reconciliation in case of multiple Compactors running on the same "sources", or any form of Compactor HA (e.g. active-passive).
  * Requires a separate design.
* Multi-bucket/multi-prefix support. This is orthogonal.

## Proposal

On each component that works on object storage (e.g. Store GW and Compactor), add a `--selector.relabel-config` flag (and a corresponding `--selector.relabel-config-file`) that will be used to filter out which blocks should be selected for operations. Examples:

* We want to run a Compactor only for blocks with `external_labels` matching `cluster=A`. We will run a second Compactor for blocks with `cluster=B` external labels.
* We want to browse only blocks with `external_labels` matching `cluster=A` from object storage. We will run a Store Gateway with a `cluster=A` selector on the external labels of blocks.

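The first example above could be sketched as a selector config for the `cluster=A` Compactor. This is illustrative only: the flag names follow this proposal, the file name is hypothetical, and the syntax is native Prometheus relabel-config:

```yaml
# selector-a.yaml (illustrative name), passed via
# --selector.relabel-config-file=selector-a.yaml to the first Compactor.
# The second Compactor would use an identical config with regex: "B".
- action: keep
  regex: "A"
  source_labels:
  - cluster
```
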
### Relabelling

Similar to [promtail](https://grafana.com/docs/loki/latest/clients/promtail/scraping/#relabeling), this config will follow the native [Prometheus relabel-config](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#relabel_config) syntax.

The relabel config defines the filtering process performed on **every** synchronization with object storage.

We will potentially allow manipulating several inputs:

* `__block_id`
* External labels:
  * `<name>`
* `__block_objstore_bucket_endpoint`
* `__block_objstore_bucket_name`
* `__block_objstore_bucket_path`

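As a sketch of how these inputs could be used, the following config drops a single known-bad block by its ULID, e.g. during an error investigation (the block ID shown is a made-up example):

```yaml
# Drop one specific block, e.g. while investigating suspect data.
# 01D94ZRM050JQK6NDYNVBNR6WQ is a made-up example ULID.
- action: drop
  regex: "01D94ZRM050JQK6NDYNVBNR6WQ"
  source_labels:
  - __block_id
```
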
Output:

* If the output is empty, drop the block.

By default, with an empty relabel-config, all external labels are assumed. Intuitively, blocks without any external labels will be ignored.

All blocks should compose a set of labels to advertise. The input should be based on the original meta files, NOT on the result of relabelling. The reasoning is covered in the [`Future Work`](#future-work) section.

Example usages would be:

* Drop blocks which contain the external label cluster=A:

```yaml
- action: drop
  regex: "A"
  source_labels:
  - cluster
```

* Keep only blocks which contain the external label cluster=A:

```yaml
- action: keep
  regex: "A"
  source_labels:
  - cluster
```

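Beyond plain keep/drop, the same syntax could shard blocks across N instances of a component, assuming the imported relabel implementation supports the standard `hashmod` action. A sketch for shard 0 of 2 Store Gateways (the temporary label name is illustrative; the other instance would keep `regex: "1"`):

```yaml
# Hash the cluster external label into a temporary shard label,
# then keep only blocks that fall into this instance's shard.
- action: hashmod
  source_labels:
  - cluster
  target_label: __tmp_shard
  modulus: 2
- action: keep
  regex: "0"
  source_labels:
  - __tmp_shard
```
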
## Work Plan

* Add/import relabel config into Thanos, add the relevant logic.
* Hook it up for selecting blocks on the Store Gateway.
  * Advertise the original external labels of "approved" blocks.
* Hook it up for selecting blocks on the Compactor.
  * Add documentation about the following concern: Care must be taken when changing the selection for the Compactor, to ensure only a single Compactor is ever running over each Source's blocks.

## Future Work

* Add coordination or reconciliation in case of multiple Compactors running on the same "sources", or any form of Compactor HA (e.g. active-passive).
  * Requires a separate design.
* Allow bloom-like filters: https://github.com/thanos-io/thanos/issues/1611
* Extend relabelling to allow adjusting the advertised labels (e.g. for the Store Gateway). For example, changing external labels, as they might change over time for the same Source; that would allow hiding the change from users querying the data.

For example:

* Drop the cluster label from the external labels of each block (if present):

```yaml
- action: labeldrop
  regex: "^cluster$"
```

* Add a `datacenter=ABC` external label to the result:

```yaml
- target_label: datacenter
  replacement: ABC
```

Note that the current relabel implementation assumes there is always a single value for each label. This would need to be adjusted.