---
type: proposal
title: Thanos Sharding for Long Term Retention Storage
status: complete
owner: bwplotka
menu: proposals-done
---

### Related Tickets

* https://github.com/thanos-io/thanos/issues/1034 (main)
* https://github.com/thanos-io/thanos/pull/1245 (attempt on compactor)
* https://github.com/thanos-io/thanos/pull/1059 (attempt on store)
* https://github.com/thanos-io/thanos/issues/1318 (subpaths, multitenancy)
* Store issues because of large bucket:
  * https://github.com/thanos-io/thanos/issues/814
  * https://github.com/thanos-io/thanos/issues/1455

## Summary

This document describes the motivation and design for sharding Thanos components in terms of their operations against object storage. Additionally, we touch on the possibility of smarter pre-filtering of shards on the Querier.

## Motivation

Currently, all components that read from object storage assume that all operations and functionality should be performed against **all** blocks present in the bucket's root directory.

In most cases this is totally fine. However, over time, and with the allowance of storing blocks from multiple `Sources` in the same bucket, the number of objects in a bucket can grow drastically.

This means that with time you might want to scale out certain components, e.g.:

* Compactor: A larger number of objects does not matter much on its own; however, the Compactor has to scale (CPU, network) with the number of Sources pushing blocks to the object storage. If multiple Sources are handled by the same Compactor, with a slower network and CPU you might not compact/downsample quickly enough to cope with incoming blocks.
  * This happens a lot if no Compactor is deployed for a longer period, so it has to quickly catch up with a large number of blocks (e.g. a couple of months' worth).
* Store Gateway: Queries against a Store Gateway that touch a large number of Sources might be expensive, so it has to scale with the number of Sources if we expect such queries.
  * Orthogonally, we did not advertise any labels in the Store Gateway's Info call. This means the Querier was not able to do any pre-filtering, so all Store Gateways in the system are always touched for each query.

### Reminder: What is a Source

A `Source` is any component that creates new metrics in the form of Thanos TSDB blocks uploaded to object storage. We differentiate Sources by their `external labels`. Having unique Sources has several benefits:

* Sources do not need to inject "global" source labels (like `cluster, env, replica`) into all metrics. Since those are the same for all metrics produced by a Source, we can assume the whole block has them.
* We can track which blocks are "duplicates", e.g. in HA groups where 2 replicas of Prometheus-es are scraping the same targets.
* We can track which Source produced metrics in case of problems.

Example Sources are: Sidecar, Rule, Thanos Receive.

## Use Cases

We can define a couple of use cases (some of them were already reported by users):

* Scaling out / sharding the Compactor.
* Scaling out / sharding Store Gateways.
* Allowing pre-filtering of queries inside the Querier, thanks to labels advertised in the Info call for all StoreAPIs (!).
* Filtering out a portion of the data: This is useful if you suddenly want to ignore some blocks in case of error/investigation/security.
* Different priorities for different Sources:
  * Some Sources might be more important than others. This might mean different availability and performance SLOs. Being able to split object storage operations across different components helps with that. NOTE: We mean here a per-process priority (e.g. one Store Gateway being more important than another).
## Goals of this design

Our goal for this design is to find and implement a solution for:

* Sharding browsing of metrics from object storage:
  * e.g. Selecting which blocks a Store Gateway should expose.
  * Minimal pre-filtering of which shards the Querier should touch during a query.
* Sharding compaction/downsampling of metrics in object storage.
  * NOTE: We need to be really careful to never have 2 Compactors working on the same Source. This means careful upgrades/configuration changes. There must be documentation for that at least.

## No Goals

* Time partitioning:
  * Store GW allows that already.
* "Merging" Sources together virtually for downsampling/compaction across a single Source that just changes external labels.
* [Bloom filters](https://github.com/thanos-io/thanos/issues/1611) (or custom advertised labels) for application metrics within blocks; advertised labels manipulation.
* Allowing all permutations of object storage setups:
  * User uses multiple object storages for all sources?
  * User uses a single object storage for all sources?
  * User uses any mix of object storages for sources. They put all in different subdirs/subpaths.
* Adding coordination or reconciliation in case of multiple Compactors running on the same "sources", or any form of Compactor HA (e.g. active-passive):
  * Requires a separate design.
* Multi-bucket/multi-prefix support. This is orthogonal.

## Proposal

On each component that works on object storage (e.g. Store GW and Compactor), add a `--selector.relabel-config` flag (and a corresponding `--selector.relabel-config-file`) that will be used to filter out which blocks should be selected for operations. Examples:

* We want to run a Compactor only for blocks with `external_labels` being `cluster=A`. We will run a second Compactor for blocks with `cluster=B` external labels.
* We want to browse only blocks with `external_labels` being `cluster=A` from object storage.
We will run a Store Gateway with a selector of `cluster=A` against the external labels of blocks.

### Relabelling

Similar to [promtail](https://grafana.com/docs/loki/latest/clients/promtail/scraping/#relabeling), this config will follow the native [Prometheus relabel-config](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#relabel_config) syntax.

The relabel config defines the filtering process done on **every** synchronization with object storage.

We will potentially allow manipulating several inputs:

* `__block_id`
* External labels:
  * `<name>`
* `__block_objstore_bucket_endpoint`
* `__block_objstore_bucket_name`
* `__block_objstore_bucket_path`

Output:

* If the output is empty, drop the block.

By default, with an empty relabel-config, all external labels are assumed. Intuitively, blocks without any external labels will be ignored.

Each block should compose a set of labels to advertise. The input should be based on the original meta files, NOT on the output of relabelling. The reasoning is covered in the [`Next Steps`](#future-work) section.

Example usages would be:

* Drop blocks which contain the external label cluster=A:

```yaml
- action: drop
  regex: "A"
  source_labels:
  - cluster
```

* Keep only blocks which contain the external label cluster=A:

```yaml
- action: keep
  regex: "A"
  source_labels:
  - cluster
```

## Work Plan

* Add/import relabel config into Thanos; add relevant logic.
* Hook it up for selecting blocks on the Store Gateway:
  * Advertise the original labels of "approved" blocks as the resulting external labels.
* Hook it up for selecting blocks on the Compactor.
* Add documentation about the following concern: Care must be taken when changing the selection for the Compactor, to ensure only a single Compactor is ever running over each Source's blocks.
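The keep/drop selection described in the Relabelling section can be sketched as follows. This is a minimal illustration in Python with a hypothetical `select_block` helper, not the actual implementation; Thanos applies the Prometheus relabel package in Go, which supports more actions and the full matching semantics:

```python
import re

def select_block(labels: dict, relabel_config: list) -> bool:
    """Hypothetical sketch: apply Prometheus-style keep/drop relabel rules
    to a block's external labels; return True if the block is selected."""
    for rule in relabel_config:
        # Concatenate source label values with ';', as Prometheus does.
        value = ";".join(labels.get(name, "") for name in rule["source_labels"])
        # Prometheus anchors the regex against the full string.
        matched = re.fullmatch(rule["regex"], value) is not None
        if rule["action"] == "keep" and not matched:
            return False  # keep: drop blocks that do NOT match
        if rule["action"] == "drop" and matched:
            return False  # drop: drop blocks that DO match
    return True

# Example: keep only blocks with external label cluster=A.
config = [{"action": "keep", "regex": "A", "source_labels": ["cluster"]}]
```

Under this sketch, a block with `cluster=A` is selected, while a block with `cluster=B` (or no `cluster` label at all) is dropped.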
## Future Work

* Add coordination or reconciliation in case of multiple Compactors running on the same "sources", or any form of Compactor HA (e.g. active-passive):
  * Requires a separate design.
* Allow bloom-like filters: https://github.com/thanos-io/thanos/issues/1611
* Extend relabelling to allow adjusting advertised labels (e.g. for the Store Gateway). For example, external labels might change over time for the same Source; rewriting them would allow hiding that change from users querying the data.

For example:

* Drop the cluster label from the external labels of each block (if present):

```yaml
- action: labeldrop
  regex: "^cluster$"
```

* Add a `datacenter=ABC` external label to the result:

```yaml
- target_label: datacenter
  replacement: ABC
```

Note that the current relabel implementation assumes there is always a single value for each label. This would need to be adjusted for.
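Assuming the same simplified semantics, the advertised-label manipulation in the two examples above could look like the following sketch. `adjust_advertised_labels` is a hypothetical helper for illustration only; it handles just `labeldrop` and static replacement, not the full relabel feature set:

```python
import re

def adjust_advertised_labels(labels: dict, relabel_config: list) -> dict:
    """Hypothetical sketch: apply labeldrop and static-replacement rules
    to a block's advertised external labels, returning the adjusted set."""
    out = dict(labels)
    for rule in relabel_config:
        if rule.get("action") == "labeldrop":
            # Remove every label whose *name* matches the (anchored) regex.
            out = {k: v for k, v in out.items()
                   if re.fullmatch(rule["regex"], k) is None}
        elif "target_label" in rule and "replacement" in rule:
            # Default 'replace' action with no source_labels: set a static label.
            out[rule["target_label"]] = rule["replacement"]
    return out

# The two Future Work examples combined: drop 'cluster', add datacenter=ABC.
config = [
    {"action": "labeldrop", "regex": "^cluster$"},
    {"target_label": "datacenter", "replacement": "ABC"},
]
```

With this config, a block advertising `{cluster="A", env="prod"}` would instead advertise `{env="prod", datacenter="ABC"}`.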