---
type: proposal
title: Thanos Sharding for Long Term Retention Storage
status: complete
owner: bwplotka
menu: proposals-done
---

### Related Tickets

* https://github.com/thanos-io/thanos/issues/1034 (main)
* https://github.com/thanos-io/thanos/pull/1245 (attempt on compactor)
* https://github.com/thanos-io/thanos/pull/1059 (attempt on store)
* https://github.com/thanos-io/thanos/issues/1318 (subpaths, multitenancy)
* Store issues because of large bucket:
  * https://github.com/thanos-io/thanos/issues/814
  * https://github.com/thanos-io/thanos/issues/1455

## Summary

This document describes the motivation and design for sharding Thanos components in terms of their operations against object storage. Additionally, we touch on the possibility of smarter pre-filtering of shards on the Querier.

## Motivation

Currently, all components that read from object storage assume that all operations and functionality should be performed against **all** blocks present in the bucket's root directory.

In most cases this is totally fine. However, over time, and with the allowance of storing blocks from multiple `Sources` in the same bucket, the number of objects in a bucket can grow drastically.

This means that with time you might want to scale out certain components, e.g.:

* Compactor: A larger number of objects does not matter much on its own; however, the Compactor has to scale (CPU, network) with the number of Sources pushing blocks to the object storage. If multiple Sources are handled by the same Compactor, with a slower network and CPU you might not compact/downsample quickly enough to cope with incoming blocks.
  * This happens a lot if no Compactor is deployed for a longer period, so it has to quickly catch up with a large number of blocks (e.g. a couple of months' worth).
* Store Gateway: Queries against a Store Gateway that touch a large number of Sources might be expensive, so it has to scale with the number of Sources if we expect such queries.
  * Orthogonally, we did not advertise any labels in the Store Gateway's Info call. This means the Querier was not able to do any pre-filtering, so all Store Gateways in the system are always touched for each query.

### Reminder: What is a Source

A `Source` is any component that creates new metrics in the form of Thanos TSDB blocks uploaded to object storage. We differentiate Sources by their `external labels`. Having unique Sources has several benefits:

* Sources do not need to inject "global" source labels (like `cluster, env, replica`) into all metrics. Since those are the same for all metrics produced by a Source, we can assume the whole block has them.
* We can track which blocks are "duplicates", e.g. in HA groups where 2 replicas of Prometheus-es are scraping the same targets.
* We can track which Source produced metrics in case of problems.

Example Sources are: Sidecar, Rule, Thanos Receive.

## Use Cases

We can define a couple of use cases (some of them were already reported by users):

* Scaling out / sharding the Compactor.
* Scaling out / sharding Store Gateways.
* Allowing pre-filtering of queries inside the Querier, thanks to labels advertised in the Info call for all StoreAPIs (!).
* Filtering out a portion of the data: This is useful if you suddenly want to ignore some blocks in case of error/investigation/security.
* Different priorities for different Sources:
  * Some Sources might be more important than others. This might mean different availability and performance SLOs. Being able to split object storage operations across different components helps with that. NOTE: We mean here a per-process priority (e.g. one Store Gateway being more important than another).
## Goals of this design

Our goal for this design is to find and implement a solution for:

* Sharding browsing of metrics from object storage:
  * e.g. Selecting which blocks a Store Gateway should expose.
  * Minimal pre-filtering of which shards the Querier should touch during a query.
* Sharding compaction/downsampling of metrics in object storage.
  * NOTE: We need to be really careful to never have 2 Compactors working on the same Source. This means careful upgrades/configuration changes. There must be documentation for that at least.

## No Goals

* Time partitioning:
  * Store GW allows that already.
* "Merging" Sources together virtually for downsampling/compaction across a single Source that just changes external labels.
* [Bloom filters](https://github.com/thanos-io/thanos/issues/1611) (or custom advertised labels) for application metrics within blocks; advertised labels manipulation.
* Allowing all permutations of object storage setups:
  * User uses multiple object storages for all sources?
  * User uses a single object storage for all sources?
  * User uses any mix of object storages for sources. They put all in different subdirs/subpaths.
* Adding coordination or reconciliation in case of multiple Compactors running on the same "sources", or any form of Compactor HA (e.g. active-passive):
  * Requires a separate design.
* Multi-bucket/multi-prefix support. This is orthogonal.

## Proposal

On each component that works on object storage (e.g. Store GW and Compactor), add a `--selector.relabel-config` flag (and a corresponding `--selector.relabel-config-file`) that will be used to filter out which blocks should be selected for operations. Examples:

* We want to run a Compactor only for blocks with `external_labels` being `cluster=A`. We will run a second Compactor for blocks with `cluster=B` external labels.
* We want to browse only blocks with `external_labels` being `cluster=A` from object storage.
We will run a Store Gateway with a selector of `cluster=A` against the external labels of blocks.

### Relabelling

Similar to [promtail](https://grafana.com/docs/loki/latest/clients/promtail/scraping/#relabeling), this config will follow the native [Prometheus relabel-config](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#relabel_config) syntax.

The relabel config defines the filtering process done on **every** synchronization with object storage.

We will potentially allow manipulating several inputs:

* `__block_id`
* External labels:
  * `<name>`
* `__block_objstore_bucket_endpoint`
* `__block_objstore_bucket_name`
* `__block_objstore_bucket_path`

Output:

* If the output is empty, drop the block.

By default, with an empty relabel-config, all external labels are assumed. Intuitively, blocks without any external labels will be ignored.

Each block should compose a set of labels to advertise. The input should be based on the original meta files, NOT on the output of relabelling. The reasoning is covered in the [`Next Steps`](#future-work) section.

Example usages would be:

* Drop blocks which contain the external label cluster=A:

```yaml
- action: drop
  regex: "A"
  source_labels:
  - cluster
```

* Keep only blocks which contain the external label cluster=A:

```yaml
- action: keep
  regex: "A"
  source_labels:
  - cluster
```

## Work Plan

* Add/import relabel config into Thanos; add relevant logic.
* Hook it up for selecting blocks on the Store Gateway:
  * Advertise the original labels of "approved" blocks as the resulting external labels.
* Hook it up for selecting blocks on the Compactor.
* Add documentation about the following concern: Care must be taken when changing the selection for the Compactor, to ensure only a single Compactor is ever running over each Source's blocks.
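The keep/drop selection described in the Relabelling section can be sketched as follows. This is a minimal illustration in Python with a hypothetical `select_block` helper, not the actual implementation; Thanos applies the Prometheus relabel package in Go, which supports more actions and the full matching semantics:

```python
import re

def select_block(labels: dict, relabel_config: list) -> bool:
    """Hypothetical sketch: apply Prometheus-style keep/drop relabel rules
    to a block's external labels; return True if the block is selected."""
    for rule in relabel_config:
        # Concatenate source label values with ';', as Prometheus does.
        value = ";".join(labels.get(name, "") for name in rule["source_labels"])
        # Prometheus anchors the regex against the full string.
        matched = re.fullmatch(rule["regex"], value) is not None
        if rule["action"] == "keep" and not matched:
            return False  # keep: drop blocks that do NOT match
        if rule["action"] == "drop" and matched:
            return False  # drop: drop blocks that DO match
    return True

# Example: keep only blocks with external label cluster=A.
config = [{"action": "keep", "regex": "A", "source_labels": ["cluster"]}]
```

Under this sketch, a block with `cluster=A` is selected, while a block with `cluster=B` (or no `cluster` label at all) is dropped.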
## Future Work

* Add coordination or reconciliation in case of multiple Compactors running on the same "sources", or any form of Compactor HA (e.g. active-passive):
  * Requires a separate design.
* Allow bloom-like filters: https://github.com/thanos-io/thanos/issues/1611
* Extend relabelling to allow adjusting advertised labels (e.g. for the Store Gateway). For example, external labels might change over time for the same Source; rewriting them would allow hiding that change from users querying the data.

For example:

* Drop the cluster label from the external labels of each block (if present):

```yaml
- action: labeldrop
  regex: "^cluster$"
```

* Add a `datacenter=ABC` external label to the result:

```yaml
- target_label: datacenter
  replacement: ABC
```

Note that the current relabel implementation assumes there is always a single value for each label. This would need to be adjusted for.
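Assuming the same simplified semantics, the advertised-label manipulation in the two examples above could look like the following sketch. `adjust_advertised_labels` is a hypothetical helper for illustration only; it handles just `labeldrop` and static replacement, not the full relabel feature set:

```python
import re

def adjust_advertised_labels(labels: dict, relabel_config: list) -> dict:
    """Hypothetical sketch: apply labeldrop and static-replacement rules
    to a block's advertised external labels, returning the adjusted set."""
    out = dict(labels)
    for rule in relabel_config:
        if rule.get("action") == "labeldrop":
            # Remove every label whose *name* matches the (anchored) regex.
            out = {k: v for k, v in out.items()
                   if re.fullmatch(rule["regex"], k) is None}
        elif "target_label" in rule and "replacement" in rule:
            # Default 'replace' action with no source_labels: set a static label.
            out[rule["target_label"]] = rule["replacement"]
    return out

# The two Future Work examples combined: drop 'cluster', add datacenter=ABC.
config = [
    {"action": "labeldrop", "regex": "^cluster$"},
    {"target_label": "datacenter", "replacement": "ABC"},
]
```

With this config, a block advertising `{cluster="A", env="prod"}` would instead advertise `{env="prod", datacenter="ABC"}`.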