---
type: proposal
title: Avoid Global Sort on Querier Select
status: approved
owner: bwplotka,fpetkovski
menu: proposals-accepted
---

## Avoid Global Sort on Querier Select

* **Related Tickets:**
  * https://github.com/thanos-io/thanos/issues/5719
  * https://github.com/thanos-io/thanos/commit/043c5bfcc2464d3ae7af82a1428f6e0d6510f020
  * https://github.com/thanos-io/thanos/pull/5796 (see also the alternative in https://github.com/thanos-io/thanos/pull/5692)

> TL;DR: We propose a solution that reduces query and query_range latency on common setups where deduplication is enabled and data is replicated. Initial benchmarks indicate ~20% latency improvement for data replicated 2 times.
>
> To make it work, we propose adding a field to the Store API Series call, "WithoutReplicaLabels []string", guarded by a "SupportsWithoutReplicaLabels" field propagated via the Info API. It allows telling store implementations to remove the given labels (if they are replica labels) from the result, preserving sorting by labels after the removal.
>
> NOTE: This change will break unlikely setups that deduplicate on a non-replica label (misconfiguration or wrong setup).

## Glossary

**replica**: We use the term "replica labels" for a subset of (or all of) the "external labels": labels that indicate a unique replication group for our data, usually taken from the metadata about its origin/source.

## Why

Currently, we spend a lot of storage selection CPU time on re-sorting the resulting time series, which is needed for deduplication (specifically in [`sortDedupLabels`](https://github.com/thanos-io/thanos/blob/main/pkg/query/querier.go#L400)). However, given the distributed nature of the work and the current sorting guarantees of StoreAPI, there is potential to reduce the sorting effort and/or distribute it to leaves or multiple threads.
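To see why a re-sort is needed at all, the following minimal Go sketch shows how a merged, correctly sorted result can stop being sorted once the replica label is dropped. The `Label`, `Series`, `compare`, `dropLabel` and `isSorted` names are hypothetical, simplified stand-ins and not the Thanos or Prometheus types; the ordering is assumed to be lexicographic by label name, then value.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Label and Series are simplified, hypothetical stand-ins for real label types.
// A Series holds its labels sorted by name.
type Label struct{ Name, Value string }
type Series []Label

// compare orders two label sets label by label: first by name, then by value.
func compare(a, b Series) int {
	for i := 0; i < len(a) && i < len(b); i++ {
		if c := strings.Compare(a[i].Name, b[i].Name); c != 0 {
			return c
		}
		if c := strings.Compare(a[i].Value, b[i].Value); c != 0 {
			return c
		}
	}
	return len(a) - len(b)
}

// dropLabel returns a copy of s without the label called name.
func dropLabel(s Series, name string) Series {
	out := make(Series, 0, len(s))
	for _, l := range s {
		if l.Name != name {
			out = append(out, l)
		}
	}
	return out
}

// isSorted reports whether the series set is ordered according to compare.
func isSorted(set []Series) bool {
	return sort.SliceIsSorted(set, func(i, j int) bool { return compare(set[i], set[j]) < 0 })
}

func main() {
	// Two series from different replicas, sorted while "replica" is present.
	merged := []Series{
		{{"__name__", "up"}, {"replica", "a"}, {"zone", "2"}},
		{{"__name__", "up"}, {"replica", "b"}, {"zone", "1"}},
	}
	fmt.Println("sorted with replica label:", isSorted(merged))

	// Dropping the replica label leaves zone="2" ahead of zone="1".
	stripped := make([]Series, len(merged))
	for i, s := range merged {
		stripped[i] = dropLabel(s, "replica")
	}
	fmt.Println("sorted without replica label:", isSorted(stripped))
}
```

Because `replica` sorts between `__name__` and `zone`, it decides the order while present; once it is removed, the remaining labels are out of order, which is what forces a sort somewhere before deduplication.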

### Pitfalls of the current solution

The current flow can be represented as follows:

![img.png](../img/bottleneck-globalsort.png)

1. Querier PromQL Engine selects data. At this point we know whether users asked for deduplicated data or not, and [what replica labels to use](https://thanos.io/tip/components/query.md/#deduplication-replica-labels).
2. Querier selection asks the internal, in-process Store API, which is represented by the Proxy code component. It asks the relevant Store APIs for data, using StoreAPI.Series.
3. Responses are pulled and k-way merged by time series. StoreAPI guarantees that the responses are sorted by series and that the external labels (including replica labels) are included in the time series.
   * There was a [bug in receiver](https://github.com/thanos-io/thanos/commit/043c5bfcc2464d3ae7af82a1428f6e0d6510f020#diff-b3f73a54121d88de203946e84955da7027e3cfce7f0cd82580bf215ac57c02f4) that caused series to be returned unsorted. Fixed in v0.29.0.
4. Querier selection waits until all responses are buffered and then deduplicates the data, given the requested replica labels. Before doing so, it globally sorts the data, moving the replica labels to the end of each series' label set, in `sortDedupLabels`.
5. Data is deduplicated using the `dedup` package.

The pitfall is that the global sort can in many cases be avoided completely, even when deduplication is enabled. Many StoreAPIs can drop certain replica labels without needing to re-sort, and others can k-way merge different data sets without certain replica labels at no extra cost.

## Goals

Goals and use cases for the solution as proposed in [How](#how):

* Avoid the expensive global sort of all series before passing them to the PromQL engine in Querier.
* Allow a StoreAPI implementation to announce whether it supports the sorting feature or not. The rationale is that we want to make it possible to create simpler StoreAPI servers, if the operator is willing to trade that simplicity for latency.
* Clarify the behaviour in tricky cases where there is an overlap of replica labels between what's in TSDB and what's attached as external labels.
* Ensure this change can be rolled out in a compatible way.

## Non-Goals

* Allow consuming series in a streamed way in the PromQL engine.
  * While this pitfall (global sort) blocks the above idea, it's currently still more beneficial to pull all series upfront (eager approach) as soon as possible. This is due to the current PromQL architecture, which requires info upfront for query planners and execution. We don't plan to change it yet, so there is no need to push explicitly for that.

## How

### Invariants

To understand the proposal, let's go through some important, yet perhaps non-trivial, facts:

* For a StoreAPI, or generally for data that belongs to one replica, excluding a certain replica label during the sort does not impact the sorting order of the returned series. This means that any feature requiring a different sort order for replicated series is generally a no-op for sidecars, rules, a single-tenant receiver, or within a single block (or one stream of blocks).
* You can't stream the sorting of unsorted data. Furthermore, it's not possible to detect that data is unsorted unless we fetch and buffer all series.
* In v0.29 and below, you can deduplicate on any labels, including non-replica ones. This is considered semantically wrong, yet someone might depend on it.
* Thanos never handled overlap of chunks within one set of Store API responses.

### Solution

To avoid the global sort, we propose removing the requested replica labels and sorting at the Store API level.

For the first step (which is required for compatibility purposes anyway), we propose logic in the proxy Store API implementation that, when deduplication is requested with given replica labels, will:

* Fall back to eager retrieval.
* Remove the given labels from series (this can remove non-replica labels too, same as is possible now).
* Re-sort all series (just on the local level).

Thanks to that, the k-way merge will sort based on series without replica labels, which allows the querier dedup to be done in a streaming way, without the global sort and replica label removal.

As the second step, we propose adding a `without_replica_labels` field to the `SeriesRequest` proto message of Store API:

```protobuf
message SeriesRequest {
  // ...

  // without_replica_labels are replica labels which have to be excluded from series set results.
  // The sorting requirement has to be preserved, so series should be sorted without those labels.
  // If the requested label is NOT a replica label (labels that identify replication group) it should be not affected by
  // this setting (label should be included in sorting and response).
  // It is the server responsibility to detect and track what is replica label and what is not.
  // This allows faster deduplication by clients.
  // NOTE(bwplotka): thanos.info.store.supports_without_replica_labels field has to return true to let the client know
  // the server supports it.
  repeated string without_replica_labels = 14;
}
```

Since it's a new field, for compatibility we also propose adding `supports_without_replica_labels` to InfoAPI, so that a server can explicitly indicate that it supports it.

```protobuf
// StoreInfo holds the metadata related to Store API exposed by the component.
message StoreInfo {
  reserved 4; // Deprecated send_sorted, replaced by supports_without_replica_labels now.

  int64 min_time = 1;
  int64 max_time = 2;
  bool supports_sharding = 3;

  // supports_without_replica_labels means this store supports without_replica_labels of StoreAPI.Series.
  bool supports_without_replica_labels = 5;
}
```

Thanks to that, implementations can optionally support this feature. We can make all Thanos StoreAPIs support it, which will allow faster deduplication queries on all types of setups.
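The flow described in this section can be sketched in Go as follows. This is a simplified illustration under stated assumptions, not the Thanos implementation: `Label`, `Series`, `stripAndSort`, `cursor`, `mergeHeap` and `mergeDedup` are hypothetical names, series carry no chunks, and "deduplication" here only collapses adjacent series with identical label sets.

```go
package main

import (
	"container/heap"
	"fmt"
	"sort"
	"strings"
)

// Label and Series are hypothetical, simplified stand-ins (no chunks).
type Label struct{ Name, Value string }
type Series []Label

// compare orders label sets label by label: first by name, then by value.
func compare(a, b Series) int {
	for i := 0; i < len(a) && i < len(b); i++ {
		if c := strings.Compare(a[i].Name, b[i].Name); c != 0 {
			return c
		}
		if c := strings.Compare(a[i].Value, b[i].Value); c != 0 {
			return c
		}
	}
	return len(a) - len(b)
}

// stripAndSort drops the requested labels from every series of one store
// response and re-sorts only that response (a local sort, not a global one).
func stripAndSort(resp []Series, without map[string]bool) []Series {
	out := make([]Series, 0, len(resp))
	for _, s := range resp {
		kept := make(Series, 0, len(s))
		for _, l := range s {
			if !without[l.Name] {
				kept = append(kept, l)
			}
		}
		out = append(out, kept)
	}
	sort.Slice(out, func(i, j int) bool { return compare(out[i], out[j]) < 0 })
	return out
}

// cursor points at the next unread series of one sorted store response.
type cursor struct {
	resp []Series
	pos  int
}

// mergeHeap is a min-heap of cursors ordered by their current series.
type mergeHeap []*cursor

func (h mergeHeap) Len() int           { return len(h) }
func (h mergeHeap) Less(i, j int) bool { return compare(h[i].resp[h[i].pos], h[j].resp[h[j].pos]) < 0 }
func (h mergeHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *mergeHeap) Push(x any)        { *h = append(*h, x.(*cursor)) }
func (h *mergeHeap) Pop() any {
	old := *h
	c := old[len(old)-1]
	*h = old[:len(old)-1]
	return c
}

// mergeDedup k-way merges already sorted responses. Replicas of the same
// series arrive next to each other, so duplicates are dropped while streaming.
func mergeDedup(resps ...[]Series) []Series {
	h := &mergeHeap{}
	for _, r := range resps {
		if len(r) > 0 {
			*h = append(*h, &cursor{resp: r})
		}
	}
	heap.Init(h)
	var out []Series
	for h.Len() > 0 {
		c := (*h)[0]
		s := c.resp[c.pos]
		if len(out) == 0 || compare(out[len(out)-1], s) != 0 {
			out = append(out, s)
		}
		c.pos++
		if c.pos == len(c.resp) {
			heap.Pop(h) // this response is exhausted
		} else {
			heap.Fix(h, 0) // re-position the advanced cursor
		}
	}
	return out
}

func main() {
	without := map[string]bool{"replica": true}
	storeA := []Series{
		{{"__name__", "up"}, {"replica", "a"}, {"zone", "1"}},
		{{"__name__", "up"}, {"replica", "a"}, {"zone", "2"}},
	}
	storeB := []Series{
		{{"__name__", "up"}, {"replica", "b"}, {"zone", "1"}},
		{{"__name__", "up"}, {"replica", "b"}, {"zone", "2"}},
	}
	for _, s := range mergeDedup(stripAndSort(storeA, without), stripAndSort(storeB, without)) {
		fmt.Println(s)
	}
}
```

In this example the four input series collapse into two, and no step ever sorts the combined result. For a store that holds a single replica, the local sort in `stripAndSort` does not change the order, which matches the first invariant above.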

In the initial tests we see a 60% improvement on my test data (8M series block, requests for ~200k series) with querier and store gateway.

Without this change:

![1](../img/globalsort-nonoptimized.png)

After implementing this proposal:

![2](../img/globalsort-optimized.png)

## Alternatives

1. Version StoreAPI.

As a best practice, gRPC services should be versioned. This should allow easier iterations for everybody implementing or using it. However, having multiple versions (vs. an extra feature enablement field) might make the client side more complex, so we propose to postpone it.

2. Optimization: Add "replica group" as another message in `SeriesResponse`

An extra slice in every Series might feel redundant, given that all series are always grouped within the same replica. Let's do this once we see it being a bottleneck (it will require a change in StoreAPI version).

3. Instead of removing some replica labels, just sort without them and leave them at the end.

For debugging purposes we could keep the replica labels we want to dedup on at the end of the label set.

This might, however, be a less clean way of providing better debuggability, which is not yet required.

Cons:
* Feels hacky. The proper way of preserving this information would be something like alternative 4.
* Debuggability might not be needed here - YAGNI

4. Replica label struct

We could make the Store API response fully replica aware. This means that the series response would include an extra slice of the replica labels that the series belongs to:

```protobuf
message Series {
  repeated Label labels = 1 [(gogoproto.nullable) = false, (gogoproto.customtype) = "github.com/thanos-io/thanos/pkg/store/labelpb.ZLabel"];
  repeated Label replica_labels = 3 [(gogoproto.nullable) = false, (gogoproto.customtype) = "github.com/thanos-io/thanos/pkg/store/labelpb.ZLabel"]; // Added.

  repeated AggrChunk chunks = 2 [(gogoproto.nullable) = false];
}
```

Pros:
* Easy to tell what is a replica label and what is not at the Store API client level

Cons:
* Extra code and protobuf complexity
* Semantics of replica labels are hard to maintain when partial deduplication is configured (we only dedup by part of the replica labels, not by all of them). This dynamic policy makes it hard to have a clean response with separation of replica labels (i.e. should the included replica labels be in "labels" or in "replica labels"?)

This might not be needed for now. We can add more awareness of replication later on.

## Action Plan

The tasks to do in order to migrate to the new idea.

* [X] Merging the PR with the proposal (also includes implementation)
* [X] Add support for `without_replica_labels` to other store API servers.
* [ ] Move to deduplication over chunks instead of series. See [TODO in querier.go:405](../../pkg/query/querier.go)

```go
// TODO(bwplotka): Move to deduplication on chunk level inside promSeriesSet, similar to what we have in dedup.NewDedupChunkMerger().
// This however require big refactor, caring about correct AggrChunk to iterator conversion, pushdown logic and counter reset apply.
// For now we apply simple logic that splits potential overlapping chunks into separate replica series, so we can split the work.
set := &promSeriesSet{
	mint:  q.mint,
	maxt:  q.maxt,
	set:   dedup.NewOverlapSplit(newStoreSeriesSet(resp.seriesSet)),
	aggrs: aggrs,
	warns: warns,
}
```
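
The TODO above mentions splitting potentially overlapping chunks into separate replica series. The sketch below illustrates that general idea only; it is not the code of the `dedup` package. `Chunk` and `splitOverlapping` are hypothetical names, and a chunk is reduced to its time range.

```go
package main

import (
	"fmt"
	"sort"
)

// Chunk is a hypothetical stand-in for a stored chunk; only its inclusive
// time range [MinT, MaxT] matters for this illustration.
type Chunk struct{ MinT, MaxT int64 }

// splitOverlapping distributes the chunks of one series into groups so that
// no two chunks inside a group overlap in time. Each group can then be
// treated as a separate replica series by a series-level dedup step.
func splitOverlapping(chunks []Chunk) [][]Chunk {
	sorted := append([]Chunk(nil), chunks...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].MinT < sorted[j].MinT })

	var groups [][]Chunk
	for _, c := range sorted {
		placed := false
		for i := range groups {
			last := groups[i][len(groups[i])-1]
			if c.MinT > last.MaxT { // starts after the group's last chunk ends
				groups[i] = append(groups[i], c)
				placed = true
				break
			}
		}
		if !placed {
			groups = append(groups, []Chunk{c}) // overlaps every group: open a new one
		}
	}
	return groups
}

func main() {
	groups := splitOverlapping([]Chunk{{0, 10}, {5, 15}, {11, 20}, {16, 30}})
	for i, g := range groups {
		fmt.Println("group", i, g)
	}
}
```

With the input above, the chunks fall into two groups, each free of internal overlap, so the existing series-level deduplication can treat them as two replicas of the same series.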