---
type: proposal
title: Vertical query sharding
status: accepted
owner: fpetkovski, moadz
menu: proposals-accepted
---

* **Owners:**
  * @fpetkovski
  * @moadz

## Why

The current query execution model in Thanos does not scale well with the size of a query. We therefore propose a query sharding algorithm which distributes query execution across multiple Thanos Queriers. The proposed algorithm shards queries by series (vertically) and is complementary to the (horizontal) range-splitting operation implemented in Query Frontend. Even though horizontal sharding can be useful for distributing long-range queries (12h or longer), high-cardinality metrics can cause performance issues even for short-range queries. Vertical query sharding breaks down large queries into disjoint datasets that can be retrieved and processed in parallel, minimising the need for large single-node Queriers and allowing us to pursue more effective scheduling of work on the read path.

### Pitfalls of the current solution

When executing a PromQL query, a Querier will pull series from all of its downstream Stores into memory and feed them to the Prometheus query engine. In cases where a query causes a large number of series to be fetched, executing it can lead to high memory usage in Queriers. As a result, even when more than one Querier is available, work is concentrated in a single Querier and cannot be easily distributed.
## Goals

* **Must** allow sharding of aggregation expressions across N queries, where N denotes a user-provided sharding factor
* **Must** fall back to a single query when a non-shardable query or querier is encountered in the query path
* **Could** serve as a motivation for a general query-sharding framework that can be applied to any query
* **Should not** make any changes to the current query request path (i.e. how queries are scheduled in Queriers)

### Audience

* Users who run Thanos at a large scale and would like to benefit from improved stability of the Thanos read path.

## How

### The query sharding algorithm

The query sharding algorithm takes advantage of the grouping labels provided in PromQL expressions and can be applied to the large majority of PromQL queries, namely those which aggregate timeseries *by* or *without* one or more grouping labels.

In order to illustrate how it works, we can take the example of executing the following PromQL query

```
sum by (pod) (memory_usage_bytes)
```

on the series set:

```
memory_usage_bytes{pod="prometheus-1", region="eu-1", role="apps"}
memory_usage_bytes{pod="prometheus-1", region="us-1", role="infra"}
memory_usage_bytes{pod="prometheus-2", region="eu-1", role="apps"}
memory_usage_bytes{pod="prometheus-2", region="us-1", role="infra"}
```

This query performs a sum over series for each individual pod, and is equivalent to the union of the following two queries:

```
sum by (pod) (memory_usage_bytes{pod="prometheus-1"})
```

and

```
sum by (pod) (memory_usage_bytes{pod="prometheus-2"})
```

We can therefore execute each of the two queries in a separate query shard and concatenate the results before returning them to the user.
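The equivalence described above can be checked with a small, self-contained Go sketch. The `series` type, `sumByPod` helper, and the values below are illustrative stand-ins, not Thanos types: aggregating each pod-disjoint shard separately and concatenating the per-shard results yields the same answer as aggregating the full series set.

```go
package main

import "fmt"

// series is a simplified time-series sample: a grouping label plus a value.
type series struct {
	pod   string
	value float64
}

// sumByPod aggregates values per pod, mimicking `sum by (pod) (...)`.
func sumByPod(ss []series) map[string]float64 {
	out := make(map[string]float64)
	for _, s := range ss {
		out[s.pod] += s.value
	}
	return out
}

func main() {
	all := []series{
		{"prometheus-1", 100}, {"prometheus-1", 200},
		{"prometheus-2", 300}, {"prometheus-2", 400},
	}
	// Shard 0 holds all prometheus-1 series, shard 1 all prometheus-2 series.
	shard0, shard1 := all[:2], all[2:]

	full := sumByPod(all)

	// Execute the aggregation per shard, then concatenate the results.
	merged := sumByPod(shard0)
	for pod, v := range sumByPod(shard1) {
		merged[pod] = v
	}

	fmt.Println(full["prometheus-1"] == merged["prometheus-1"],
		full["prometheus-2"] == merged["prometheus-2"]) // true true
}
```

The key precondition is that the shards are disjoint on the grouping label: no pod's series are split across shards, so no partial sums need to be combined at merge time.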
### Dynamic series partitioning

Since Queriers have no information about which timeseries are available in Stores, they cannot easily rewrite a single aggregation into two disjoint ones. They can, however, propagate information to Stores that instructs them to return only one disjoint shard of the series matching particular selectors. To achieve this, each query shard would propagate the total number of shards, its own shard index, and the grouping labels discovered in the PromQL expression. Stores would then perform a `hashmod` on the grouping labels of each series and return only those series whose `hashmod` equals the requested shard index.

In our example, the total number of shards would be 2 and the only grouping label is `pod`. The `hashmod` on the `pod` label for each series would be as follows:

```
# hash(pod=prometheus-1) mod 2 = 8848352764449055670 mod 2 = 0
memory_usage_bytes{pod="prometheus-1", region="eu-1", role="apps"}

# hash(pod=prometheus-1) mod 2 = 8848352764449055670 mod 2 = 0
memory_usage_bytes{pod="prometheus-1", region="us-1", role="infra"}

# hash(pod=prometheus-2) mod 2 = 14949237384223363101 mod 2 = 1
memory_usage_bytes{pod="prometheus-2", region="eu-1", role="apps"}

# hash(pod=prometheus-2) mod 2 = 14949237384223363101 mod 2 = 1
memory_usage_bytes{pod="prometheus-2", region="us-1", role="infra"}
```

The first two series would therefore end up in the first query shard, and the last two series in the second query shard. The Queriers will execute the query for their own subset only, and the sharding component will concatenate the results before returning them to the users.

The reason why partitioning the series set on the grouping labels works is that a grouping aggregation can be executed independently on each combination of values for the labels found in the grouping clause.
In our example, as long as series with the same `pod` label end up in the same query shard, we can safely perform the shard-and-merge strategy.

### Initiating sharded queries

The Query Frontend component already has useful splitting and merging capabilities which are currently used for sharding queries horizontally. In order to provide the best possible user experience and not burden users with running additional component(s), we propose adding a new middleware to Query Frontend which will implement the vertical sharding algorithm proposed here, after the step-alignment and horizontal time-slicing steps. This would allow users to restrict the maximum complexity of a given query based on vertical sharding of already time-sliced queries. The only new user-specified parameter would be the number of shards into which to split PromQL aggregations.

Integrating vertical query sharding in Query Frontend also has the added benefits of:

* Using sharding for instant queries, and by extension, for alerting and recording rules
* Utilizing the existing caching implementation
* Sharding a query vertically, after it has already been time-sliced, reduces the cardinality of each respective query shard

The following diagram illustrates how the sharding algorithm would work end to end:

![Vertical sharding](../img/vertical-sharding.png "Vertical query sharding")

### Drawbacks & Exceptions

*Not all queries are easily shardable.* There are certain aggregations for which sharding cannot be performed safely. For these cases the sharder can simply fall back to executing the expression as a single query. These cases include the use of functions such as `label_replace` and `label_join`, which create new labels inside PromQL while the query is being executed. Since such labels can be created arbitrarily during query execution, Stores will be unaware of them and cannot take them into account when sharding the matching series set.
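To make the `hashmod`-based partitioning from the "Dynamic series partitioning" section concrete, here is a minimal Go sketch. It is illustrative only: `hashmodShard` is a hypothetical helper, it hashes a single grouping label, and it uses stdlib FNV-1a so the example is self-contained, whereas Thanos hashes label sets with xxhash in practice.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashmodShard assigns a series to a shard based solely on the value of a
// grouping label. Series sharing the same grouping-label value always land
// on the same shard, which is what makes per-shard aggregation safe.
// NOTE: illustrative sketch; Thanos uses xxhash over full label sets.
func hashmodShard(labelName, labelValue string, totalShards uint64) uint64 {
	h := fnv.New64a()
	h.Write([]byte(labelName))
	h.Write([]byte{0}) // separator avoids ambiguous name/value concatenations
	h.Write([]byte(labelValue))
	return h.Sum64() % totalShards
}

func main() {
	// Both prometheus-1 series hash identically, so they always share a shard;
	// likewise for the prometheus-2 series.
	for _, pod := range []string{"prometheus-1", "prometheus-2"} {
		fmt.Printf("pod=%s -> shard %d of 2\n", pod, hashmodShard("pod", pod, 2))
	}
}
```

A Store evaluating a sharded Series request would apply this function to each matching series and return only those whose result equals the requested shard index.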
*TODO: Mention PromQL parsing/transformation to increase the set of queries that can be sharded.*

*Impact of sharded queries on block retrieval.* Given that queries are sharded based on series, and not time, for a given leaf most metrics for each independent shard will likely reside in the same block(s). This can multiply the cost of retrieving series from a given block, since the block has to be traversed N times, where N is the number of distinct shards.

In our sharded query benchmarks, we found that the increased volume of Series calls to Store Gateways and Receivers did not lead to a significant latency increase. However, we did notice higher remote-read latency in Prometheus, caused by multiple remote-read calls for the same Series call. This is due to the fact that the sharding has to be done in the sidecar, after retrieving all matching series for a query. A similar concern was raised in Prometheus already, with a proposal to support sharding series natively in the Prometheus TSDB (https://github.com/prometheus/prometheus/issues/10420).

Another concern we identified with Store Gateway is the impact of sharded queries on postings and chunk lookups. Whenever a Store Gateway receives a Series call, it will first retrieve postings for the given label matchers (from cache), merge those postings, and determine which blocks/chunks it needs to stream to facilitate the query. The potential issue sharding could introduce is an excess number of index lookups and subsequent downloading of the same chunk multiple times. This has not been measured for the spike, so we are unsure what the impact will be.

### Further improvements

In the future, additional improvements can be made to Thanos so that TSDB blocks are sharded in the background during the compaction process. Each block would be sharded into several smaller blocks, each with its own `shard_id` label attached to it.
Store Gateways would then use this label to avoid doing a `hashmod` over each series at query time.

## Alternatives

*Query pushdown.* Query pushdown involves pushing down the entire query to leaves for evaluation. This avoids the primary contributor to query latency (Store API Select over the network) but can only work on a limited set of queries, as there are no guarantees that duplicate series will not be double-counted in disparate leaves. Vertical query sharding is a natural evolution of this idea that handles deduplication by guaranteeing that each unique series in a query will always end up on the same Querier. A version of this has already been implemented (https://github.com/thanos-io/thanos/pull/4917).

*Extended horizontal sharding.* Thanos Query Frontend already does horizontal (time-range based) sharding by splitting up queries into smaller time ranges. As an alternative to vertical sharding, more granular horizontal sharding could be implemented to split a query between Queriers at smaller increments. This does not have the same deduplication correctness problem as query pushdown, as overlapping time ranges can be effectively deduplicated. This implementation, however, presents other challenges around aligning queries in such a way that a sample on the threshold is guaranteed not to be double-counted in distinct shards. Generally, time ranges are also more complex to handle than a simple hashmod of label-value pairs, as the scrape interval can be different for each metric. Horizontal sharding also does not address cardinality; instead, sharding is based on the number of samples over a time range, which are already highly compressible.

*Implement a streaming PromQL engine.* Vertical query sharding allows breaking down large queries into disjoint datasets that can be queried and processed in parallel, minimising the need for large single-node Queriers and allowing us to pursue more effective scheduling of work on the read path.
This is needed because PromQL is mostly single-threaded and requires the entire set of expanded series before evaluation. A similar effect to vertical sharding could be achieved if PromQL itself supported streaming query evaluation, allowing us to limit and parallelise retrieval and execution on a single node. This was discussed in upstream Prometheus (https://github.com/prometheus/prometheus/issues/7326).

### Reference implementation and benchmarks

A reference implementation is available as a draft PR: https://github.com/thanos-io/thanos/pull/5342.

The key component is the `QueryAnalyzer`, which traverses the PromQL AST and extracts the labels on which the dataset can be sharded. A good way to understand how it works is to look at its test cases: https://github.com/thanos-io/thanos/pull/5342/files#diff-025e491f39aac710d300ae708cfaa09d6bf5929ea4b4ce60f4b9e0f0a179e67fR10. Individual stores then use the labels to shard the series and return only one part of the resulting dataset: https://github.com/thanos-io/thanos/pull/5342/files#diff-3e2896fafa6ff73509c77df2c4389b68828e02575bb4fb78b6c34bcfb922a7ceR828-R835

Using the reference implementation, we benchmarked query execution time and memory usage of Queriers. We synthesized a dataset of 100,000 series with two labels: a `cluster` label with 100 values and a `pod` label with 1,000 values. The program used to generate the dataset, as well as the manifests to deploy the reference implementation, is available in a GitHub repository: https://github.com/fpetkovski/thanos-sharding-bench

We then ran the following query on the reference dataset for 10-15 minutes: `sum by (pod) (http_requests_total)`

The memory usage of Queriers with and without sharding was ~650MB and ~1.5GB respectively, as shown in the screenshots below.
Memory usage with sharding:

<img src="../img/memory-with-sharding.png" alt="Memory usage with sharding" width="600"/>

Memory usage without sharding:

<img src="../img/memory-without-sharding.png" alt="Memory usage without sharding" width="600"/>

We also found that the sharded implementation reduced query execution time from ~10s to ~5s.

Latency with sharding:

<img src="../img/latency-with-sharding.png" alt="Latency with sharding" width="600"/>

Latency without sharding:

<img src="../img/latency-without-sharding.png" alt="Latency without sharding" width="600"/>