
---
type: proposal
title: Vertical query sharding
status: accepted
owner: fpetkovski, moadz
menu: proposals-accepted
---

* **Owners:**
  * @fpetkovski
  * @moadz

## Why

The current query execution model in Thanos does not scale well with the size of a query. We therefore propose a query sharding algorithm which distributes query execution across multiple Thanos Queriers. The proposed algorithm shards queries by series (vertically) and is complementary to the (horizontal) range-splitting operation implemented in Query Frontend. Even though horizontal sharding can be useful for distributing long-range queries (12h or longer), high-cardinality metrics can still cause performance issues even for short-range queries. Vertical query sharding breaks down large queries into disjoint datasets that can be retrieved and processed in parallel, minimising the need for large single-node Queriers and allowing us to pursue more effective scheduling of work on the read path.

### Pitfalls of the current solution

When executing a PromQL query, a Querier will pull series from all of its downstream Stores into memory and feed them to the Prometheus query engine. In cases where a query causes a large number of series to be fetched, executing it can lead to high memory usage in Queriers. As a result, even when more than one Querier is available, work is concentrated in a single Querier and cannot be easily distributed.

## Goals

* **Must** allow the sharding of aggregation expressions across N queries, where N denotes a user-provided sharding factor
* **Must** fall back to a single query when a non-shardable query or querier is encountered in the query path
* **Could** serve as a motivation for a general query-sharding framework that can be applied to any query
* **Should not** make any changes to the current query request path (i.e. how queries are scheduled in Queriers)

### Audience

* Users who run Thanos at a large scale and would like to benefit from improved stability of the Thanos read path.

## How

### The query sharding algorithm

The query sharding algorithm takes advantage of the grouping labels provided in PromQL expressions and can be applied to the large majority of PromQL queries which aggregate timeseries *by* or *without* one or more grouping labels.

To illustrate how it works, consider executing the following PromQL query

```
sum by (pod) (memory_usage_bytes)
```

on the series set:

```
memory_usage_bytes{pod="prometheus-1", region="eu-1", role="apps"}
memory_usage_bytes{pod="prometheus-1", region="us-1", role="infra"}
memory_usage_bytes{pod="prometheus-2", region="eu-1", role="apps"}
memory_usage_bytes{pod="prometheus-2", region="us-1", role="infra"}
```

This query performs a sum over series for each individual pod, and is equivalent to the union of the following two queries:

```
sum by (pod) (memory_usage_bytes{pod="prometheus-1"})
```

and

```
sum by (pod) (memory_usage_bytes{pod="prometheus-2"})
```

We can therefore execute each of the two queries in a separate query shard and concatenate the results before returning them to the user.

### Dynamic series partitioning

Since Queriers have no information about which timeseries are available in Stores, they cannot easily rewrite a single aggregation into two disjoint ones. They can, however, propagate information to Stores instructing them to return only one disjoint shard of the series that match particular selectors. To achieve this, each query shard would propagate the total number of shards, its own shard index, and the grouping labels discovered in the PromQL expression. Stores would then perform a `hashmod` on the grouping labels of each series and return only those series whose `hashmod` equals the requested shard index.

In our example, the total number of shards would be 2 and the only grouping label is `pod`. The `hashmod` on the `pod` label for each series would be as follows:

```
# hash(pod=prometheus-1) mod 2 = 8848352764449055670 mod 2 = 0
memory_usage_bytes{pod="prometheus-1", region="eu-1", role="apps"}

# hash(pod=prometheus-1) mod 2 = 8848352764449055670 mod 2 = 0
memory_usage_bytes{pod="prometheus-1", region="us-1", role="infra"}

# hash(pod=prometheus-2) mod 2 = 14949237384223363101 mod 2 = 1
memory_usage_bytes{pod="prometheus-2", region="eu-1", role="apps"}

# hash(pod=prometheus-2) mod 2 = 14949237384223363101 mod 2 = 1
memory_usage_bytes{pod="prometheus-2", region="us-1", role="infra"}
```

The first two series would therefore end up in the first query shard, and the other two in the second query shard. The Queriers will execute the query for their own subset only, and the sharding component will concatenate the results before returning them to the users.

Partitioning the series set on the grouping labels works because a grouping aggregation can be executed independently on each combination of values for the labels found in the grouping clause. In our example, as long as series with the same `pod` label end up in the same query shard, we can safely apply the shard-and-merge strategy.

### Initiating sharded queries

The Query Frontend component already has useful splitting and merging capabilities which are currently used for sharding queries horizontally. In order to provide the best possible user experience and not burden users with running additional components, we propose adding a new middleware to Query Frontend which will implement the vertical sharding algorithm proposed here, after the step-alignment and horizontal time-slicing steps. This would allow users to restrict the maximum complexity of a given query based on vertical sharding of already time-sliced queries. The only new user-specified parameter would be the number of shards in which to split PromQL aggregations.

Integrating vertical query sharding in Query Frontend also has the added benefits of:
* Using sharding for instant queries, and by extension, for alerting and recording rules
* Utilizing the existing caching implementation
* Reducing the cardinality of each respective query shard, since a query is sharded vertically after it has already been time-sliced

The following diagram illustrates how the sharding algorithm would work end to end:

![Vertical sharding](../img/vertical-sharding.png "Vertical query sharding")

### Drawbacks & Exceptions

*Not all queries are easily shardable.* There are certain aggregations for which sharding cannot be performed safely. In these cases the sharder can simply fall back to executing the expression as a single query. They include the use of functions such as `label_replace` and `label_join`, which create new labels inside PromQL while the query is being executed. Since such labels can be created arbitrarily during query execution, Stores are unaware of them and cannot take them into account when sharding the matching series set.

#TODO Mention promql parsing/transformation to increase queries that can be sharded

*Impact of sharded queries on block retrieval.* Given that queries are sharded based on series, and not time, for a given leaf most metrics for each independent shard will likely reside in the same block(s). This can multiply the cost of retrieving series from a given block, since the block has to be traversed N times, where N is the number of distinct shards.

In our sharded query benchmarks, we found that the increased volume of Series calls to Store Gateways and Receivers did not lead to a significant latency increase. However, we did notice higher remote-read latency in Prometheus, caused by multiple remote-read calls for the same Series call. This is because sharding has to be done in the sidecar, after retrieving all matching series for a query. A similar concern was raised in Prometheus already, with a proposal to support sharding series natively in the Prometheus TSDB (https://github.com/prometheus/prometheus/issues/10420).

Another concern we identified with Store Gateway is the impact of sharded queries on postings and chunk lookups. Whenever a Store Gateway receives a Series call, it will first retrieve postings for the given label matchers (from cache), merge those postings, and determine which blocks/chunks it needs to stream to facilitate the query. The potential issues sharding could introduce are an excess number of index lookups and subsequent downloading of the same chunk multiple times. This has not been measured for the spike, so we are unsure what the impact will be.

### Further improvements

In the future, additional improvements can be made to Thanos so that TSDB blocks are sharded in the background during the compaction process. Each block could be sharded into several smaller blocks, each with its own `shard_id` label attached to it. Store Gateways would then use this label to avoid doing a `hashmod` over each series at query time.

## Alternatives

*Query Pushdown.* Query pushdown involves pushing down the entire query to leaves for evaluation. This avoids the primary contributor to query latency (Store API Select over the network) but can only work on a limited set of queries, as there are no guarantees that duplicate series will not be double-counted in disparate leaves. Vertical query sharding is a natural evolution of this idea that handles deduplication by guaranteeing that each unique series in a query will always end up on the same Querier. A version of this has already been implemented (https://github.com/thanos-io/thanos/pull/4917).

*Extended horizontal sharding.* Thanos Query Frontend already does horizontal (time-range based) sharding by splitting up queries into smaller time ranges. As an alternative to vertical sharding, more granular horizontal sharding could be implemented to split a query up between Queriers at smaller increments. This does not have the same deduplication correctness problem as query pushdown, as overlapping time ranges can be effectively deduplicated. This implementation, however, presents other challenges around aligning queries in such a way that a sample on the threshold is guaranteed not to be double-counted in distinct shards. Generally, time ranges are also more complex to handle than a simple hashmod of label-value pairs, as the scrape interval can be different for each metric. Horizontal sharding also does not address cardinality; instead, sharding is based on the number of samples over a time range, which are already highly compressible.

*Implement a streaming PromQL engine.* Vertical query sharding makes it possible to break down large queries into disjoint datasets that can be queried and processed in parallel, minimising the need for large single-node Queriers and allowing us to pursue more effective scheduling of work on the read path. This is needed because PromQL is mostly single-threaded and requires the entire set of expanded series before evaluation. A similar effect to vertical sharding could be achieved if PromQL itself supported streaming query evaluation, allowing us to limit and parallelise retrieval and execution on a single node. This was discussed in upstream Prometheus (https://github.com/prometheus/prometheus/issues/7326).

### Reference implementation and benchmarks

A reference implementation is available as a draft PR: https://github.com/thanos-io/thanos/pull/5342.

The key component is the `QueryAnalyzer`, which traverses the PromQL AST and extracts the labels on which the dataset can be sharded. A good way to understand how it works is to look at its test cases: https://github.com/thanos-io/thanos/pull/5342/files#diff-025e491f39aac710d300ae708cfaa09d6bf5929ea4b4ce60f4b9e0f0a179e67fR10. Individual Stores then use the labels to shard the series and return only one part of the resulting dataset: https://github.com/thanos-io/thanos/pull/5342/files#diff-3e2896fafa6ff73509c77df2c4389b68828e02575bb4fb78b6c34bcfb922a7ceR828-R835

Using the reference implementation, we benchmarked query execution and memory usage of Queriers. We synthesized a dataset of 100,000 series with two labels: a `cluster` label with 100 values and a `pod` label with 1000 values. The program used to generate the dataset, as well as the manifests to deploy the reference implementation, is available in a GitHub repository: https://github.com/fpetkovski/thanos-sharding-bench

We then ran the following query on the reference dataset for 10-15 minutes: `sum by (pod) (http_requests_total)`

The memory usage of Queriers with and without sharding was ~650MB and ~1.5GB respectively, as shown in the screenshots below.

Memory usage with sharding:

<img src="../img/memory-with-sharding.png" alt="Memory usage with sharding" width="600"/>

Memory usage without sharding:

<img src="../img/memory-without-sharding.png" alt="Memory usage without sharding" width="600"/>

We also found that the sharded implementation reduced query execution time from ~10s to ~5s.

Latency with sharding:

<img src="../img/latency-with-sharding.png" alt="Latency with sharding" width="600"/>

Latency without sharding:

<img src="../img/latency-without-sharding.png" alt="Latency without sharding" width="600"/>