---
type: proposal
title: Distributed Query Execution
status: accepted
owner: fpetkovski
menu: proposals-accepted
---

## 1 Related links/tickets

* https://github.com/thanos-io/thanos/pull/5250
* https://github.com/thanos-io/thanos/pull/4917
* https://github.com/thanos-io/thanos/pull/5350
* https://github.com/thanos-community/promql-engine/issues/25

## 2 Why

Thanos Queriers currently need to pull in all data from Stores into memory before they can start evaluating a query. This has a large impact on memory usage inside a single querier, and drastically increases query execution latency.

Even when a Querier is connected to other Queriers, it will still pull raw series instead of delegating parts of the execution to its downstreams. This document proposes a mode in the Thanos Querier where it will dispatch parts of the execution plan to different, independent Queriers.

## 3 Pitfalls of current solutions

We have two mechanisms in Thanos to distribute queries among different components.

Query pushdown is a mechanism enabled by query hints which allows a Thanos sidecar to execute certain queries against Prometheus as part of a `Series` call. Since data is usually replicated in at least two Prometheus instances, the subset of queries that can be pushed down is quite limited. In addition to that, this approach has introduced additional complexity in the deduplication iterator to allow the Querier to distinguish between storage series and PromQL series.

Query Sharding is an execution method initiated by the Query Frontend which allows an aggregation query with grouping labels to be distributed to different Queriers. Even though the number of queries that can be sharded is larger than the ones that can be pushed down, query sharding still has limited applicability since a query has to contain grouping labels. We have also noticed in practice that the execution latency does not fall linearly with the number of vertical shards, and often plateaus at around 4 shards. This is especially pronounced when querying data from Store Gateways, likely due to the amplification of `Series` calls against Store components.

## 4 Audience

* Thanos users who have challenges with evaluating PromQL queries due to high cardinality.

## 5 Goals

* Enable decentralized query execution by delegating query plan fragments to independent Queriers.

## 6 Proposal

The key advantage of distributed execution is the fact that the number of series is drastically reduced when a query contains an aggregation operator (`sum`, `group`, `max`, etc.). Most (if not all) high-cardinality PromQL queries are in fact aggregations, since users will struggle to sensibly visualise more than a handful of series.

We therefore propose an execution model that allows running a Thanos Querier in a mode where it transforms a query into subqueries which are delegated to independent Queriers, and a central aggregation that is executed locally on the result of all subqueries. A simple example of this transformation is a `sum(rate(metric[2m]))` expression which can be transformed into

```
sum(
  coalesce(
    sum(rate(metric[2m])),
    sum(rate(metric[2m]))
  )
)
```

### How

The proposed method of transforming the query is to extend the Thanos Engine with a logical optimizer that holds references to other query engines. An example API could look as follows:

```
type DistributedExecutionOptimizer struct {
	Endpoints api.RemoteEndpoints
}

type RemoteEndpoints interface {
	Engines() []RemoteEngine
}

type RemoteEngine interface {
	NewInstantQuery(opts *promql.QueryOpts, qs string, ts time.Time) (promql.Query, error)
	NewRangeQuery(opts *promql.QueryOpts, qs string, start, end time.Time, interval time.Duration) (promql.Query, error)
}
```

The implementation of the `RemoteEngine` will be provided by Thanos itself and will use the gRPC Query API added in [https://github.com/thanos-io/thanos/pull/5250](https://github.com/thanos-io/thanos/pull/5250).
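
As an illustration, a static implementation of `RemoteEndpoints` could look like the sketch below. It is assumed to live in the same package as the interfaces above; the individual engines it serves would, in Thanos, wrap clients for the gRPC Query API of each remote Querier.

```
// staticEndpoints serves a fixed list of remote engines, e.g. one per
// remote Querier known at startup.
type staticEndpoints struct {
	engines []RemoteEngine
}

func (s staticEndpoints) Engines() []RemoteEngine {
	return s.engines
}

// NewStaticEndpoints wraps pre-built remote engines. In Thanos, each engine
// would be backed by the gRPC Query API of one remote Querier.
func NewStaticEndpoints(engines ...RemoteEngine) RemoteEndpoints {
	return staticEndpoints{engines: engines}
}
```

The optimizer would then be constructed as `DistributedExecutionOptimizer{Endpoints: NewStaticEndpoints(engineA, engineB)}`, where `engineA` and `engineB` are hypothetical clients for two remote Queriers.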

Keeping PromQL execution in Query components allows for deduplication between Prometheus pairs to happen before series are aggregated.

<img src="../img/distributed-execution-proposal-1.png" alt="Distributed query execution" width="400"/>

The initial version of the solution can be found here: https://github.com/thanos-community/promql-engine/pull/139

### Query rewrite algorithm

As described in the section above, the query will be rewritten using a logical optimizer into a form that is suitable for distributed execution.

The proposed algorithm is as follows:
* Start AST traversal from the bottom up.
* If both the current node and its parent can be distributed, move up to the parent.
* If the current node can be distributed and its parent cannot, rewrite the current node into its distributed form.
* If the current node cannot be distributed, stop traversal.

With this algorithm we try to distribute as much of the PromQL query as possible. Furthermore, even queries without aggregations, like `rate(http_requests_total[2m])`, will be rewritten into

```
coalesce(
  rate(http_requests_total[2m]),
  rate(http_requests_total[2m])
)
```

Since PromQL queries are limited in the number of steps they can evaluate, this algorithm effectively achieves downsampling at query time: only a small number of samples will be sent from local Queriers to the central one.
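
A minimal sketch of the rewrite algorithm described above is shown below. It is written against a simplified stand-in for the logical plan; the `Node` type and the `canDistribute`/`makeDistributed` callbacks are illustrative assumptions, not the actual optimizer API.

```
// Node is a minimal stand-in for a logical plan node.
type Node struct {
	Expr     string
	Children []*Node
}

// distribute walks the plan bottom-up and rewrites the highest distributable
// node on each path into its distributed (coalesce) form. canDistribute and
// makeDistributed stand in for the optimizer's real checks and for building
// the coalesce of remote subqueries.
func distribute(node, parent *Node, canDistribute func(*Node) bool, makeDistributed func(*Node) *Node) *Node {
	for i, child := range node.Children {
		node.Children[i] = distribute(child, node, canDistribute, makeDistributed)
	}
	if !canDistribute(node) {
		// Current node cannot be distributed: stop here. Distributable
		// children were already rewritten by the recursive calls above.
		return node
	}
	if parent != nil && canDistribute(parent) {
		// Both this node and its parent can be distributed: defer the
		// rewrite to the parent so that as much of the query as possible
		// runs remotely.
		return node
	}
	// This node can be distributed but its parent cannot (or it is the
	// root): rewrite it into coalesce(remote subquery, remote subquery, ...).
	return makeDistributed(node)
}
```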

### Time-based overlap resolution

Thanos stores usually have a small overlap with ingestion components (Prometheus/Receiver) due to the eventual consistency of uploading and downloading TSDB blocks. As a result, the central aggregation needs a way to deduplicate samples between ingestion and storage components.

The proposed way to do time-based deduplication is to remove identical samples in the `coalesce` operator in the Thanos Engine itself. In order for data from independent Queriers not to get deduplicated, aggregations happening in remote engines must always preserve the external labels of the TSDB blocks being queried.

To illustrate this with an example, we can assume that we have two clusters `a` and `b`, each being monitored with a Prometheus pair and with each Prometheus instance having an external `cluster` label. The query `sum(rate(metric[2m]))` would then be rewritten by the optimizer into:

```
sum(
  coalesce(
    sum by (cluster) (rate(metric[2m])),
    sum by (cluster) (rate(metric[2m]))
  )
)
```

Each subquery would preserve the external `cluster` label, which allows the `coalesce` operator to deduplicate only those samples which are calculated from the same TSDB blocks. External labels can be propagated to the central engine by extending the `RemoteEngine` interface with a `Labels() []string` method. With this approach, local Queriers can be spread as widely as needed, with the extreme case of having one Querier per deduplicated TSDB block.
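
Restating the interface from the How section, the extension could look as follows:

```
type RemoteEngine interface {
	// Labels returns the names of the external labels present on the TSDB
	// blocks behind this engine, so that the central engine knows which
	// labels are preserved by remote aggregations.
	Labels() []string

	NewInstantQuery(opts *promql.QueryOpts, qs string, ts time.Time) (promql.Query, error)
	NewRangeQuery(opts *promql.QueryOpts, qs string, start, end time.Time, interval time.Duration) (promql.Query, error)
}
```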

<img src="../img/distributed-execution-proposal-2.png" alt="Distributed query execution" width="400"/>

## Deployment models

With this approach, a Thanos admin can arrange remote Queriers in an arbitrary way, as long as TSDB replicas are always queried by only one remote Querier. The following deployment models can be used as examples:

#### Monitoring different environments with Prometheus pairs

In this deployment mode, remote queriers are attached to pairs of Prometheus instances. The central Querier delegates subqueries to them and performs a central aggregation of results.

<img src="../img/distributed-execution-proposal-4.png" alt="Distributed query execution" width="400"/>

#### Querying separate Store Gateways and Prometheus pairs

Remote Queriers can be attached to Prometheus pairs and Store Gateways at the same time. The central querier delegates subqueries and deduplicates overlapping results before performing a central aggregation.

<img src="../img/distributed-execution-proposal-3.png" alt="Distributed query execution" width="400"/>

#### Running remote Queriers as Store Gateway sidecars

Remote Queriers can be attached to disjoint groups of Store Gateways. They can even be attached to individual Store Gateways which either have deduplicated TSDB blocks or hold all replicas of a TSDB block. This makes sure penalty-based deduplication happens in the remote querier.

Store groups can be created either by partitioning TSDBs by time (time-based partitioning) or by external labels. Both of these techniques are documented in the [Store Gateway documentation](https://thanos.io/tip/components/store.md/#time-based-partitioning).
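
For example, time-based partitioning into two Store Gateway groups could look roughly as follows (illustrative flag values; see the linked documentation for the exact semantics):

```
# Store Gateway group A: only serves blocks with data older than 8 weeks.
--max-time=-8w

# Store Gateway group B: only serves blocks with data newer than 8 weeks.
--min-time=-8w
```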

<img src="../img/distributed-execution-proposal-5.png" alt="Distributed query execution" width="400"/>

### Distributed execution against Receive components

We currently lack a mechanism to configure a Querier against a subset of TSDBs, unless that Querier is exclusively attached to Stores that have those TSDBs. In the case of Receivers, TSDBs are created and pruned dynamically, which makes it hard to apply the distributed query model against this component.

To resolve this issue, this proposal suggests adding a `selector.relabel-config` command-line flag to the Query component that works the same way as the Store Gateway selector. For each query, the Querier will apply the given relabel config against each Store's external label set and decide whether to keep or drop a TSDB from the query. After the relabeling is applied, the query will be rewritten to target only those TSDBs that match the selector.

An example config that only targets TSDBs with the external label `tenant=a` would be:

```
- source_labels: [tenant]
  action: keep
  regex: a
```

With this mechanism, a user can run a pool of Queriers with a selector config as follows:

```
- source_labels: [ext_label_a, ext_label_b]
  action: hashmod
  target_label: query_shard
  modulus: ${query_shard_replicas}
- action: keep
  source_labels: [query_shard]
  regex: ${query_shard_instance}
```

<img src="../img/distributed-execution-proposal-6.png">

This approach can also be used to create Querier shards against Store Gateways, or any other pool of Store components.

## 7 Alternatives

A viable alternative to the proposed method is to add support for Query Pushdown in the Thanos Querier. Using the approach described in https://github.com/thanos-io/thanos/issues/5984, we can decide to execute a query in a local Querier, similar to how the sidecar does that against Prometheus.

Even though this approach might be faster to implement, it might not be the best long-term solution for several reasons. To some extent, Query Pushdown misuses the `Series` API, and the Querier requesting series is not aware that the query was actually executed. This can be problematic for distributing something like `count(metric)`, since the distributed version should end up as:

```
sum(
  coalesce(
    count(metric),
    count(metric)
  )
)
```

The root querier would need to know that downstream queriers have already executed the `count` and should convert the aggregation into a `sum`.

A similar problem can happen with a `sum(rate(metric[2m]))` expression where downstream queriers calculate the `sum` over the metric's `rate`. In order for the values to not get rated twice, either the downstream queriers need to invert the rate into a cumulative value, or the central querier needs to omit the rate and only calculate the sum.

Managing this complexity in Thanos itself seems error-prone and hard to maintain over time. As a result, this proposal suggests localizing the complexity in a single logical optimizer, as described in the sections above.

Depending on the success of the distributed execution model, we can also fully deprecate query pushdown and query sharding and replace them with a single mechanism that can evolve and improve over time.