---
title: "Scalable Query Frontend"
linkTitle: "Scalable Query Frontend"
weight: 1
slug: scalable-query-frontend
---

- Author: [Joe Elliott](https://github.com/joe-elliott)
- Date: April 2020
- Status: Proposed

## Overview

This document describes the [role](#query-frontend-role) that the Cortex query frontend plays in running multitenant Cortex at scale. It also describes the [challenges](#challenges-and-proposals) of horizontally scaling the query frontend component and includes several recommendations and options for creating a reliably scalable query frontend. Finally, we conclude with a discussion of the overall philosophy of the changes and propose an [alternative](#alternative).

For the original design behind the query frontend, see the [Cortex Query Optimisations design doc from 2018-07](https://docs.google.com/document/d/1lsvSkv0tiAMPQv-V8vI2LZ8f4i9JuTRsuPI_i-XcAqY).

## Reasoning

Query frontend scaling is becoming increasingly important for two primary reasons.

First, the Cortex team is working toward a scalable single binary solution. Recently the query-frontend was [added](https://github.com/cortexproject/cortex/pull/2437) to the Cortex single binary mode and, therefore, needs to scale seamlessly. Technically, nothing immediately breaks when scaling the query-frontend, but there are a number of concerns detailed in [Challenges And Proposals](#challenges-and-proposals).

Second, as the query-frontend continues to [support additional features](https://github.com/cortexproject/cortex/pull/1878) it will start to become a bottleneck of the system. Current wisdom is to run very few query-frontends in order to maximize [tenancy fairness](#tenancy-fairness), but as more features are added, scaling horizontally will become necessary.

## Query Frontend Role

### Load Shedding

The query frontend maintains a queue per tenant of configurable length (default 100) in which it stores that tenant's pending requests. If this queue fills up, the frontend returns 429s, shedding load from the rest of the system.

This is particularly effective due to the “pull” based model in which queriers pull requests from query frontends.

### Query Retries

The query frontend is capable of retrying a query on another querier if the first should fail due to OOM or network issues.

### Sharding/Parallelization

The query frontend shards requests by interval and [other factors](https://github.com/cortexproject/cortex/pull/1878) to concurrently run a single query across multiple queriers.

### Query Alignment/Caching

Queries are aligned to their own step and then stored in and retrieved from the cache.

### Tenancy Fairness

By maintaining one queue per tenant, the frontend gives a low demand tenant the same opportunity to have a query serviced as a high demand tenant. See [Dilutes Tenant Fairness](#dilutes-tenant-fairness) for additional discussion.

For clarity, tenancy fairness only comes into play when queries are actually being queued in the query frontend. Currently this rarely occurs, but as [query sharding](https://github.com/cortexproject/cortex/pull/1878) becomes more aggressive this may become the norm.
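To make the per-tenant queueing and load shedding described in this section concrete, the sketch below shows one way such a bounded per-tenant queue could look. It is illustrative only, not the actual frontend implementation; the names (`tenantQueues`, `ErrTooManyRequests`) and the locking scheme are assumptions.

```go
package frontend

import (
	"errors"
	"sync"
)

// ErrTooManyRequests is surfaced to the client as an HTTP 429.
var ErrTooManyRequests = errors.New("too many outstanding requests")

type request struct{} // placeholder for the queued query

// tenantQueues is a minimal sketch of one bounded queue per tenant.
type tenantQueues struct {
	mtx          sync.Mutex
	maxPerTenant int                   // configurable per-tenant queue length, default 100
	queues       map[string][]*request // one queue per tenant ID
}

// enqueue adds a request to the tenant's queue, failing fast (load shedding)
// when that tenant's queue is already full.
func (q *tenantQueues) enqueue(tenantID string, r *request) error {
	q.mtx.Lock()
	defer q.mtx.Unlock()
	if len(q.queues[tenantID]) >= q.maxPerTenant {
		return ErrTooManyRequests
	}
	q.queues[tenantID] = append(q.queues[tenantID], r)
	return nil
}
```

Because every tenant gets its own queue with its own capacity, a single heavy tenant fills only its own queue and starts receiving 429s without affecting other tenants.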
## Challenges And Proposals

### Dynamic Querier Concurrency

#### Challenge

For every query frontend the querier adds a [configurable number of goroutines](https://github.com/cortexproject/cortex/blob/50f53dba8f8bd5f62c0e85cc5d85684234cd1c1c/pkg/querier/frontend/worker.go#L146) which are each capable of executing a query. Therefore, scaling the query frontend impacts the amount of work each individual querier is attempting to do at any given time.

Scaling up may cause a querier to attempt more work than it is configured for, given restrictions such as memory and cpu limits. Additionally, the promql engine itself is limited in the number of queries it can execute concurrently, as configured by the `-querier.max-concurrent` parameter. Attempting more queries concurrently than this value causes the queries to queue up in the querier itself.

For similar reasons, scaling down the query frontend may cause a querier to not use its allocated memory and cpu effectively. This will lower effective resource utilization. Also, because individual queriers will be doing less work, this may cause increased queueing in the query frontends.

#### Proposal

Currently queriers are configured to have a [max parallelism per query frontend](https://github.com/cortexproject/cortex/blob/50f53dba8f8bd5f62c0e85cc5d85684234cd1c1c/pkg/querier/frontend/worker.go#L146). An additional “total max concurrency” flag should be added.

Total max concurrency would then be evenly divided amongst all available query frontends. This would decouple the amount of work a querier is attempting to do from the number of query frontends that happen to exist at the moment. Consequently this would allow allocated resources (e.g. k8s cpu/memory limits) to remain balanced with the work the querier is attempting as the query frontend is scaled up or down.

A [PR](https://github.com/cortexproject/cortex/pull/2456) has already been merged to address this.
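The sketch below illustrates how such a split could work. It is a simplification of the idea rather than the merged implementation; the function and parameter names are placeholders.

```go
package worker

// concurrencyPerFrontend is an illustrative sketch of dividing a querier's
// total max concurrency across the frontends it currently knows about. It is
// not the merged implementation; names and rounding behavior are assumptions.
func concurrencyPerFrontend(totalMaxConcurrency, numFrontends int) int {
	if numFrontends <= 0 {
		return 0
	}
	perFrontend := totalMaxConcurrency / numFrontends
	if perFrontend < 1 {
		// More frontends than total concurrency: keep at least one
		// connection to every frontend so none are starved (see the
		// next section for the flow-control consequences of this).
		perFrontend = 1
	}
	return perFrontend
}
```

The key property is that the querier's total attempted work stays roughly constant as frontends are added or removed; only the per-frontend share changes.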
### Overwhelming PromQL Concurrency

#### Challenge

If #frontends > promql concurrency, then the queriers are incapable of devoting even a single worker to each query frontend without risking queueing in the querier. Queueing in the querier is a highly undesirable state and one of the primary reasons the query frontend was originally created.

#### Proposal

When #frontends > promql concurrency, each querier will maintain [exactly one connection](https://github.com/cortexproject/cortex/blob/8fb86155a7c7c155b8c4d31b91b267f9631b60ba/pkg/querier/frontend/worker.go#L194-L200) to every frontend. As the query frontend is [currently coded](https://github.com/cortexproject/cortex/blob/8fb86155a7c7c155b8c4d31b91b267f9631b60ba/pkg/querier/frontend/frontend.go#L279-L332) it will attempt to use every open GRPC connection to execute a query in the attached queriers. Therefore, in this situation where #frontends > promql concurrency, the querier is exposing itself to more work than it is actually configured to perform.

To prevent this we will add “flow control” information to the [ProcessResponse message](https://github.com/cortexproject/cortex/blob/master/pkg/querier/frontend/frontend.proto#L21) that is used to return query results from the querier to the query frontend. In an active system this message is passed multiple times per second from the queriers to the query frontends and would be a reliable way for the frontends to track the state of queriers and balance load.

There are a lot of options for an exact implementation of this idea. An effective solution should be determined and chosen by modeling a set of alternatives; the details would be included in another design doc. A simple implementation would look something like the following.

Add two new fields to [ProcessResponse](https://github.com/cortexproject/cortex/blob/master/pkg/querier/frontend/frontend.proto#L21):

```protobuf
message ProcessResponse {
  httpgrpc.HTTPResponse httpResponse = 1;
  int32 currentConcurrency = 2;
  int32 desiredConcurrency = 3;
}
```

**currentConcurrency** - The current number of queries being executed by the querier.

**desiredConcurrency** - The total number of queries that a querier is capable of executing.

Add a short backoff to the main frontend [processing loop](https://github.com/cortexproject/cortex/blob/8fb86155a7c7c155b8c4d31b91b267f9631b60ba/pkg/querier/frontend/frontend.go#L288-L331). This would cause the frontend to briefly back off of any querier that was overloaded but continue to send queries to those that were capable of doing work.

```go
if current > desired {
	// Back off proportionally to how overloaded the querier is.
	backoff := time.Duration(current-desired) * backoffDuration
	backoff = time.Duration(float64(backoff) * (1 + rand.Float64()*0.1)) // jitter
	time.Sleep(backoff)
}
```

Passing flow control information from the querier to the frontend would also open up additional future work for more sophisticated load balancing across queriers. For example, by simply comparing and choosing [the less congested of two](https://www.nginx.com/blog/nginx-power-of-two-choices-load-balancing-algorithm/) queriers we could dramatically improve how well work is distributed.

### Increased Time To Failure

#### Challenge

Scaling the query frontend also increases the per tenant queue length by creating more queues. This could result in increased latencies where failing fast (429) would have been preferred.

The operator could reduce the queue length per query frontend in response to scaling out, but then they would run the risk of unnecessarily failing a request due to unbalanced distribution across query frontends. Also, shorter queues run the risk of failing to properly service heavily sharded queries.

Another concern is that a system with more queues will take longer to recover from a production event as it will have queued up more work.

#### Proposal

Currently we are not proposing any changes to alleviate this concern. We believe this is solvable operationally. This can be revisited as more information is gathered.

### Querier Discovery Lag

#### Challenge

Queriers have a configurable parameter that controls how often they refresh their query frontend list. The default value is 10 seconds. After a new query frontend is added, the average querier will take 5 seconds (after DNS is updated) to become aware of it and begin requesting queries from it.

#### Proposal

It is recommended to add a readiness/health check to the query frontend to prevent it from receiving queries while it is waiting for queriers to connect. HTTP health checks are supported by [envoy](https://www.envoyproxy.io/learn/health-check), [k8s](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/), [nginx](https://docs.nginx.com/nginx/admin-guide/load-balancer/http-health-check/), and basically any commodity load balancer. The query frontend would not indicate healthy on its health check until at least one querier had connected.

In a k8s environment this will require two services: one service for discovery with `publishNotReadyAddresses` set to true, and one service for load balancing which honors the healthcheck/readiness probe. After a new query-frontend instance is created, the “discovery service” would immediately have the IP of the new instance, which would allow queriers to discover and attach to it. After queriers had connected, the instance would pass its readiness probe, appear on the “load balancing” service, and begin receiving traffic.
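A minimal sketch of what such a readiness handler could look like is shown below. The `connectedQuerierWorkers` counter and the handler wiring are hypothetical; the point is simply that the frontend only reports ready once at least one querier has attached.

```go
package frontend

import (
	"net/http"
	"sync/atomic"
)

// connectedQuerierWorkers is a hypothetical counter, incremented and
// decremented as querier gRPC streams attach to and detach from the frontend.
var connectedQuerierWorkers int64

// readyHandler could back the readiness probe (e.g. registered at /ready).
// It fails while no querier workers are connected, so the load balancer keeps
// the new frontend out of rotation until queriers have discovered it.
func readyHandler(w http.ResponseWriter, _ *http.Request) {
	if atomic.LoadInt64(&connectedQuerierWorkers) == 0 {
		http.Error(w, "no querier workers connected", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}
```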
### Dilutes Tenant Fairness

#### Challenge

Given `f` query frontends, `n` tenants, and an average of `q` queued queries per tenant, and assuming that queries are perfectly distributed across query frontends, the number of tenants per instance would be:

<img src="https://render.githubusercontent.com/render/math?math=m = floor(n * \frac{min(q,f)}{f})">

The chance that a query by a tenant with `Q` queries in the frontend is serviced next is:

<img src="https://render.githubusercontent.com/render/math?math=min(Q,f)* \frac{1}{min(q * n %2b Q,f)}*\frac{1}{m %2b 1}">

Note that fewer query frontends caps the impact of the number of active queries per tenant. If there is only one query frontend then the equation reduces to:

<img src="https://render.githubusercontent.com/render/math?math=\frac{1}{n}">

and every tenant has an equal chance of being serviced regardless of the number of queued queries.

Adding more query frontends favors high volume tenants by giving them more slots to be picked up by the next available querier. Fewer query frontends allows for an even playing field regardless of the number of active queries.

For clarity, it should be noted that tenant fairness is only impacted if queries are being queued in the frontend. Under normal operations this is currently not occurring, although this may change with increased sharding.

#### Proposal

Tenancy fairness is complex and is currently _not_ impacting our system. Therefore we are proposing a very simple improvement to the query frontend. If/when frontend queueing becomes more common this can be revisited, as we will understand the problem better.

Currently the query frontend [picks a random tenant](https://github.com/cortexproject/cortex/blob/50f53dba8f8bd5f62c0e85cc5d85684234cd1c1c/pkg/querier/frontend/frontend.go#L362-L367) to service when a querier requests a new query. This can increase long tail latency if a tenant gets “unlucky” and is also exacerbated for low volume tenants by scaling the query frontend. Instead, the query frontend could use a round robin approach to choose the next tenant to service. Round robin is a commonly used algorithm to increase fairness in scheduling.

This would be a very minor improvement, but would give some guarantees to low volume tenants that their queries would be serviced. This has been proposed in this [issue](https://github.com/cortexproject/cortex/issues/2431).

**Pros:** Requires local knowledge only. Easier to implement than weighted round robin.

**Cons:** Improvement is minor.
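A minimal sketch of the round robin selection described above, assuming the frontend can produce a stable ordering of the tenants that currently have queued queries (the real frontend would also need to handle tenants appearing and disappearing between picks):

```go
package frontend

// roundRobinPicker is an illustrative sketch, not the actual frontend code:
// instead of picking a random tenant queue, the frontend remembers where it
// left off and services tenants in rotation.
type roundRobinPicker struct {
	next int // rotation position
}

// pick returns the next tenant to service from the tenants that currently
// have queued queries, and advances the rotation.
func (p *roundRobinPicker) pick(tenantsWithQueries []string) string {
	if len(tenantsWithQueries) == 0 {
		return ""
	}
	tenant := tenantsWithQueries[p.next%len(tenantsWithQueries)]
	p.next = (p.next + 1) % len(tenantsWithQueries)
	return tenant
}
```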
**Alternatives to Round Robin**

**Do Nothing**

As is noted above, tenancy fairness only comes into play when queries start queueing up in the query frontend. Internal metrics for multi-tenant Cortex at Grafana show that, over the past week, queueing has occurred only 5 times significantly enough to be caught by Prometheus.

Right now doing nothing is a viable option that will, almost always, fairly serve our tenants. There is, however, some concern that as sharding becomes more commonplace, queueing will become more common and QOS will suffer for the reasons outlined in [Dilutes Tenant Fairness](#dilutes-tenant-fairness).

**Pros:** Easy!

**Cons:** Nothing happens!

**Weighted Round Robin**

The query frontends could maintain a local record of throughput or work per tenant. Tenants could then be sorted into QOS bands. In its simplest form there would be two QOS bands: the band of low volume tenants would be serviced twice for every one time the band of high volume tenants is serviced. The full details of this approach would require a separate proposal, but a rough sketch follows below.

This solution would also open up interesting future work. For instance, we could allow operators to manually configure tenants into QOS bands.

**Pros:** Requires local knowledge only. Can be extended later to allow tenants to be manually sorted into QOS tiers.

**Cons:** Only an incremental improvement over Round Robin. Relies on even distribution of queries across frontends. Increased complexity and difficulty in reasoning about edge cases.
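The sketch below shows the two-band idea with hypothetical names, assuming the frontend has already sorted tenants into bands based on its locally observed volume. A tenant from the returned band would then be chosen with the same round robin approach described above.

```go
package frontend

// qosBands is an illustrative two-band weighted round robin, not a concrete
// design. Low volume tenants are serviced twice for every one time a high
// volume tenant is serviced.
type qosBands struct {
	low, high []string // tenant IDs, sorted into bands by locally observed volume
	turn      int
}

// nextBand returns the band to pick the next tenant from: two turns for the
// low volume band, then one for the high volume band, then repeat. An empty
// band is skipped by falling through to the other band.
func (b *qosBands) nextBand() []string {
	b.turn = (b.turn + 1) % 3
	if b.turn == 0 && len(b.high) > 0 {
		return b.high
	}
	if len(b.low) > 0 {
		return b.low
	}
	return b.high
}
```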
**Weighted Round Robin With Gossiped Traffic**

This approach would be equivalent to the Weighted Round Robin proposed above, but with tenant traffic volume gossiped between query frontends.

**Pros:** Benefits of Weighted Round Robin without the requirement of even query distribution. Even though it requires distributed information, a failure in gossip means it gracefully degrades to Weighted Round Robin.

**Cons:** Requires cross instance communication. Increased complexity and difficulty in reasoning about edge cases.

## Alternative

The proposals in this document have preferred augmenting existing components to make decisions with local knowledge. The unstated goal of these proposals is to build a distributed queue across a scaled query frontend that reliably and fairly serves our tenants.

Overall, these proposals will create a robust system that is resistant to network partitions and failures of individual pieces. However, it will also create a complex system that could be difficult to reason about and could contain hard-to-ascertain edge cases and nuanced failure modes.

The alternative is, instead of building a distributed queue, to add a new Cortex queueing service that sits in between the frontends and the queriers. This queueing service would pull from the frontends and distribute to the queriers. It would decouple the stateful queue from the stateless elements of the query frontend and allow us to easily scale the query frontend while keeping the queue itself a singleton. In a single binary HA mode one (or a few) of the replicas would be leader elected to serve this role.

Having a singleton queue is attractive because it is simple to reason about and gives us a single place to make fair cross tenant queueing decisions. It does, however, create a single point of failure and add another network hop to the query path.

## Conclusion

In this document we reviewed the [reasons the frontend exists](#query-frontend-role), the [challenges and proposals for scaling the frontend](#challenges-and-proposals), and [an alternative architecture that avoids most of these problems but comes with its own challenges](#alternative).

| Challenge | Proposal | Status |
| --- | --- | --- |
| Dynamic Querier Concurrency | Add Max Total Concurrency in Querier | [Pull Request](https://github.com/cortexproject/cortex/pull/2456) |
| Overwhelming PromQL Concurrency | Queriers Coordinate Concurrency with Frontends | Proposed |
| Increased Time to Failure | Operational/Configuration Issue. No Changes Proposed. | N/A |
| Querier Discovery Lag | Query Frontend HTTP Health Checks | [Pull Request](https://github.com/cortexproject/cortex/pull/2733) |
| Dilutes Tenant Fairness | Round Robin with additional alternatives proposed | [Pull Request](https://github.com/cortexproject/cortex/pull/2553) |