---
type: proposal
title: More granular query performance metrics
status: in-progress
owner: moadz
menu: proposals-done
---

* **Owners:**
  * @moadz

* **Related Tickets:**
  * [More granular query performance metrics issue by moadz](https://github.com/thanos-io/thanos/issues/1706)
  * [Accurately Measuring Query Performance by ianbillet](https://github.com/thanos-io/thanos/discussions/4674)

## Why

Currently, within Thanos, the only way to measure query latency is to look at the `http_request_duration_seconds` metric on the `query` and `query_range` HTTP handlers.

### Pitfalls of the current solution

There are a few reasons why measuring the latency on the `query` and `query_range` endpoints is a poor indicator of how much 'work' Thanos is doing. Thanos fans out the 'select' phase of a query to each of its store targets (other stores, sidecars etc.), streaming the series from each target over the network before executing the query on the resulting stream. This means that we actually touch far more series over the network than are returned in the final query response.

**An ideal metric for query performance must include the total number of samples and series retrieved remotely before the query is executed.**

## Goals

* **Must** instrument the aggregate number of samples returned during the select phase of a fanned-out query with respect to time, to better understand how query latency scales with the number of samples touched
* **Must** instrument the aggregate number of series returned during the select phase of a fanned-out query with respect to time, to better understand how query latency scales with the number of series touched
* **Could** instrument the aggregate response size (bytes) of fanned-out requests
  * This is less critical, as response size can be approximated from the number of series/samples

### Audience

* Thanos' developers and maintainers who would like to understand, more systematically, the series/sample funnel before a request is processed
* Thanos' users who would like to better understand and enforce SLOs around query performance with respect to the number of series/samples a query invariably touches

## Non-Goals

* Any performance-related optimizations
* Any approach that involves sampling of queries (traces already sample the aforementioned data)

## How

Mercifully, we already capture these stats in [tracing spans from the query path](https://github.com/thanos-io/thanos/blob/de0e3848ff6085acf89a5f77e053c555a2cce550/pkg/store/proxy.go#L393-L397). Unfortunately, traces are sampled and are difficult to query reliably for SLIs. My suggestion is that we add a new histogram, `thanos_store_query_duration_seconds`, that persists the total elapsed time for the query, with partitions (labels) for `series_le` and `samples_le` attached to each observation, to allow querying for a particular dimension (e.g. `thanos_store_query_duration_seconds` for the `0.99` request duration quantile with `> 1,000,000 samples` returned). Ideally we would represent this with an N-dimensional histogram, but this is not possible with Prometheus.
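
To illustrate how such a histogram could be queried for an SLI, here is a hedged PromQL sketch. It assumes the `samples_le`/`series_le` label values are the numeric upper bounds of the buckets suggested later in this document; the thresholds shown are illustrative, not final defaults.

```
# Fraction of requests that touched up to 1,000,000 samples across up to 10
# series and completed within 2s, over the last 5m (an SLI-style ratio).
# Assumes samples_le/series_le carry the numeric upper bound of the matched bucket.
  sum(rate(thanos_store_query_duration_seconds_bucket{le="2", series_le="10", samples_le=~"100|1000|10000|100000|1000000"}[5m]))
/
  sum(rate(thanos_store_query_duration_seconds_count{series_le="10", samples_le=~"100|1000|10000|100000|1000000"}[5m]))
```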

As the [`storepb.SeriesStatsCounter`](https://github.com/thanos-io/thanos/blob/de0e3848ff6085acf89a5f77e053c555a2cce550/pkg/store/storepb/custom.go#L470) already tracks this information in `thanos/store/proxy.go`, we will have to write an aggregator in the same file that can sum the series/samples for each 'query'. As one Thanos query is translated into multiple fanned-out queries and aggregated into a single response, we will need to do this once for all the queries [here](.).

### Given that queries will have highly variable samples/series returned, how do we manage cardinality?

Cardinality is indeed a concern: if we used the raw samples/series counts as label values, I suspect this metric would be totally useless. Instead we should define a set of 5-10 distinct 't-shirt size' buckets for the samples/series dimensions.

e.g. Exponential Scale

```
Samples:
s <= 1000 samples, m <= 10_000 samples, l <= 100_000 samples, xl <= 1_000_000 samples

Series:
s <= 10 series, m <= 100 series, l <= 1000 series, xl <= 10_000 series, xxl <= 100_000 series
```

So if a query returns `50,000 samples` spanning `10 series` in `32s`, it would be an observation of `32s` with the labels `samples_le=l, series_le=s`.

This would allow us to define query SLIs with respect to a specific number of series/samples and mean response time,

e.g. 90% of queries for up to 1,000,000 samples and up to 10 series complete in less than 2s.

### How will we determine appropriate buckets for our additional dimensions?

This is my primary concern. Given the countless permutations of node size, data distribution and Thanos topologies that might influence query sizes and response times, it is unlikely that any fixed thresholds will be appropriate for all deployments of Thanos. As a result, the 't-shirt sizes' will have to be configurable (possibly via runtime args) with some sensible defaults to allow users to tweak them. The obvious caveat is that if this metric is recorded, any changes to the bucketing would render previous data corrupted/incomprehensible. I would like guidance here if possible.

Suggested flags for configuring the query performance histogram quantiles:

```
// Buckets for labelling query performance metrics with respect to duration of query
--query.telemetry.request-duration-seconds-quantiles = [ 0.1, 0.25, 0.75, 1.25, 1.75, 2.5, 3, 5, 10 ]

// Buckets for labelling query performance metrics with respect to number of samples
--query.telemetry.sample-quantiles = [ 100, 1_000, 10_000, 100_000, 1_000_000 ]

// Buckets for labelling query performance metrics with respect to number of series
--query.telemetry.series-quantiles = [ 10, 100, 1000, 10_000, 100_000 ]
```

### How will we aggregate all series stats between unique series sets?

With our new metric we:
* Do not want to create separate histograms for each individual store query, so stats will need to be aggregated at the `Series` request level so that each observation covers all fanned-out store requests
* Do not want to block series receive on a `seriesStats` merging mutex for each incoming response, so maintaining a central `seriesStats` reference and passing it into each of the proxied store requests is out of the question (see the sketch below)
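
A minimal, self-contained Go sketch of these two constraints, using a stand-in `seriesStats` struct rather than the real `storepb.SeriesStatsCounter` and purely illustrative names: each fanned-out request counts its own stats locally and hands them over a buffered channel, and the merge happens once at the request level after the fan-out completes, so no store response ever blocks on a shared mutex.

```go
// Illustrative sketch only; not the actual proxy.go implementation.
package main

import (
	"fmt"
	"sync"
)

// seriesStats is a stand-in for storepb.SeriesStatsCounter.
type seriesStats struct {
	Series  int
	Samples int
}

func main() {
	stores := []string{"store-a", "store-b", "sidecar-c"} // illustrative store targets

	statsCh := make(chan seriesStats, len(stores)) // buffered: senders never block
	var wg sync.WaitGroup

	for _, st := range stores {
		wg.Add(1)
		go func(st string) {
			defer wg.Done()
			local := seriesStats{} // per-store counter, no locking needed
			// ... stream series from st, incrementing local.Series/local.Samples ...
			local.Series, local.Samples = 10, 5000 // placeholder values
			statsCh <- local
		}(st)
	}

	wg.Wait()
	close(statsCh)

	// Merge once, at the Series request level, after all stores are done.
	total := seriesStats{}
	for s := range statsCh {
		total.Series += s.Series
		total.Samples += s.Samples
	}
	fmt.Printf("aggregated: %+v\n", total)
}
```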

### How can we capture the query shape & latency spanning the entire query path?

The PromQL engine accepts a `storage.Queryable`, which denotes the source of series data for a particular query/query engine instance. The Thanos query store proxy implements this interface, providing an API for aggregating series across the disparate store targets registered to the store instance.

As the 'shape' of the query is a consequence of our `storage.Queryable` implementation (backed by the proxy store API), there is no way to pass the `SeriesStats` through the PromQL engine query execution in the Thanos query path without changing the Prometheus `Querier` interface.

Example of what an upstream `Querier` interface change would look like if we included the series stats (or some generic representation of SelectFnStats):

```go
// Querier provides querying access over time series data of a fixed time range.
type Querier interface {
	LabelQuerier

	// Select returns a set of series that matches the given label matchers.
	// Caller can specify if it requires returned series to be sorted. Prefer not requiring sorting for better performance.
	// It allows passing hints that can help in optimising select, but it's up to implementation how this is used if used at all.
	Select(sortSeries bool, hints *SelectHints, matchers ...*labels.Matcher) (SeriesSet, SeriesStats)
}
```

By amending the Prometheus `storage.Querier` interface to include the `SeriesStats` (or some form of it) alongside the `SeriesSet` when a `Select` is performed, all `Querier`s would have to return stats alongside selects (which may be a good thing, but is a breaking API change). I want to explore doing this in upstream Prometheus, but the implementation below is an intermediate step to see whether this approach is useful at all in capturing the select-phase shape/latencies independently.

Due to the limitations of the Prometheus `Querier` API, we can instead use the reporter pattern: override the `QueryableCreator` constructor to take an extra function parameter that exfiltrates the series stats from the `Select` function, and submit the query time observation in the API query handler.

**tl;dr:** Longer term, capture the entire query path by amending the Prometheus `Querier` API to return stats alongside the query, and create this generic metric inside the Prometheus PromQL engine. Short term, pass a func parameter to the Queryable constructor for the proxy StoreAPI querier that will exfiltrate the `SeriesStats`, circumventing the PromQL engine.
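
To make the reporter pattern concrete, here is a small, self-contained sketch with stand-in types; none of these names are the actual Thanos `storage.Queryable`/`Querier` implementation. The queryable is constructed with a callback, the querier invokes it with the stats gathered during `Select`, and the API handler owns the slice the callback appends to.

```go
// Illustrative sketch of the reporter pattern only; types and names are stand-ins.
package main

import "fmt"

// seriesStats is a stand-in for storepb.SeriesStatsCounter.
type seriesStats struct{ Series, Samples int }

// seriesStatsReporter is the callback used to exfiltrate stats out of Select.
type seriesStatsReporter func(seriesStats)

type querier struct {
	report seriesStatsReporter
}

// Select pretends to fan out to stores, then reports the stats it gathered
// before handing the series set back to the (omitted) PromQL engine.
func (q *querier) Select(matchers ...string) []string {
	stats := seriesStats{Series: 10, Samples: 50_000} // placeholder values
	q.report(stats)
	return []string{"series-1", "series-2"}
}

// newQueryable mirrors the proposed QueryableCreator change: the reporter is
// injected at construction time rather than threaded through the engine.
func newQueryable(report seriesStatsReporter) *querier {
	return &querier{report: report}
}

func main() {
	// The API handler owns the collected stats and observes them after the
	// query has been evaluated.
	var collected []seriesStats
	q := newQueryable(func(s seriesStats) { collected = append(collected, s) })

	_ = q.Select("up")
	fmt.Printf("stats exfiltrated from Select: %+v\n", collected)
}
```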

### Measuring Thanos Query Latency with respect to query fanout

First we would create a new `SeriesQueryPerformanceMetricsAggregator` for aggregating/tracking the `SeriesStatsCounter`s for each fanned-out query.

Go pseudocode:

```go
type SeriesQueryPerformanceMetricsAggregator struct {
	QueryDuration *prometheus.HistogramVec

	SeriesLeBuckets  []float64
	SamplesLeBuckets []float64
	SeriesStats      storepb.SeriesStatsCounter
}

func NewSeriesQueryPerformanceMetricsAggregator(reg prometheus.Registerer) *SeriesQueryPerformanceMetricsAggregator {
	return &SeriesQueryPerformanceMetricsAggregator{
		QueryDuration: promauto.With(reg).NewHistogramVec(prometheus.HistogramOpts{
			Name:    "thanos_query_duration_seconds",
			Help:    "duration of the thanos store select phase for a query",
			Buckets: []float64{0.1, 0.25, 0.5, 1, 2, 3, 5}, // These buckets will be passed via command-line arg
		}, []string{"series_le", "samples_le"}),
		SeriesStats: storepb.SeriesStatsCounter{},
	}
}

// Aggregate merges the `storepb.SeriesStatsCounter` of each incoming fanned-out query.
func (s *SeriesQueryPerformanceMetricsAggregator) Aggregate(seriesStats storepb.SeriesStatsCounter) {
	s.SeriesStats.Series += seriesStats.Series
	s.SeriesStats.Samples += seriesStats.Samples
	s.SeriesStats.Chunks += seriesStats.Chunks
}

// Observe commits the aggregated SeriesStatsCounter as an observation.
func (s *SeriesQueryPerformanceMetricsAggregator) Observe(duration float64) {
	// Bucket matching for series/samples: matchSeriesBucket/matchSamplesBucket => float64, float64
	seriesLeBucket := s.findBucket(s.SeriesStats.Series, s.SeriesLeBuckets)
	samplesLeBucket := s.findBucket(s.SeriesStats.Samples, s.SamplesLeBuckets)
	s.QueryDuration.With(prometheus.Labels{
		"series_le":  strconv.Itoa(seriesLeBucket),
		"samples_le": strconv.Itoa(samplesLeBucket),
	}).Observe(duration)
}

// findBucket determines the appropriate bucket for a given value relative to a set of thresholds.
func (s *SeriesQueryPerformanceMetricsAggregator) findBucket(value int, thresholds []float64) int
```

(A sketch of a possible `findBucket` implementation is included after the listings below.)

Current query fanout logic:

```go
for _, st := range s.stores() {
	// [Error handling etc.]
	// Schedule streamSeriesSet that translates gRPC streamed response
	// into seriesSet (if series) or respCh if warnings.
	seriesSet = append(seriesSet, startStreamSeriesSet(seriesCtx, reqLogger, span, closeSeries,
		wg, sc, respSender, st.String(), !r.PartialResponseDisabled, s.responseTimeout, s.metrics.emptyStreamResponses, seriesStatsCh))
}
```

[Propagating the `SeriesStats` via `storepb.SeriesServer`](https://github.com/thanos-io/thanos/blob/de0e3848ff6085acf89a5f77e053c555a2cce550/pkg/store/proxy.go#L362):

```go
type seriesServer struct {
	// This field just exists to pseudo-implement the unused methods of the interface.
	storepb.Store_SeriesServer
	ctx context.Context

	seriesSet      []storepb.Series
	seriesSetStats storepb.SeriesStatsCounter
	warnings       []string
}

func (s *seriesServer) Send(r *storepb.SeriesResponse) error {
	if r.GetWarning() != "" {
		s.warnings = append(s.warnings, r.GetWarning())
		return nil
	}

	if r.GetSeries() != nil {
		s.seriesSet = append(s.seriesSet, *r.GetSeries())
		// For each appended series, increment the seriesStats.
		s.seriesSetStats.Count(r.GetSeries())
		return nil
	}

	// Unsupported field, skip.
	return nil
}
```
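
The aggregator pseudocode above leaves `findBucket` undefined. A minimal, self-contained sketch of one possible implementation follows, written as a standalone helper for readability; it assumes the configured thresholds are sorted ascending and that the label value is the numeric upper bound of the matched bucket, with values larger than every threshold collapsed into the last bucket.

```go
// Sketch only: one possible bucket-matching strategy for the samples_le/series_le labels.
package main

import (
	"fmt"
	"strconv"
)

func findBucket(value int, thresholds []float64) int {
	if len(thresholds) == 0 {
		return value // no bucketing configured; fall back to the raw value
	}
	for _, t := range thresholds {
		if float64(value) <= t {
			return int(t) // first threshold the value fits under
		}
	}
	// Overflow: report the largest configured threshold.
	return int(thresholds[len(thresholds)-1])
}

func main() {
	samplesBuckets := []float64{100, 1_000, 10_000, 100_000, 1_000_000}
	seriesBuckets := []float64{10, 100, 1_000, 10_000, 100_000}

	// 50,000 samples over 10 series => samples_le="100000", series_le="10".
	fmt.Println("samples_le =", strconv.Itoa(findBucket(50_000, samplesBuckets)))
	fmt.Println("series_le  =", strconv.Itoa(findBucket(10, seriesBuckets)))
}
```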

Now that the `SeriesStats` are propagated into the `storepb.SeriesServer`, we can amend the `selectFn` function to return a tuple of `(storage.SeriesSet, storepb.SeriesStatsCounter, error)`.

Amending the `QueryableCreator` to provide a func parameter:

```go
type SeriesStatsReporter func(seriesStats storepb.SeriesStatsCounter)

type QueryableCreator func(deduplicate bool, replicaLabels []string, storeDebugMatchers [][]*labels.Matcher, maxResolutionMillis int64, partialResponse, skipChunks bool, seriesStatsReporter SeriesStatsReporter) storage.Queryable

// NewQueryableCreator creates QueryableCreator.
func NewQueryableCreator(logger log.Logger, reg prometheus.Registerer, proxy storepb.StoreServer, maxConcurrentSelects int, selectTimeout time.Duration) QueryableCreator {
	duration := promauto.With(
		extprom.WrapRegistererWithPrefix("concurrent_selects_", reg),
	).NewHistogram(gate.DurationHistogramOpts)

	return func(deduplicate bool, replicaLabels []string, storeDebugMatchers [][]*labels.Matcher, maxResolutionMillis int64, partialResponse, skipChunks bool, seriesStatsReporter SeriesStatsReporter) storage.Queryable {
		return &queryable{
			logger:              logger,
			replicaLabels:       replicaLabels,
			storeDebugMatchers:  storeDebugMatchers,
			proxy:               proxy,
			deduplicate:         deduplicate,
			maxResolutionMillis: maxResolutionMillis,
			partialResponse:     partialResponse,
			skipChunks:          skipChunks,
			gateProviderFn: func() gate.Gate {
				return gate.InstrumentGateDuration(duration, promgate.New(maxConcurrentSelects))
			},
			maxConcurrentSelects: maxConcurrentSelects,
			selectTimeout:        selectTimeout,
			seriesStatsReporter:  seriesStatsReporter,
		}
	}
}
```

Injecting the reporter into the qapi `Queryable` static constructor:

```go
var (
	ssmtx       sync.Mutex
	seriesStats []storepb.SeriesStatsCounter
)
seriesStatsReporter := func(ss storepb.SeriesStatsCounter) {
	ssmtx.Lock()
	defer ssmtx.Unlock()

	seriesStats = append(seriesStats, ss)
}
qry, err := qe.NewRangeQuery(
	qapi.queryableCreate(enableDedup, replicaLabels, storeDebugMatchers, maxSourceResolution, enablePartialResponse, false, seriesStatsReporter),
	r.FormValue("query"),
	start,
	end,
	step,
)
```

In summary, we will:
* Amend the `seriesServer` to keep track of the `SeriesStats` for every series pushed to it
* Amend the static `qapi.queryableCreate` to take a `SeriesStatsReporter` func parameter that will exfiltrate the series stats from the Thanos proxy StoreAPI
* Add new runtime flags that will allow us to specify a) query time quantiles, b) series size quantiles, c) sample size quantiles for our partitioned histogram
* Start a query duration timer as soon as the handler is hit
* Create a new partitioned vector histogram called `thanos_query_duration_seconds` in the `queryRange` API handler
* Propagate all exfiltrated `SeriesStats` to the aforementioned metric
* Record observations against the `thanos_query_duration_seconds` histogram after matching the appropriate `samples_le`/`series_le` buckets

## Alternatives

* Add the `seriesStats` to the `SeriesResponse` and aggregate it in the `responseCh`; this would enable us to propagate the series stats further up/into other components if need be (see the sketch below for what this could look like)
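
For illustration only, a sketch of that alternative, assuming a hypothetical `stats` variant were added alongside `series` and `warning` on the response message; no such field exists in the current proto, so the types below are stand-ins rather than the real `storepb.SeriesResponse`.

```go
// Stand-in types only: seriesResponse mimics a storepb.SeriesResponse that has
// grown a hypothetical stats variant alongside series and warning.
package main

import "fmt"

type seriesStats struct{ Series, Samples int }

type seriesResponse struct {
	series  *string      // set when the response carries a series
	warning string       // set when the response carries a warning
	stats   *seriesStats // hypothetical new variant carrying per-store stats
}

func main() {
	// respCh mimics the proxy's response channel, with each store emitting its
	// stats as a final message after its series.
	respCh := make(chan seriesResponse, 4)
	s := `up{instance="a"}`
	respCh <- seriesResponse{series: &s}
	respCh <- seriesResponse{stats: &seriesStats{Series: 1, Samples: 120}}
	respCh <- seriesResponse{stats: &seriesStats{Series: 9, Samples: 4880}}
	close(respCh)

	total := seriesStats{}
	for r := range respCh {
		if r.stats != nil {
			// Aggregate stats inline while draining the response channel.
			total.Series += r.stats.Series
			total.Samples += r.stats.Samples
			continue
		}
		// ... series and warnings handled as they are today ...
	}
	fmt.Printf("aggregated from response channel: %+v\n", total)
}
```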