- Feature Name: SQL Optimizer Statistics
- Status: in-progress
- Start Date: 2017-09-08
- Authors: Peter Mattis, Radu Berinde
- RFC PR: [#18399](https://github.com/cockroachdb/cockroach/pull/18399)
- Cockroach Issue: (one or more # from the issue tracker)

# Summary

This RFC describes the motivations and mechanisms for collecting
statistics to power SQL optimization decisions. The potential
performance impact is large: the decisions an optimizer makes can
speed up queries by orders of magnitude.

# Motivation

Modern SQL optimizers seek the lowest cost for a query. The cost of a
query is usually related to time, but is typically not expressed
directly in units of time. For example, the cost might be in units of
disk seeks or RPCs or some unnamed unit related to I/O. A cost model
estimates the cost for a particular query plan, and the optimizer
seeks to find the lowest-cost plan from among many alternatives. The
inputs to the cost model are the query plan and statistics about the
table data that guide selection between alternatives.

One example of where statistics guide query optimization is join
ordering. Consider the natural join:

`SELECT * FROM a NATURAL JOIN b`

In the absence of other opportunities, this might be implemented as a
hash join. With a hash join, we want to load the smaller set of rows
(either from `a` or `b`) into the hash table and then probe that table
while looping through the larger set of rows. How do we know whether
`a` or `b` is larger? We keep statistics about the cardinality of `a`
and `b`.

Simple table cardinality is sufficient for the above query but fails
in other queries. Consider:

`SELECT * FROM a JOIN b ON a.x = b.x WHERE a.y > 10`

Table statistics might indicate that `a` contains 10x more data than
`b`, but the predicate `a.y > 10` filters out a chunk of the table.
What we care about is whether the scan of `a` after filtering returns
more rows than the scan of `b`. This can be determined by estimating
the selectivity of the predicate `a.y > 10` and multiplying that
selectivity by the cardinality of `a`. The common technique for
estimating selectivity is to collect a histogram on `a.y` (prior to
running the query).

A final example is the query:

`SELECT * FROM system.rangelog WHERE rangeID = ? ORDER BY timestamp DESC LIMIT 100`

Currently, `system.rangelog` does not have an index on `rangeID`, but
does have an index (the primary key) on `timestamp`. This query is
planned so that it sequentially walks through the ranges in descending
timestamp order until it fills up the limit. If there are a large
number of ranges in the system, the selectivity of `rangeID = ?` will
be low, and we can estimate how many ranges we'll need to query in
order to find 100 rows. Rather than walking through the ranges
sequentially, we can query batches of ranges concurrently, where the
size of the batches is calculated from the expected number of matches
per range.
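To make the last example concrete: the expected number of matching
rows per range is the per-range row count multiplied by the estimated
selectivity of `rangeID = ?`, and a reasonable batch size follows from
the limit. The Go sketch below is purely illustrative; the function
name and the numbers in it are hypothetical, not planner code.

```go
package main

import "fmt"

// estimateBatchSize is a hypothetical illustration of sizing concurrent
// range batches from a selectivity estimate; it is not CockroachDB code.
func estimateBatchSize(limit int, rowsPerRange, selectivity float64) int {
	expectedMatchesPerRange := rowsPerRange * selectivity
	if expectedMatchesPerRange <= 0 {
		return limit // no information; fall back to a large batch
	}
	// Number of ranges we expect to need before the LIMIT is satisfied.
	batch := int(float64(limit)/expectedMatchesPerRange + 0.5)
	if batch < 1 {
		batch = 1
	}
	return batch
}

func main() {
	// Assumed numbers: ~10,000 rangelog rows per range and a 0.1%
	// selectivity for `rangeID = ?` give ~10 matches per range, so a
	// batch of ~10 ranges should be enough to satisfy LIMIT 100.
	fmt.Println(estimateBatchSize(100, 10000, 0.001)) // prints 10
}
```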
These examples are simple and clear. The academic literature is
littered with additional details. How do you estimate the selectivity
of multiple conjunctive or disjunctive predicates? How do you estimate
the number of rows output from various operators such as joins and
aggregations? Some of these questions are beyond the scope of this
RFC, which is focused on the collection of basic statistics.

# Terminology

* Selectivity: The number of values that pass a predicate divided by
  the total number of values. The selectivity multiplied by the
  cardinality can be used to compute the total number of values after
  applying the predicate to a set of values. The selectivity is
  interesting independent of the number of values because
  selectivities for multiple predicates can be combined.

* Histogram: Histograms are the classic data structure for computing
  selectivity. One kind that is frequently used is the equi-depth
  histogram, where each bucket contains approximately (or exactly) the
  same number of values. More complex histograms exist that aim to
  provide more accuracy for various data distributions, e.g. "max-diff"
  histograms. A simple one-pass algorithm exists for constructing an
  equi-depth histogram from ordered values. A histogram for unordered
  values can be constructed by first sampling the data (e.g. using
  reservoir sampling), sorting the sample, and then running the
  one-pass algorithm on the sorted data.

* Cardinality: The number of distinct values (in a single column, or
  in a group of columns).

* Sketch: Used for cardinality estimation. Algorithms such as
  [HyperLogLog](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf)
  use a hash of the value to estimate the number of distinct values.
  Sketches can be merged, allowing for parallel computation. For
  example, the sketches for distinct values on different ranges can be
  combined into a single sketch for an entire table.

# Design considerations

* Statistics need to be available on every node performing query
  optimization (i.e. every node receiving SQL queries).

* Statistics collection needs to be low overhead, but not necessarily
  the lowest overhead. Statistics do not need to be updated in real
  time.

* Statistics collection should be decoupled from consumption. A simple
  initial implementation of statistics collection should not preclude
  more complex collection in the future.

* The desired statistics to collect are:
  - a histogram for a given column
  - a cardinality for a given set of columns

* The count of distinct values for a column or group of columns can be
  computed using a sketch algorithm such as
  [HyperLogLog](https://en.wikipedia.org/wiki/HyperLogLog). Such
  sketches can be merged, making it feasible to collect a sketch at
  the range level and combine the sketches to create a table-level
  sketch (a toy illustration of this merging follows this list).

* Optimizing `AS OF SYSTEM TIME` queries implies being able to
  retrieve statistics at a point in time. The statistics do not need
  to be perfectly accurate, but should be relatively close. On the
  other hand, historical queries are relatively rare right now.
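To make the mergeability property concrete, here is a toy
HyperLogLog-style sketch in Go. It only illustrates why per-range
sketches can be combined into a table-level sketch; the precision, the
hash function, and the missing bias corrections are simplifications,
and none of the names below correspond to actual CockroachDB code.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math"
	"math/bits"
)

// p controls the number of registers (2^p); the value is arbitrary here.
const p = 12

// sketch is a toy HyperLogLog-style distinct-count sketch.
type sketch struct {
	registers [1 << p]uint8
}

func (s *sketch) add(value string) {
	h := fnv.New64a()
	h.Write([]byte(value))
	x := h.Sum64()
	idx := x >> (64 - p) // the first p bits of the hash pick a register
	// Rank = position of the first 1-bit in the remaining bits.
	rank := uint8(bits.LeadingZeros64(x<<p|1)) + 1
	if rank > s.registers[idx] {
		s.registers[idx] = rank
	}
}

// merge combines another sketch into s by taking the per-register maximum;
// this is what lets per-range sketches be combined into a table-level one.
func (s *sketch) merge(o *sketch) {
	for i := range s.registers {
		if o.registers[i] > s.registers[i] {
			s.registers[i] = o.registers[i]
		}
	}
}

// estimate returns the raw HyperLogLog cardinality estimate, without the
// small- and large-range corrections a real implementation would apply.
func (s *sketch) estimate() float64 {
	m := float64(len(s.registers))
	sum := 0.0
	for _, r := range s.registers {
		sum += math.Pow(2, -float64(r))
	}
	alpha := 0.7213 / (1 + 1.079/m)
	return alpha * m * m / sum
}

func main() {
	var range1, range2 sketch
	for i := 0; i < 50000; i++ {
		range1.add(fmt.Sprintf("user-%d", i))
	}
	for i := 25000; i < 75000; i++ {
		range2.add(fmt.Sprintf("user-%d", i))
	}
	// Combining the per-range sketches yields a table-level estimate.
	range1.merge(&range2)
	fmt.Printf("estimated distinct values: %.0f (true value: 75000)\n",
		range1.estimate())
}
```

The key point is that `merge` is a per-register maximum, so sketches
computed independently on different ranges combine into the same
sketch a single pass over the whole table would have produced.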
# Design overview

This RFC focuses on the infrastructure and does not set a policy for
what stats to collect or when to collect them; these are discussed
under future work.

A single node per table will act as the central coordinator for
gathering table-level stats. This node sets up a DistSQL operation
that involves reading the primary index and sampling rows (as well as
computing sketches). While performed at a specific timestamp, the
table read operations explicitly skip the timestamp cache in order to
avoid disturbing foreground traffic.

[TBD: which node acts as the stats coordinator for a table? Presumably
there is some sort of lease based on the table ID. We can possibly
reuse or copy whatever mechanism is used for schema changes.]

A new system table provides storage for stats:
`system.table_statistics`. This table contains the count of distinct
values, the count of NULL values, and the histogram buckets for a
column/index.

During query optimization, the stats for a table are retrieved from a
per-node stats cache. The cache is populated on demand from the
`system.table_statistics` table and refreshed periodically.

Pseudo-stats are used whenever actual statistics are unavailable. For
example, the selectivity of a less-than or greater-than predicate is
estimated as 1/3. The academic literature provides additional
heuristics to use when stats are unavailable.

An Admin UI endpoint is provided for visualizing the statistics for a
table.
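As a minimal sketch of the pseudo-stats fallback described above, the
snippet below hard-codes a default selectivity per predicate type.
Only the 1/3 value for `<` and `>` comes from this RFC; the other
constants are illustrative placeholders, not values the optimizer is
committed to.

```go
package main

import "fmt"

// defaultSelectivity returns a pseudo-selectivity to use when no table
// statistics are available. The 1/3 value for range predicates is the one
// mentioned in this RFC; the remaining constants are placeholders.
func defaultSelectivity(op string) float64 {
	switch op {
	case "<", ">", "<=", ">=":
		return 1.0 / 3.0
	case "=":
		return 0.01 // placeholder guess for equality
	default:
		return 1.0 // assume no filtering
	}
}

func main() {
	fmt.Println(defaultSelectivity(">")) // 0.333...
}
```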
## Detailed design

### `system.table_statistics`

The table-level statistics are stored in a new
`system.table_statistics` table. This table contains a row per
attribute or group of attributes that we maintain stats for. The
schema is:

```
CREATE TABLE system.table_statistics (
  table_id INT,
  statistic_id INT,
  column_ids INT[],
  created_at TIMESTAMP,
  row_count INT,
  cardinality INT,
  null_values INT,
  histogram BYTES,
  PRIMARY KEY (table_id, statistic_id)
)
```

Each table has zero or more rows in the `table_statistics` table,
corresponding to the statistics maintained for the table.

* `table_id` is the ID of the table.
* `statistic_id` is an ID that identifies each particular statistic
  for a table.
* `column_ids` stores the IDs of the columns for which the statistic
  is generated.
* `created_at` is the time at which the statistic was generated.
* `row_count` is the total number of rows in the table.
* `cardinality` is the estimated cardinality (the number of distinct
  values on the columns in `column_ids`).
* `null_values` is the number of rows that have a NULL in any of the
  columns in `column_ids`; these rows don't contribute to the
  cardinality.
* `histogram` is optional and can only be set if there is a single
  column; it encodes a proto defined as:

```
message HistogramData {
  message Bucket {
    // The estimated number of rows that are equal to upper_bound.
    optional int64 eq_rows;

    // The estimated number of rows in the bucket (excluding those
    // that are equal to upper_bound). Splitting the count in two
    // makes the histogram effectively equivalent to a histogram with
    // twice as many buckets, with every other bucket containing a
    // single value. This might be particularly advantageous if the
    // histogram algorithm makes sure the top "heavy hitters" (most
    // frequent elements) are bucket boundaries (similar to a
    // compressed histogram).
    optional int64 range_rows;

    // The upper boundary of the bucket. The column values for the
    // upper bound are encoded using the same ordered encoding used
    // for index keys.
    optional bytes upper_bound;
  }

  // Histogram buckets. Note that NULL values are excluded from the
  // histogram.
  repeated Bucket buckets;
}
```

[TBD: using a proto for the histogram data is convenient but makes
ad-hoc querying of the stats data more difficult. We could store the
buckets in a series of parallel SQL arrays, or we could store them in
a separate table.]

### Estimating selectivity

Given a predicate such as `a > 1` and a histogram on `a`, how do we
compute the selectivity of the predicate? For range predicates such as
`>` and `<`, we determine which bucket the value lies in, interpolate
within the bucket to estimate the fraction of the bucket's values that
match the predicate (assuming a uniform data distribution), and then
perform a straightforward calculation to determine the fraction of
values in the entire histogram that the predicate will allow through.

An equality predicate deserves special mention as it is common and
somewhat more challenging to estimate the selectivity for. The
distinct values in a bucket are assumed to be uniformly distributed
across the bucket. For example, consider a histogram on `a` where the
bucket containing the value `1` has a `count` of `100` and a
`distinct_count` of `2`. This means there were `100` rows in the table
with values of `a` that fell in the bucket, but only `2` distinct
values of `a`. We estimate `a = 1` to match `100 / 2 == 50 rows` (an
additional division by the table cardinality can provide a fraction).
In general, the estimated row count for an equality predicate is
`bucket_count / bucket_distinct_count`.
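The Go sketch below makes the bucket arithmetic concrete for a
histogram over integers. The `bucket` struct mirrors the
`eq_rows`/`range_rows`/`upper_bound` split from `HistogramData`, but
uses decoded integer bounds and adds a hypothetical per-bucket
distinct count for the equality case; it illustrates the estimation
rules above and is not the planner's actual code.

```go
package main

import "fmt"

// bucket mirrors HistogramData.Bucket with a decoded integer upper bound
// and a hypothetical distinct count for the range portion. Buckets are
// ordered by upperBound; a bucket's lower bound is the previous bucket's
// upperBound (exclusive).
type bucket struct {
	eqRows        float64 // rows equal to upperBound
	rangeRows     float64 // rows strictly between the bounds
	rangeDistinct float64 // assumed distinct values in the range portion
	upperBound    int64
}

type histogram struct {
	buckets   []bucket
	totalRows float64
}

// selectivityEq estimates the selectivity of `col = v`, assuming the
// distinct values in a bucket are uniformly distributed across it.
func (h histogram) selectivityEq(v int64) float64 {
	for _, b := range h.buckets {
		if v == b.upperBound {
			return b.eqRows / h.totalRows
		}
		if v < b.upperBound {
			if b.rangeDistinct == 0 {
				return 0
			}
			return (b.rangeRows / b.rangeDistinct) / h.totalRows
		}
	}
	return 0 // v is above every bucket
}

// selectivityGt estimates the selectivity of `col > v` by linearly
// interpolating within the bucket that contains v.
func (h histogram) selectivityGt(v int64) float64 {
	rows := 0.0
	lower := int64(0) // assumed lower bound of the first bucket
	for _, b := range h.buckets {
		switch {
		case v >= b.upperBound:
			// The entire bucket is at or below v; it contributes nothing.
		case v < lower:
			rows += b.rangeRows + b.eqRows // the entire bucket is above v
		default:
			// v falls inside this bucket: interpolate the fraction above v.
			frac := float64(b.upperBound-1-v) / float64(b.upperBound-lower)
			rows += b.rangeRows*frac + b.eqRows
		}
		lower = b.upperBound
	}
	return rows / h.totalRows
}

func main() {
	h := histogram{
		buckets: []bucket{
			{eqRows: 50, rangeRows: 100, rangeDistinct: 2, upperBound: 10},
			{eqRows: 25, rangeRows: 200, rangeDistinct: 50, upperBound: 100},
		},
		totalRows: 375,
	}
	// Mirrors the example above: 100 rows / 2 distinct values = 50 rows.
	fmt.Printf("selectivity of a = 1:  %.3f\n", h.selectivityEq(1))
	fmt.Printf("selectivity of a > 50: %.3f\n", h.selectivityGt(50))
}
```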
### Collecting statistics

Consider the table:

```
CREATE TABLE test (
  k INT PRIMARY KEY,
  v STRING,
  INDEX (v, k)
)
```

At a high level, we will maintain statistics about each column and
each tuple of columns composing an index. For the above table, that
translates into statistics about `k`, `v` and the tuple `(v, k)`. The
per-column statistics allow estimation of the selectivity of a
predicate on that column. Similarly, the per-index statistics allow
estimation of the selectivity of a predicate on the columns in the
index. Note that, in general, there are vastly more possible
permutations and combinations of columns that we don't maintain
statistics on than there are indexes. The intuition behind collecting
statistics on indexes is that the existence of an index is a hint that
the table will be queried in such a way that the columns in the index
will be used.

The collection of statistics for a table span will be provided by a
`Sampler` processor. This processor receives rows from a TableReader
and:
- calculates sketches for estimating the density vector, and
- selects a random sample of rows of a given size; it does this by
  generating a random number for each row and choosing the rows with
  the top K values. The random values are returned with each row,
  allowing multiple sample sets to be combined into a single set.

```
enum SketchType {
  HLL_PLUS_PLUS_v1 = 0
}

// SamplerSpec is the specification of a "sampler" processor which
// returns a sample (random subset) of the input columns and computes
// cardinality estimation sketches on sets of columns.
//
// The sampler is configured with a sample size and sets of columns
// for the sketches. It produces one row with global statistics, one
// row with sketch information for each sketch, plus at most
// sample_size sampled rows.
//
// The internal schema of the processor is formed of two column
// groups:
//   1. sampled row columns:
//      - columns that map 1-1 to the columns in the input (same
//        schema as the input).
//      - an INT column with the random value associated with the row
//        (this is necessary for combining sample sets).
//   2. sketch columns:
//      - an INT column indicating the sketch index
//        (0 to len(sketches) - 1).
//      - an INT column indicating the number of rows processed.
//      - an INT column indicating the number of NULL values on the
//        first column of the sketch.
//      - a BYTES column with the binary sketch data (format dependent
//        on the sketch type).
// Rows have NULLs on either all the sampled row columns or on all the
// sketch columns.
message SamplerSpec {
  optional uint32 sample_size;

  message SketchSpec {
    optional SketchType sketch_type;

    // Each value is an index identifying a column in the input stream.
    repeated uint32 columns;
  }
  repeated SketchSpec sketches;
}
```

A separate `SampleAggregator` processor aggregates the results from
multiple Samplers and generates the histogram and other statistics
data. The processor is configured to populate the relevant rows in
`system.table_statistics`.

```
message SampleAggregator {
  optional sqlbase.TableDescriptor table;

  // The processor merges reservoir sample sets into a single
  // sample set of this size. This must match the sample size
  // used for each Sampler.
  optional uint32 sample_size;

  message SketchSpec {
    optional SketchType sketch_type;

    // Each value is a sqlbase.ColumnID.
    repeated uint32 column_ids;

    // If set, we generate a histogram for the first column in the sketch.
    optional bool generate_histogram;
  }
  repeated SketchSpec sketches;

  // The i-th value indicates the column ID of the i-th sampled row
  // column.
  repeated uint32 sampled_column_ids;

  // Indicates the columns for which we want to generate histograms.
  // Must be a subset of the sampled column IDs.
  repeated uint32 histogram_column_ids;
}
```

The SampleAggregator combines the samples into a single sample set (by
choosing the rows with the top K random values across all the sets);
the histogram is constructed from this sample set. In terms of
implementation, both the Sampler and the SampleAggregator can use the
same method of obtaining the top-K rows. Some efficient methods (a toy
sketch of this scheme follows this list):
- maintain the top K rows in a heap. Updating the heap is a
  logarithmic operation, but it only needs to happen when we find a
  new top element (for the initial sampling, we have only O(K log K)
  operations).
- maintain the top 2K (or 1.5K) rows; when the set is full, select the
  Kth element (which takes O(K)) and trim the set down to K.
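Here is a toy Go version of the heap-based method and of the
SampleAggregator-style merge, showing how reusing the random value
attached to each sampled row keeps the combined sample uniform. The
types and helpers are hypothetical; this is not the DistSQL processor
implementation.

```go
package main

import (
	"container/heap"
	"fmt"
	"math/rand"
)

// sampledRow pairs a row with the random rank used for top-K sampling,
// mirroring the extra INT column the Sampler emits with each sampled row.
// The row type is a placeholder for illustration.
type sampledRow struct {
	row  string
	rank int64
}

// minHeap keeps the K rows with the *largest* ranks by evicting the
// smallest rank when the heap is full.
type minHeap []sampledRow

func (h minHeap) Len() int            { return len(h) }
func (h minHeap) Less(i, j int) bool  { return h[i].rank < h[j].rank }
func (h minHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x interface{}) { *h = append(*h, x.(sampledRow)) }
func (h *minHeap) Pop() interface{} {
	old := *h
	x := old[len(old)-1]
	*h = old[:len(old)-1]
	return x
}

// sampler selects a uniform random sample of size k using the
// "top K random values" scheme described above.
type sampler struct {
	k int
	h minHeap
}

// sampleRow assigns a fresh random rank to a new input row.
func (s *sampler) sampleRow(row string) {
	s.add(sampledRow{row: row, rank: rand.Int63()})
}

// add inserts an already-ranked row; reusing the existing rank is what
// makes merging sample sets from multiple Samplers correct.
func (s *sampler) add(r sampledRow) {
	if len(s.h) < s.k {
		heap.Push(&s.h, r)
	} else if r.rank > s.h[0].rank {
		s.h[0] = r
		heap.Fix(&s.h, 0)
	}
}

// merge folds another sample set into this one, which is what the
// SampleAggregator does with the per-Sampler sample sets.
func (s *sampler) merge(other *sampler) {
	for _, r := range other.h {
		s.add(r)
	}
}

func main() {
	s1 := &sampler{k: 3}
	s2 := &sampler{k: 3}
	for i := 0; i < 100; i++ {
		s1.sampleRow(fmt.Sprintf("range1-row%d", i))
		s2.sampleRow(fmt.Sprintf("range2-row%d", i))
	}
	s1.merge(s2) // s1 now holds a size-3 uniform sample over all 200 rows
	for _, r := range s1.h {
		fmt.Println(r.row, r.rank)
	}
}
```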
This RFC is not proposing a specific algorithm for constructing the
histogram. The literature indicates that a very good choice is the
Max-Diff histogram with the "area" diff metric; this has a fairly
simple construction algorithm. We may implement a simpler histogram
(e.g. equi-depth) as a first iteration. Note that how we generate the
histogram does not affect the histogram data format, so this can be
changed arbitrarily.

[TBD: How many buckets to use for the histogram? SQL Server uses at
most 200.]

Good choices for the sketch algorithm are
[HyperLogLog](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf)
and its improved version
[HLL++](http://research.google.com/pubs/pub40671.html). There are a
few Go implementations of both on GitHub with permissive licenses that
we can use. The infrastructure allows adding better sketch algorithms
later.

A DistSQL plan for this operation will look like this:

### Stats cache

To cache statistics, we employ two caches:
1. *statistics cache*: keyed by table; it stores the information for
   all statistics available for each table, excluding the histogram
   data.
2. *histogram cache*: each entry is a `(table_id, statistic_id)` pair
   and stores the full statistic information, including the histogram
   data.

The statistics cache is populated whenever we need the statistics for
a table. It gets populated by reading the relevant rows from
`table_statistics`; whenever this happens, we can choose to add the
histograms to the histogram cache as well (since we're scanning that
data anyway). The statistics cache can hold a large number of tables,
as the metadata it stores is fairly small.

The histogram cache is more limited because histograms are expected to
be 5-10KB in size. Whenever a histogram is required, the cache is
populated by reading the one row from `table_statistics`.

[TBD: how do we update the statistics cache when a new statistic is
available? One option is to expire entries from the statistics cache
periodically to force a rescan; another option is to gossip a key
whenever a new table statistic is available.]

The API for accessing the cache mirrors the two levels of caching:
there will be a function to obtain information (metadata) for all
statistics of a table, and a function to obtain the full information
for a given statistic. We can also have a function that returns all
statistics that include certain columns (which can potentially be more
efficient than retrieving each histogram separately).
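A minimal sketch of the two-level cache, assuming hypothetical
`TableStatistic`/`HistogramData` types and placeholder loader
functions standing in for reads of `system.table_statistics`; a real
implementation would also need eviction, refresh, and invalidation
(per the TBD above).

```go
package main

import (
	"fmt"
	"sync"
)

// TableStatistic and HistogramData are simplified stand-ins for the rows in
// system.table_statistics and the histogram proto; the fields are illustrative.
type TableStatistic struct {
	StatisticID int
	ColumnIDs   []int
	RowCount    int64
	Cardinality int64
}

type HistogramData struct {
	Buckets []struct{ EqRows, RangeRows int64 }
}

type histogramKey struct{ tableID, statisticID int }

// statsCache sketches the two levels: per-table statistic metadata plus a
// separate (and in practice smaller) cache of full histograms.
type statsCache struct {
	mu         sync.Mutex
	tableStats map[int][]TableStatistic        // statistics cache (no histograms)
	histograms map[histogramKey]*HistogramData // histogram cache

	// Placeholders standing in for reads of system.table_statistics;
	// they are not real APIs.
	loadTableStats func(tableID int) []TableStatistic
	loadHistogram  func(k histogramKey) *HistogramData
}

func (c *statsCache) getTableStats(tableID int) []TableStatistic {
	c.mu.Lock()
	defer c.mu.Unlock()
	if stats, ok := c.tableStats[tableID]; ok {
		return stats
	}
	stats := c.loadTableStats(tableID) // populate on demand
	c.tableStats[tableID] = stats
	return stats
}

func (c *statsCache) getHistogram(tableID, statisticID int) *HistogramData {
	c.mu.Lock()
	defer c.mu.Unlock()
	k := histogramKey{tableID, statisticID}
	if h, ok := c.histograms[k]; ok {
		return h
	}
	h := c.loadHistogram(k) // read the single row from table_statistics
	c.histograms[k] = h
	return h
}

func main() {
	c := &statsCache{
		tableStats: map[int][]TableStatistic{},
		histograms: map[histogramKey]*HistogramData{},
		loadTableStats: func(tableID int) []TableStatistic {
			return []TableStatistic{{StatisticID: 1, ColumnIDs: []int{1}, RowCount: 1000, Cardinality: 100}}
		},
		loadHistogram: func(k histogramKey) *HistogramData { return &HistogramData{} },
	}
	fmt.Println(c.getTableStats(52))
	fmt.Println(c.getHistogram(52, 1) != nil)
}
```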
## Drawbacks

* There are a number of items left for Future Work. It is possible
  some of these should be addressed sooner.

* Are there any additional simplifications that could be made for the
  initial implementation which wouldn't constrain Future Work?

## Rationale and Alternatives

* Range-level statistics could be computed by a new KV operation.
  While this may yield some performance benefits, it would require
  pushing down more knowledge of row encoding, as well as the sketch
  algorithms themselves. The Sampler processor model is quite a bit
  cleaner in terms of separating the implementation.

* Range-level statistics could be computed by a new storage-level
  `StatsQueue` that periodically refreshes the stats for a range.
  Pacing of the queue could be driven by the number of adds/deletes to
  the range since the last refresh. We'd want the range-level stats to
  generate histograms because back-of-the-envelope calculations
  indicate that storing samples for cluster-level aggregation would be
  too large. Storing histograms at the range level might also be too
  large, as we'd have one histogram per column, which could get
  expensive for rows with lots of columns.

  - How do we combine range-level histograms into cluster-level
    histograms?

  - Is it problematic for the range-level stats for different ranges
    of a table to be gathered at different times? How does that affect
    later cluster-level aggregation?

  - We split ranges at table boundaries, but not at index boundaries,
    so a range may have to provide more than one set of statistics.
    The range-level statistics collection will probably need to know
    about the format of keys to identify table/index boundaries.

* Range-level statistics could be gathered during RocksDB compactions.
  The upside of doing so is that the I/O is essentially free. The
  downside is that we'd need to perform the range-level statistics
  gathering in C++. This decision seems independent of much of the
  rest of the machinery; nothing precludes us from gathering
  range-level statistics during RocksDB compactions in the future.

* The centralized stats table provides a new failure mode. If the
  stats table is unavailable, we would proceed without the stats, but
  doing so might make a query run orders of magnitude slower. An
  alternative is to store a table's stats next to the table data
  (e.g. in a separate range). We could then require that the stats are
  present for complex queries (e.g. queries for which the best plan,
  without stats, has very high cost); this is similar to the existing
  failure mode where, if you lose some ranges from a table, you might
  not be able to run queries on that table. The concern about this new
  failure mode is not unique to the stats table: most of the system
  metadata tables have the same concern. Storing the stats for a table
  near the table data could be done using a special key
  (e.g. `/Table/TableID/MaxIndexID`). This would help the
  backup/restore scenario as well.

* There are other ways to obtain and combine reservoir sets, e.g.
  [this blog post](https://ballsandbins.wordpress.com/2014/04/13/distributedparallel-reservoir-sampling/).
  Note though that we would still need to pass the sampling rate with
  each row; otherwise the SampleAggregator can't tell which rows are
  coming from which set. The "top K random numbers" idea is cleaner.

## Future work

* Determining when and what stats to collect. We will probably start
  with an explicit command for calculating statistics. Later we can
  work on automatic stat generation and on automatically choosing
  which statistics to gather. Columns that form indexes are a good
  signal that we may see constraints on those columns, so we can start
  by choosing those columns (or column groups).

* We don't need stats for every column in a table. For example,
  consider the "second line" of an address, the "address complement".
  This is the kind of column that mostly gets printed and seldom gets
  queried. Determining which columns to collect stats on is a tricky
  problem. Perhaps there should be feedback from query execution to
  indicate which columns and groups of columns are being used in
  predicates. Perhaps there should be DBA control to indicate that
  stats on certain columns should not be collected.
* For columns that are already sorted (i.e. the first column in an
  index), we can build the histogram directly instead of using a
  sample. Doing so can generate a more accurate histogram. We're
  leaving this to future work as it can be considered purely an
  optimization.

* Provide support for historical stats in order to better optimize
  `AS OF SYSTEM TIME` queries. The relative rarity of such queries
  motivates not supporting them initially and just using the current
  stats.

* While executing a query we might discover that the expectations we
  inferred from the statistics are inaccurate. At a minimum, a
  significant discrepancy should trigger a stats refresh. The actual
  data retrieved by the query could also be used to update the
  statistics, though this complicates normal query execution. See also
  [self-tuning
  histograms](https://ashraf.aboulnaga.me/pubs/sigmod99sthist.pdf).

* If the stats haven't been updated in a long time and we happen to be
  doing a full scan of the table for some query, we might as well
  recalculate the stats in the same pass.

* Implement a strategy for detecting "heavy hitters" (a.k.a. the N
  most frequent values). The literature says this can be done by
  performing a second pass of sampling. The first pass performs a
  normal reservoir sample. The N most frequent values are very likely
  to appear in this sample, so we sort the sample by frequency and
  perform a second pass where we compute the exact counts for the N
  most frequent values from the first-pass sample. The "heavy hitters"
  would be subtracted from the histogram and maintained in a separate
  list. Note that a second pass over the table data would also allow a
  much better estimation of the distinct values per histogram bucket.
  The downside is that we'd be performing a second pass over the
  table, which would likely double the cost of stats collection.
  Perhaps we can find a query-driven means of determining whether this
  second pass is worthwhile.

* For interleaved tables, we can optimize the stats work so that we
  collect the stats for both the parent and the interleaved table at
  the same time. It is possible that this will not fall into future
  work but will be required in the initial implementation in order to
  support interleaved tables.

## Unresolved questions

* How is the central coordinator selected? Should there be a different
  coordinator per table? How is a request to analyze a table directed
  to the central coordinator?

* Should stats be segregated by partition? Doing so could help if we
  expect significant partition-specific data distributions. Would
  doing so make cross-partition queries more difficult to optimize?
  See the [table partitioning](20170208_sql_partitioning.md) RFC.
  Segregating stats by partition might make repartitioning more
  difficult. Or perhaps it makes stats inaccurate for a period of time
  after repartitioning. If partitions do not have a partition ID (one
  of the proposals), we'll need to find some way to identify the
  partition for stats storage.

* What if a column has very large values? Perhaps we want to adjust
  the number of buckets depending on that.