- Feature Name: SQL Optimizer Statistics
- Status: in-progress
- Start Date: 2017-09-08
- Authors: Peter Mattis, Radu Berinde
- RFC PR: [#18399](https://github.com/cockroachdb/cockroach/pull/18399)
- Cockroach Issue: (one or more # from the issue tracker)

# Summary

This RFC describes the motivations and mechanisms for collecting
statistics to power SQL optimization decisions. The potential
performance impact is large: the decisions an optimizer makes can
speed up queries by orders of magnitude.

# Motivation

Modern SQL optimizers seek the lowest cost for a query. The cost of a
query is usually related to time, but is typically not expressed
directly in units of time. For example, the cost might be in units of
disk seeks or RPCs or some unnamed unit related to I/O. A cost model
estimates the cost for a particular query plan, and the optimizer
seeks to find the lowest-cost plan from among many alternatives. The
inputs to the cost model are the query plan and statistics about the
table data that guide selection between alternatives.

One example of where statistics guide query optimization is join
ordering. Consider the natural join:

`SELECT * FROM a NATURAL JOIN b`

In the absence of other opportunities, this might be implemented as a
hash join. With a hash join, we want to load the smaller set of rows
(either from `a` or `b`) into the hash table and then probe that table
while looping through the larger set of rows. How do we know whether
`a` or `b` is larger? We keep statistics about the cardinality of `a`
and `b`.

Simple table cardinality is sufficient for the above query but fails
in other queries. Consider:

`SELECT * FROM a JOIN b ON a.x = b.x WHERE a.y > 10`

Table statistics might indicate that `a` contains 10x more data than
`b`, but the predicate `a.y > 10` filters out a chunk of the table.
What we care about is whether the scan of `a` after filtering returns
more rows than the scan of `b`. This can be determined by estimating
the selectivity of the predicate `a.y > 10` and multiplying that
selectivity by the cardinality of `a`. The common technique for
estimating selectivity is to collect a histogram on `a.y` (prior to
running the query).

A final example is the query:

`SELECT * FROM system.rangelog WHERE rangeID = ? ORDER BY timestamp DESC LIMIT 100`

Currently, `system.rangelog` does not have an index on `rangeID`, but
does have an index (the primary key) on `timestamp`. This query is
planned so that it sequentially walks through the ranges in descending
timestamp order until it fills up the limit. If there are a large
number of ranges in the system, the selectivity of `rangeID = ?` will
be low, and we can estimate how many ranges we'll need to query in
order to find 100 rows. Rather than walking through the ranges
sequentially, we can query batches of ranges concurrently, where the
size of the batches is calculated from the expected number of matches
per range.
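To make the last example concrete: the expected number of matching
rows per range is the per-range row count multiplied by the estimated
selectivity of `rangeID = ?`, and a reasonable batch size follows from
the limit. The Go sketch below is purely illustrative; the function
name and the numbers in it are hypothetical, not planner code.

```go
package main

import "fmt"

// estimateBatchSize is a hypothetical illustration of sizing concurrent
// range batches from a selectivity estimate; it is not CockroachDB code.
func estimateBatchSize(limit int, rowsPerRange, selectivity float64) int {
	expectedMatchesPerRange := rowsPerRange * selectivity
	if expectedMatchesPerRange <= 0 {
		return limit // no information; fall back to a large batch
	}
	// Number of ranges we expect to need before the LIMIT is satisfied.
	batch := int(float64(limit)/expectedMatchesPerRange + 0.5)
	if batch < 1 {
		batch = 1
	}
	return batch
}

func main() {
	// Assumed numbers: ~10,000 rangelog rows per range and a 0.1%
	// selectivity for `rangeID = ?` give ~10 matches per range, so a
	// batch of ~10 ranges should be enough to satisfy LIMIT 100.
	fmt.Println(estimateBatchSize(100, 10000, 0.001)) // prints 10
}
```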
These examples are simple and clear. The academic literature is
littered with additional details. How do you estimate the selectivity
of multiple conjunctive or disjunctive predicates? How do you estimate
the number of rows output from various operators such as joins and
aggregations? Some of these questions are beyond the scope of this
RFC, which is focused on the collection of basic statistics.

# Terminology

* Selectivity: The number of values that pass a predicate divided by
  the total number of values. The selectivity multiplied by the
  cardinality can be used to compute the total number of values after
  applying the predicate to a set of values. The selectivity is
  interesting independent of the number of values because
  selectivities for multiple predicates can be combined.

* Histogram: Histograms are the classic data structure for computing
  selectivity. One kind that is frequently used is the equi-depth
  histogram, where each bucket contains approximately (or exactly) the
  same number of values. More complex histograms exist that aim to
  provide more accuracy for various data distributions, e.g. "max-diff"
  histograms. A simple one-pass algorithm exists for constructing an
  equi-depth histogram from ordered values. A histogram for unordered
  values can be constructed by first sampling the data (e.g. using
  reservoir sampling), sorting the sample, and then running the
  one-pass algorithm on the sorted data.

* Cardinality: The number of distinct values (in a single column, or
  in a group of columns).

* Sketch: Used for cardinality estimation. Algorithms such as
  [HyperLogLog](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf)
  use a hash of the value to estimate the number of distinct values.
  Sketches can be merged, allowing for parallel computation. For
  example, the sketches for distinct values on different ranges can be
  combined into a single sketch for an entire table.

# Design considerations

* Statistics need to be available on every node performing query
  optimization (i.e. every node receiving SQL queries).

* Statistics collection needs to be low overhead, but not necessarily
  the lowest overhead. Statistics do not need to be updated in real
  time.

* Statistics collection should be decoupled from consumption. A simple
  initial implementation of statistics collection should not preclude
  more complex collection in the future.

* The desired statistics to collect are:
  - a histogram for a given column
  - a cardinality for a given set of columns

* The count of distinct values for a column or group of columns can be
  computed using a sketch algorithm such as
  [HyperLogLog](https://en.wikipedia.org/wiki/HyperLogLog). Such
  sketches can be merged, making it feasible to collect a sketch at
  the range level and combine the sketches to create a table-level
  sketch (a toy illustration of this merging follows this list).

* Optimizing `AS OF SYSTEM TIME` queries implies being able to
  retrieve statistics at a point in time. The statistics do not need
  to be perfectly accurate, but should be relatively close. On the
  other hand, historical queries are relatively rare right now.
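To make the mergeability property concrete, here is a toy
HyperLogLog-style sketch in Go. It only illustrates why per-range
sketches can be combined into a table-level sketch; the precision, the
hash function, and the missing bias corrections are simplifications,
and none of the names below correspond to actual CockroachDB code.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math"
	"math/bits"
)

// p controls the number of registers (2^p); the value is arbitrary here.
const p = 12

// sketch is a toy HyperLogLog-style distinct-count sketch.
type sketch struct {
	registers [1 << p]uint8
}

func (s *sketch) add(value string) {
	h := fnv.New64a()
	h.Write([]byte(value))
	x := h.Sum64()
	idx := x >> (64 - p) // the first p bits of the hash pick a register
	// Rank = position of the first 1-bit in the remaining bits.
	rank := uint8(bits.LeadingZeros64(x<<p|1)) + 1
	if rank > s.registers[idx] {
		s.registers[idx] = rank
	}
}

// merge combines another sketch into s by taking the per-register maximum;
// this is what lets per-range sketches be combined into a table-level one.
func (s *sketch) merge(o *sketch) {
	for i := range s.registers {
		if o.registers[i] > s.registers[i] {
			s.registers[i] = o.registers[i]
		}
	}
}

// estimate returns the raw HyperLogLog cardinality estimate, without the
// small- and large-range corrections a real implementation would apply.
func (s *sketch) estimate() float64 {
	m := float64(len(s.registers))
	sum := 0.0
	for _, r := range s.registers {
		sum += math.Pow(2, -float64(r))
	}
	alpha := 0.7213 / (1 + 1.079/m)
	return alpha * m * m / sum
}

func main() {
	var range1, range2 sketch
	for i := 0; i < 50000; i++ {
		range1.add(fmt.Sprintf("user-%d", i))
	}
	for i := 25000; i < 75000; i++ {
		range2.add(fmt.Sprintf("user-%d", i))
	}
	// Combining the per-range sketches yields a table-level estimate.
	range1.merge(&range2)
	fmt.Printf("estimated distinct values: %.0f (true value: 75000)\n",
		range1.estimate())
}
```

The key point is that `merge` is a per-register maximum, so sketches
computed independently on different ranges combine into the same
sketch a single pass over the whole table would have produced.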
# Design overview

This RFC focuses on the infrastructure and does not set a policy for
what stats to collect or when to collect them; these are discussed
under future work.

A single node per table will act as the central coordinator for
gathering table-level stats. This node sets up a DistSQL operation
that involves reading the primary index and sampling rows (as well as
computing sketches). While performed at a specific timestamp, the
table read operations explicitly skip the timestamp cache in order to
avoid disturbing foreground traffic.

[TBD: which node acts as the stats coordinator for a table? Presumably
there is some sort of lease based on the table ID. We can possibly
reuse or copy whatever mechanism is used for schema changes.]

A new system table provides storage for stats:
`system.table_statistics`. This table contains the count of distinct
values, the count of NULL values, and the histogram buckets for a
column/index.

During query optimization, the stats for a table are retrieved from a
per-node stats cache. The cache is populated on demand from the
`system.table_statistics` table and refreshed periodically.

Pseudo-stats are used whenever actual statistics are unavailable. For
example, the selectivity of a less-than or greater-than predicate is
estimated as 1/3. The academic literature provides additional
heuristics to use when stats are unavailable.

An Admin UI endpoint is provided for visualizing the statistics for a
table.
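As a minimal sketch of the pseudo-stats fallback described above, the
snippet below hard-codes a default selectivity per predicate type.
Only the 1/3 value for `<` and `>` comes from this RFC; the other
constants are illustrative placeholders, not values the optimizer is
committed to.

```go
package main

import "fmt"

// defaultSelectivity returns a pseudo-selectivity to use when no table
// statistics are available. The 1/3 value for range predicates is the one
// mentioned in this RFC; the remaining constants are placeholders.
func defaultSelectivity(op string) float64 {
	switch op {
	case "<", ">", "<=", ">=":
		return 1.0 / 3.0
	case "=":
		return 0.01 // placeholder guess for equality
	default:
		return 1.0 // assume no filtering
	}
}

func main() {
	fmt.Println(defaultSelectivity(">")) // 0.333...
}
```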
## Detailed design

### `system.table_statistics`

The table-level statistics are stored in a new
`system.table_statistics` table. This table contains a row per
attribute or group of attributes that we maintain stats for. The
schema is:

```
CREATE TABLE system.table_statistics (
  table_id INT,
  statistic_id INT,
  column_ids INT[],
  created_at TIMESTAMP,
  row_count INT,
  cardinality INT,
  null_values INT,
  histogram BYTES,
  PRIMARY KEY (table_id, statistic_id)
)
```

Each table has zero or more rows in the `table_statistics` table,
corresponding to the statistics maintained for the table.

* `table_id` is the ID of the table.
* `statistic_id` is an ID that identifies each particular statistic
  for a table.
* `column_ids` stores the IDs of the columns for which the statistic
  is generated.
* `created_at` is the time at which the statistic was generated.
* `row_count` is the total number of rows in the table.
* `cardinality` is the estimated cardinality (the number of distinct
  values on the columns in `column_ids`).
* `null_values` is the number of rows that have a NULL in any of the
  columns in `column_ids`; these rows don't contribute to the
  cardinality.
* `histogram` is optional and can only be set if there is a single
  column; it encodes a proto defined as:

```
message HistogramData {
  message Bucket {
    // The estimated number of rows that are equal to upper_bound.
    optional int64 eq_rows;

    // The estimated number of rows in the bucket (excluding those
    // that are equal to upper_bound). Splitting the count in two
    // makes the histogram effectively equivalent to a histogram with
    // twice as many buckets, with every other bucket containing a
    // single value. This might be particularly advantageous if the
    // histogram algorithm makes sure the top "heavy hitters" (most
    // frequent elements) are bucket boundaries (similar to a
    // compressed histogram).
    optional int64 range_rows;

    // The upper boundary of the bucket. The column values for the
    // upper bound are encoded using the same ordered encoding used
    // for index keys.
    optional bytes upper_bound;
  }

  // Histogram buckets. Note that NULL values are excluded from the
  // histogram.
  repeated Bucket buckets;
}
```

[TBD: using a proto for the histogram data is convenient but makes
ad-hoc querying of the stats data more difficult. We could store the
buckets in a series of parallel SQL arrays, or we could store them in
a separate table.]

### Estimating selectivity

Given a predicate such as `a > 1` and a histogram on `a`, how do we
compute the selectivity of the predicate? For range predicates such as
`>` and `<`, we determine which bucket the value lies in, interpolate
within the bucket to estimate the fraction of the bucket's values that
match the predicate (assuming a uniform data distribution), and then
perform a straightforward calculation to determine the fraction of
values in the entire histogram that the predicate will allow through.

An equality predicate deserves special mention as it is common and
somewhat more challenging to estimate the selectivity for. The
distinct values in a bucket are assumed to be uniformly distributed
across the bucket. For example, consider a histogram on `a` where the
bucket containing the value `1` has a `count` of `100` and a
`distinct_count` of `2`. This means there were `100` rows in the table
with values of `a` that fell in the bucket, but only `2` distinct
values of `a`. We estimate `a = 1` to match `100 / 2 == 50 rows` (an
additional division by the table cardinality can provide a fraction).
In general, the estimated row count for an equality predicate is
`bucket_count / bucket_distinct_count`.
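The Go sketch below makes the bucket arithmetic concrete for a
histogram over integers. The `bucket` struct mirrors the
`eq_rows`/`range_rows`/`upper_bound` split from `HistogramData`, but
uses decoded integer bounds and adds a hypothetical per-bucket
distinct count for the equality case; it illustrates the estimation
rules above and is not the planner's actual code.

```go
package main

import "fmt"

// bucket mirrors HistogramData.Bucket with a decoded integer upper bound
// and a hypothetical distinct count for the range portion. Buckets are
// ordered by upperBound; a bucket's lower bound is the previous bucket's
// upperBound (exclusive).
type bucket struct {
	eqRows        float64 // rows equal to upperBound
	rangeRows     float64 // rows strictly between the bounds
	rangeDistinct float64 // assumed distinct values in the range portion
	upperBound    int64
}

type histogram struct {
	buckets   []bucket
	totalRows float64
}

// selectivityEq estimates the selectivity of `col = v`, assuming the
// distinct values in a bucket are uniformly distributed across it.
func (h histogram) selectivityEq(v int64) float64 {
	for _, b := range h.buckets {
		if v == b.upperBound {
			return b.eqRows / h.totalRows
		}
		if v < b.upperBound {
			if b.rangeDistinct == 0 {
				return 0
			}
			return (b.rangeRows / b.rangeDistinct) / h.totalRows
		}
	}
	return 0 // v is above every bucket
}

// selectivityGt estimates the selectivity of `col > v` by linearly
// interpolating within the bucket that contains v.
func (h histogram) selectivityGt(v int64) float64 {
	rows := 0.0
	lower := int64(0) // assumed lower bound of the first bucket
	for _, b := range h.buckets {
		switch {
		case v >= b.upperBound:
			// The entire bucket is at or below v; it contributes nothing.
		case v < lower:
			rows += b.rangeRows + b.eqRows // the entire bucket is above v
		default:
			// v falls inside this bucket: interpolate the fraction above v.
			frac := float64(b.upperBound-1-v) / float64(b.upperBound-lower)
			rows += b.rangeRows*frac + b.eqRows
		}
		lower = b.upperBound
	}
	return rows / h.totalRows
}

func main() {
	h := histogram{
		buckets: []bucket{
			{eqRows: 50, rangeRows: 100, rangeDistinct: 2, upperBound: 10},
			{eqRows: 25, rangeRows: 200, rangeDistinct: 50, upperBound: 100},
		},
		totalRows: 375,
	}
	// Mirrors the example above: 100 rows / 2 distinct values = 50 rows.
	fmt.Printf("selectivity of a = 1:  %.3f\n", h.selectivityEq(1))
	fmt.Printf("selectivity of a > 50: %.3f\n", h.selectivityGt(50))
}
```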
### Collecting statistics

Consider the table:

```
CREATE TABLE test (
  k INT PRIMARY KEY,
  v STRING,
  INDEX (v, k)
)
```

At a high level, we will maintain statistics about each column and
each tuple of columns composing an index. For the above table, that
translates into statistics about `k`, `v` and the tuple `(v, k)`. The
per-column statistics allow estimation of the selectivity of a
predicate on that column. Similarly, the per-index statistics allow
estimation of the selectivity of a predicate on the columns in the
index. Note that, in general, there are vastly more possible
permutations and combinations of columns that we don't maintain
statistics on than there are indexes. The intuition behind collecting
statistics on indexes is that the existence of an index is a hint that
the table will be queried in such a way that the columns in the index
will be used.

The collection of statistics for a table span will be provided by a
`Sampler` processor. This processor receives rows from a TableReader
and:
- calculates sketches for estimating the density vector, and
- selects a random sample of rows of a given size; it does this by
  generating a random number for each row and choosing the rows with
  the top K values. The random values are returned with each row,
  allowing multiple sample sets to be combined into a single set.

```
enum SketchType {
  HLL_PLUS_PLUS_v1 = 0
}

// SamplerSpec is the specification of a "sampler" processor which
// returns a sample (random subset) of the input columns and computes
// cardinality estimation sketches on sets of columns.
//
// The sampler is configured with a sample size and sets of columns
// for the sketches. It produces one row with global statistics, one
// row with sketch information for each sketch, plus at most
// sample_size sampled rows.
//
// The internal schema of the processor is formed of two column
// groups:
//   1. sampled row columns:
//      - columns that map 1-1 to the columns in the input (same
//        schema as the input).
//      - an INT column with the random value associated with the row
//        (this is necessary for combining sample sets).
//   2. sketch columns:
//      - an INT column indicating the sketch index
//        (0 to len(sketches) - 1).
//      - an INT column indicating the number of rows processed.
//      - an INT column indicating the number of NULL values on the
//        first column of the sketch.
//      - a BYTES column with the binary sketch data (format dependent
//        on the sketch type).
// Rows have NULLs on either all the sampled row columns or on all the
// sketch columns.
message SamplerSpec {
  optional uint32 sample_size;

  message SketchSpec {
    optional SketchType sketch_type;

    // Each value is an index identifying a column in the input stream.
    repeated uint32 columns;
  }
  repeated SketchSpec sketches;
}
```

A separate `SampleAggregator` processor aggregates the results from
multiple Samplers and generates the histogram and other statistics
data. The processor is configured to populate the relevant rows in
`system.table_statistics`.

```
message SampleAggregator {
  optional sqlbase.TableDescriptor table;

  // The processor merges reservoir sample sets into a single
  // sample set of this size. This must match the sample size
  // used for each Sampler.
  optional uint32 sample_size;

  message SketchSpec {
    optional SketchType sketch_type;

    // Each value is a sqlbase.ColumnID.
    repeated uint32 column_ids;

    // If set, we generate a histogram for the first column in the sketch.
    optional bool generate_histogram;
  }
  repeated SketchSpec sketches;

  // The i-th value indicates the column ID of the i-th sampled row
  // column.
  repeated uint32 sampled_column_ids;

  // Indicates the columns for which we want to generate histograms.
  // Must be a subset of the sampled column IDs.
  repeated uint32 histogram_column_ids;
}
```

The SampleAggregator combines the samples into a single sample set (by
choosing the rows with the top K random values across all the sets);
the histogram is constructed from this sample set. In terms of
implementation, both the Sampler and the SampleAggregator can use the
same method of obtaining the top-K rows. Some efficient methods (a toy
sketch of this scheme follows this list):
- maintain the top K rows in a heap. Updating the heap is a
  logarithmic operation, but it only needs to happen when we find a
  new top element (for the initial sampling, we have only O(K log K)
  operations).
- maintain the top 2K (or 1.5K) rows; when the set is full, select the
  Kth element (which takes O(K)) and trim the set down to K.
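Here is a toy Go version of the heap-based method and of the
SampleAggregator-style merge, showing how reusing the random value
attached to each sampled row keeps the combined sample uniform. The
types and helpers are hypothetical; this is not the DistSQL processor
implementation.

```go
package main

import (
	"container/heap"
	"fmt"
	"math/rand"
)

// sampledRow pairs a row with the random rank used for top-K sampling,
// mirroring the extra INT column the Sampler emits with each sampled row.
// The row type is a placeholder for illustration.
type sampledRow struct {
	row  string
	rank int64
}

// minHeap keeps the K rows with the *largest* ranks by evicting the
// smallest rank when the heap is full.
type minHeap []sampledRow

func (h minHeap) Len() int            { return len(h) }
func (h minHeap) Less(i, j int) bool  { return h[i].rank < h[j].rank }
func (h minHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x interface{}) { *h = append(*h, x.(sampledRow)) }
func (h *minHeap) Pop() interface{} {
	old := *h
	x := old[len(old)-1]
	*h = old[:len(old)-1]
	return x
}

// sampler selects a uniform random sample of size k using the
// "top K random values" scheme described above.
type sampler struct {
	k int
	h minHeap
}

// sampleRow assigns a fresh random rank to a new input row.
func (s *sampler) sampleRow(row string) {
	s.add(sampledRow{row: row, rank: rand.Int63()})
}

// add inserts an already-ranked row; reusing the existing rank is what
// makes merging sample sets from multiple Samplers correct.
func (s *sampler) add(r sampledRow) {
	if len(s.h) < s.k {
		heap.Push(&s.h, r)
	} else if r.rank > s.h[0].rank {
		s.h[0] = r
		heap.Fix(&s.h, 0)
	}
}

// merge folds another sample set into this one, which is what the
// SampleAggregator does with the per-Sampler sample sets.
func (s *sampler) merge(other *sampler) {
	for _, r := range other.h {
		s.add(r)
	}
}

func main() {
	s1 := &sampler{k: 3}
	s2 := &sampler{k: 3}
	for i := 0; i < 100; i++ {
		s1.sampleRow(fmt.Sprintf("range1-row%d", i))
		s2.sampleRow(fmt.Sprintf("range2-row%d", i))
	}
	s1.merge(s2) // s1 now holds a size-3 uniform sample over all 200 rows
	for _, r := range s1.h {
		fmt.Println(r.row, r.rank)
	}
}
```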
This RFC is not proposing a specific algorithm for constructing the
histogram. The literature indicates that a very good choice is the
Max-Diff histogram with the "area" diff metric; this has a fairly
simple construction algorithm. We may implement a simpler histogram
(e.g. equi-depth) as a first iteration. Note that how we generate the
histogram does not affect the histogram data format, so this can be
changed arbitrarily.

[TBD: How many buckets to use for the histogram? SQL Server uses at
most 200.]

Good choices for the sketch algorithm are
[HyperLogLog](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf)
and its improved version
[HLL++](http://research.google.com/pubs/pub40671.html). There are a
few Go implementations of both on GitHub with permissive licenses that
we can use. The infrastructure allows adding better sketch algorithms
later.

A DistSQL plan for this operation will look like this:

### Stats cache

To cache statistics, we employ two caches:
1. *statistics cache*: keyed by table; it stores the information for
   all statistics available for each table, excluding the histogram
   data.
2. *histogram cache*: each entry is a `(table_id, statistic_id)` pair
   and stores the full statistic information, including the histogram
   data.

The statistics cache is populated whenever we need the statistics for
a table. It gets populated by reading the relevant rows from
`table_statistics`; whenever this happens, we can choose to add the
histograms to the histogram cache as well (since we're scanning that
data anyway). The statistics cache can hold a large number of tables,
as the metadata it stores is fairly small.

The histogram cache is more limited because histograms are expected to
be 5-10KB in size. Whenever a histogram is required, the cache is
populated by reading the one row from `table_statistics`.

[TBD: how do we update the statistics cache when a new statistic is
available? One option is to expire entries from the statistics cache
periodically to force a rescan; another option is to gossip a key
whenever a new table statistic is available.]

The API for accessing the cache mirrors the two levels of caching:
there will be a function to obtain information (metadata) for all
statistics of a table, and a function to obtain the full information
for a given statistic. We can also have a function that returns all
statistics that include certain columns (which can potentially be more
efficient than retrieving each histogram separately).
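A minimal sketch of the two-level cache, assuming hypothetical
`TableStatistic`/`HistogramData` types and placeholder loader
functions standing in for reads of `system.table_statistics`; a real
implementation would also need eviction, refresh, and invalidation
(per the TBD above).

```go
package main

import (
	"fmt"
	"sync"
)

// TableStatistic and HistogramData are simplified stand-ins for the rows in
// system.table_statistics and the histogram proto; the fields are illustrative.
type TableStatistic struct {
	StatisticID int
	ColumnIDs   []int
	RowCount    int64
	Cardinality int64
}

type HistogramData struct {
	Buckets []struct{ EqRows, RangeRows int64 }
}

type histogramKey struct{ tableID, statisticID int }

// statsCache sketches the two levels: per-table statistic metadata plus a
// separate (and in practice smaller) cache of full histograms.
type statsCache struct {
	mu         sync.Mutex
	tableStats map[int][]TableStatistic        // statistics cache (no histograms)
	histograms map[histogramKey]*HistogramData // histogram cache

	// Placeholders standing in for reads of system.table_statistics;
	// they are not real APIs.
	loadTableStats func(tableID int) []TableStatistic
	loadHistogram  func(k histogramKey) *HistogramData
}

func (c *statsCache) getTableStats(tableID int) []TableStatistic {
	c.mu.Lock()
	defer c.mu.Unlock()
	if stats, ok := c.tableStats[tableID]; ok {
		return stats
	}
	stats := c.loadTableStats(tableID) // populate on demand
	c.tableStats[tableID] = stats
	return stats
}

func (c *statsCache) getHistogram(tableID, statisticID int) *HistogramData {
	c.mu.Lock()
	defer c.mu.Unlock()
	k := histogramKey{tableID, statisticID}
	if h, ok := c.histograms[k]; ok {
		return h
	}
	h := c.loadHistogram(k) // read the single row from table_statistics
	c.histograms[k] = h
	return h
}

func main() {
	c := &statsCache{
		tableStats: map[int][]TableStatistic{},
		histograms: map[histogramKey]*HistogramData{},
		loadTableStats: func(tableID int) []TableStatistic {
			return []TableStatistic{{StatisticID: 1, ColumnIDs: []int{1}, RowCount: 1000, Cardinality: 100}}
		},
		loadHistogram: func(k histogramKey) *HistogramData { return &HistogramData{} },
	}
	fmt.Println(c.getTableStats(52))
	fmt.Println(c.getHistogram(52, 1) != nil)
}
```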
## Drawbacks

* There are a number of items left for Future Work. It is possible
  some of these should be addressed sooner.

* Are there any additional simplifications that could be made for the
  initial implementation which wouldn't constrain Future Work?

## Rationale and Alternatives

* Range-level statistics could be computed by a new KV operation.
  While this may yield some performance benefits, it would require
  pushing down more knowledge of row encoding, as well as the sketch
  algorithms themselves. The Sampler processor model is quite a bit
  cleaner in terms of separating the implementation.

* Range-level statistics could be computed by a new storage-level
  `StatsQueue` that periodically refreshes the stats for a range.
  Pacing of the queue could be driven by the number of adds/deletes to
  the range since the last refresh. We'd want the range-level stats to
  generate histograms because back-of-the-envelope calculations
  indicate that storing samples for cluster-level aggregation would be
  too large. Storing histograms at the range level might also be too
  large, as we'd have one histogram per column, which could get
  expensive for rows with lots of columns.

  - How do we combine range-level histograms into cluster-level
    histograms?

  - Is it problematic for the range-level stats for different ranges
    of a table to be gathered at different times? How does that affect
    later cluster-level aggregation?

  - We split ranges at table boundaries, but not at index boundaries,
    so a range may have to provide more than one set of statistics.
    The range-level statistics collection will probably need to know
    about the format of keys to identify table/index boundaries.

* Range-level statistics could be gathered during RocksDB compactions.
  The upside of doing so is that the I/O is essentially free. The
  downside is that we'd need to perform the range-level statistics
  gathering in C++. This decision seems independent of much of the
  rest of the machinery; nothing precludes us from gathering
  range-level statistics during RocksDB compactions in the future.

* The centralized stats table provides a new failure mode. If the
  stats table is unavailable, we would proceed without the stats, but
  doing so might make a query run orders of magnitude slower. An
  alternative is to store a table's stats next to the table data
  (e.g. in a separate range). We could then require that the stats are
  present for complex queries (e.g. queries for which the best plan,
  without stats, has very high cost); this is similar to the existing
  failure mode where, if you lose some ranges from a table, you might
  not be able to run queries on that table. The concern about this new
  failure mode is not unique to the stats table: most of the system
  metadata tables have the same concern. Storing the stats for a table
  near the table data could be done using a special key
  (e.g. `/Table/TableID/MaxIndexID`). This would help the
  backup/restore scenario as well.

* There are other ways to obtain and combine reservoir sets, e.g.
  [this blog post](https://ballsandbins.wordpress.com/2014/04/13/distributedparallel-reservoir-sampling/).
  Note though that we would still need to pass the sampling rate with
  each row; otherwise the SampleAggregator can't tell which rows are
  coming from which set. The "top K random numbers" idea is cleaner.

## Future work

* Determining when and what stats to collect. We will probably start
  with an explicit command for calculating statistics. Later we can
  work on automatic stat generation and on automatically choosing
  which statistics to gather. Columns that form indexes are a good
  signal that we may see constraints on those columns, so we can start
  by choosing those columns (or column groups).

* We don't need stats for every column in a table. For example,
  consider the "second line" of an address, the "address complement".
  This is the kind of column that mostly gets printed and seldom gets
  queried. Determining which columns to collect stats on is a tricky
  problem. Perhaps there should be feedback from query execution to
  indicate which columns and groups of columns are being used in
  predicates. Perhaps there should be DBA control to indicate that
  stats on certain columns should not be collected.
* For columns that are already sorted (i.e. the first column in an
  index), we can build the histogram directly instead of using a
  sample. Doing so can generate a more accurate histogram. We're
  leaving this to future work as it can be considered purely an
  optimization.

* Provide support for historical stats in order to better optimize
  `AS OF SYSTEM TIME` queries. The relative rarity of such queries
  motivates not supporting them initially and just using the current
  stats.

* While executing a query we might discover that the expectations we
  inferred from the statistics are inaccurate. At a minimum, a
  significant discrepancy should trigger a stats refresh. The actual
  data retrieved by the query could also be used to update the
  statistics, though this complicates normal query execution. See also
  [self-tuning
  histograms](https://ashraf.aboulnaga.me/pubs/sigmod99sthist.pdf).

* If the stats haven't been updated in a long time and we happen to be
  doing a full scan of the table for some query, we might as well
  recalculate the stats in the same pass.

* Implement a strategy for detecting "heavy hitters" (a.k.a. the N
  most frequent values). The literature says this can be done by
  performing a second pass of sampling. The first pass performs a
  normal reservoir sample. The N most frequent values are very likely
  to appear in this sample, so we sort the sample by frequency and
  perform a second pass where we compute the exact counts for the N
  most frequent values from the first-pass sample. The "heavy hitters"
  would be subtracted from the histogram and maintained in a separate
  list. Note that a second pass over the table data would also allow a
  much better estimation of the distinct values per histogram bucket.
  The downside is that we'd be performing a second pass over the
  table, which would likely double the cost of stats collection.
  Perhaps we can find a query-driven means of determining whether this
  second pass is worthwhile.

* For interleaved tables, we can optimize the stats work so that we
  collect the stats for both the parent and the interleaved table at
  the same time. It is possible that this will not fall into future
  work but will be required in the initial implementation in order to
  support interleaved tables.

## Unresolved questions

* How is the central coordinator selected? Should there be a different
  coordinator per table? How is a request to analyze a table directed
  to the central coordinator?

* Should stats be segregated by partition? Doing so could help if we
  expect significant partition-specific data distributions. Would
  doing so make cross-partition queries more difficult to optimize?
  See the [table partitioning](20170208_sql_partitioning.md) RFC.
  Segregating stats by partition might make repartitioning more
  difficult. Or perhaps it makes stats inaccurate for a period of time
  after repartitioning. If partitions do not have a partition ID (one
  of the proposals), we'll need to find some way to identify the
  partition for stats storage.

* What if a column has very large values? Perhaps we want to adjust
  the number of buckets depending on that.