
- Feature Name: SQL Optimizer Statistics
- Status: in-progress
- Start Date: 2017-09-08
- Authors: Peter Mattis, Radu Berinde
- RFC PR: [#18399](https://github.com/cockroachdb/cockroach/pull/18399)
- Cockroach Issue: (one or more # from the issue tracker)

# Summary

This RFC describes the motivations and mechanisms for collecting
statistics for use in powering SQL optimization decisions. The
potential performance impact is very large, as the decisions an
optimizer makes can provide orders-of-magnitude speed-ups to queries.

# Motivation

Modern SQL optimizers seek the lowest cost for a query. The cost of a
query is usually related to time, but is typically not expressed
directly in units of time. For example, the cost might be in units of
disk seeks or RPCs or some unnamed unit related to I/O. A cost model
estimates the cost of a particular query plan, and the optimizer seeks
to find the lowest cost plan from among many alternatives. The inputs
to the cost model are the query plan and statistics about the table
data that can guide selection between alternatives.

One example of where statistics guide query optimization is in join
ordering. Consider the natural join:

  `SELECT * FROM a NATURAL JOIN b`

In the absence of other opportunities, this might be implemented as a
hash join. With a hash join, we want to load the smaller set of rows
(either from `a` or `b`) into the hash table and then probe that table
while looping through the larger set of rows. How do we know whether
`a` or `b` is larger? We keep statistics about the cardinality of `a`
and `b`.

Simple table cardinality is sufficient for the above query but fails
in other queries. Consider:

  `SELECT * FROM a JOIN b ON a.x = b.x WHERE a.y > 10`

Table statistics might indicate that `a` contains 10x more data than
`b`, but the predicate `a.y > 10` filters out some fraction of the
table. What we care about is whether the result of the scan of `a`
after filtering returns more rows than the scan of `b`. This can be
accomplished by estimating the selectivity of the predicate `a.y > 10`
and then multiplying that selectivity by the cardinality of `a`. The
common technique for estimating selectivity is to collect a histogram
on `a.y` (prior to running the query).

A final example is the query:

  `SELECT * FROM system.rangelog WHERE rangeID = ? ORDER BY timestamp DESC LIMIT 100`

Currently, `system.rangelog` does not have an index on `rangeID`, but
does have an index (the primary key) on `timestamp`. This query is
planned so that it sequentially walks through the ranges in descending
timestamp order until it fills up the limit. If there are a large
number of ranges in the system, the selectivity of `rangeID = ?` will
be low and we can estimate how many ranges we'll need to query in
order to find 100 rows. Rather than walking through the ranges
sequentially, we can query batches of ranges concurrently, where the
size of the batches is calculated from the expected number of matches
per range.

These examples are simple and clear. The academic literature is
littered with additional details. How do you estimate the selectivity
of multiple conjunctive or disjunctive predicates? How do you estimate
the number of rows output from various operators such as joins and
aggregations? Some of these questions are beyond the scope of this
RFC, which is focused on the collection of basic statistics.

# Terminology

* Selectivity: The number of values that pass a predicate divided by
  the total number of values. The selectivity multiplied by the
  cardinality can be used to compute the total number of values after
  applying the predicate to a set of values. The selectivity is
  interesting independent of the number of values because
  selectivities for multiple predicates can be combined.

* Histogram: Histograms are the classic data structure for computing
  selectivity. One kind that is frequently used is the equi-depth
  histogram, where each bucket contains approximately (or exactly) the
  same number of values. More complex histograms exist that aim to
  provide more accuracy for various data distributions, e.g. "max-diff"
  histograms. A simple one-pass algorithm exists for constructing an
  equi-depth histogram from ordered values. A histogram for unordered
  values can be constructed by first sampling the data (e.g. using
  reservoir sampling), sorting it and then using the one-pass
  algorithm on the sorted data (see the sketch after this list).

* Cardinality: the number of distinct values (on a single column, or
  on a group of columns).

* Sketch: Used for cardinality estimation. Algorithms such as
  [HyperLogLog](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf)
  use a hash of the value to estimate the number of distinct values.
  Sketches can be merged, allowing for parallel computation. For
  example, the sketches for distinct values on different ranges can be
  combined into a single sketch for an entire table.
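
As a concrete illustration of the one-pass algorithm above, here is a
minimal Go sketch that builds an equi-depth histogram from an
already-sorted sample. The `bucket` type and the `int64` value type
are illustrative assumptions, not part of this RFC's data structures.

```
package stats

// bucket is an illustrative stand-in for a histogram bucket.
type bucket struct {
	numRows    int   // number of sampled values in the bucket
	upperBound int64 // inclusive upper boundary of the bucket
}

// equiDepth builds an equi-depth histogram from a sorted sample in a
// single pass: each bucket receives roughly the same number of values,
// and its upper bound is the largest value it contains.
func equiDepth(sorted []int64, numBuckets int) []bucket {
	if len(sorted) == 0 || numBuckets <= 0 {
		return nil
	}
	// ceil(len(sorted) / numBuckets) values per bucket.
	perBucket := (len(sorted) + numBuckets - 1) / numBuckets
	var buckets []bucket
	for start := 0; start < len(sorted); start += perBucket {
		end := start + perBucket
		if end > len(sorted) {
			end = len(sorted)
		}
		buckets = append(buckets, bucket{
			numRows:    end - start,
			upperBound: sorted[end-1],
		})
	}
	return buckets
}
```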

# Design considerations

* Statistics need to be available on every node performing query
  optimization (i.e. every node receiving SQL queries).

* Statistics collection needs to be low overhead but not necessarily
  the lowest overhead. Statistics do not need to be updated in real
  time.

* Statistics collection should be decoupled from consumption. A simple
  initial implementation of statistics collection should not preclude
  more complex collection in the future.

* The desired statistics to collect are:
    - a histogram for a given column
    - a cardinality for a given set of columns

* The count of distinct values for a column or group of columns can be
  computed using a sketch algorithm such as
  [HyperLogLog](https://en.wikipedia.org/wiki/HyperLogLog). Such
  sketches can be merged, making it feasible to collect a sketch at
  the range level and combine the sketches to create a table-level
  sketch.

* Optimizing `AS OF SYSTEM TIME` queries implies being able to
  retrieve statistics at a point in time. The statistics do not need
  to be perfectly accurate, but should be relatively close. On the
  other hand, historical queries are relatively rare right now.

# Design overview

This RFC focuses on the infrastructure and does not set a policy for
what stats to collect or when to collect them; these are discussed
under future work.

A single node per table will act as the central coordinator for
gathering table-level stats. This node sets up a DistSQL operation
that involves reading the primary index and sampling rows (as well as
computing sketches). While performed at a specific timestamp, the
table read operations explicitly skip the timestamp cache in order to
avoid disturbing foreground traffic.

[TBD: which node acts as the stats coordinator for a table? Presumably
there is some sort of lease based on the table ID. Can possibly reuse
or copy whatever mechanism is used for schema changes.]

A new system table provides storage for stats:
`system.table_statistics`. This table contains the count of distinct
values, NULL values and the histogram buckets for a column/index.

During query optimization, the stats for a table are retrieved from a
per-node stats cache. The cache is populated on demand from the
`system.table_statistics` table and refreshed periodically.

Pseudo-stats are used whenever actual statistics are unavailable. For
example, the selectivity of a less-than or greater-than predicate is
estimated as 1/3. The academic literature provides additional
heuristics to use when stats are unavailable.
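
As a concrete illustration, a pseudo-stats helper might look like the
following Go sketch. Only the 1/3 estimate for range predicates comes
from this RFC; the other values are illustrative placeholders.

```
package stats

// pseudoSelectivity returns a heuristic selectivity for a predicate
// when no histogram is available. The 1/3 value for range predicates
// is from this RFC; the rest are illustrative assumptions only.
func pseudoSelectivity(op string) float64 {
	switch op {
	case "<", "<=", ">", ">=":
		return 1.0 / 3.0 // range predicates pass roughly a third of rows
	case "=":
		return 0.01 // placeholder guess for equality without stats
	default:
		return 1.0 // unknown predicates are assumed to filter nothing
	}
}
```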

An Admin UI endpoint is provided for visualizing the statistics for a
table.

## Detailed design

### `system.table_statistics`

The table-level statistics are stored in a new
`system.table_statistics` table. This table contains a row per
attribute or group of attributes that we maintain stats for. The
schema is:

```
CREATE TABLE system.table_statistics (
  table_id INT,
  statistic_id INT,
  column_ids INT[],
  created_at TIMESTAMP,
  row_count INT,
  cardinality INT,
  null_values INT,
  histogram BYTES,
  PRIMARY KEY (table_id, statistic_id)
)
```

Each table has zero or more rows in the `table_statistics` table
corresponding to the histograms maintained for the table.

* `table_id` is the ID of the table.
* `statistic_id` is an ID that identifies each particular
  statistic for a table.
* `column_ids` stores the IDs of the columns for which the statistic
  is generated.
* `created_at` is the time at which the statistic was generated.
* `row_count` is the total number of rows in the table.
* `cardinality` is the estimated cardinality (on all columns).
* `null_values` is the number of rows that have a NULL on any
  of the columns in `column_ids`; these rows don't contribute
  to the cardinality.
* `histogram` is optional and can only be set if there is a single
  column; it encodes a proto defined as:

```
message HistogramData {
  message Bucket {
    // The estimated number of rows that are equal to upper_bound.
    optional int64 eq_rows = 1;

    // The estimated number of rows in the bucket (excluding those
    // that are equal to upper_bound). Splitting the count into two
    // makes the histogram effectively equivalent to a histogram with
    // twice as many buckets, with every other bucket containing a
    // single value. This might be particularly advantageous if the
    // histogram algorithm makes sure the top "heavy hitters" (most
    // frequent elements) are bucket boundaries (similar to a
    // compressed histogram).
    optional int64 range_rows = 2;

    // The upper boundary of the bucket. The column values for the
    // upper bound are encoded using the same ordered encoding used
    // for index keys.
    optional bytes upper_bound = 3;
  }

  // Histogram buckets. Note that NULL values are excluded from the
  // histogram.
  repeated Bucket buckets = 1;
}
```

[TBD: using a proto for the histogram data is convenient but makes
ad-hoc querying of the stats data more difficult. We could store the
buckets in a series of parallel SQL arrays. Or we could store them in
a separate table.]

### Estimating selectivity

Given a predicate such as `a > 1` and a histogram on `a`, how do we
compute the selectivity of the predicate? For range predicates such as
`>` and `<`, we determine which bucket the value lies in, interpolate
within the bucket to estimate the fraction of the bucket values that
match the predicate (assuming a uniform data distribution), and then
perform a straightforward calculation to determine the fraction of
values in the entire histogram that the predicate will allow through.

An equality predicate deserves special mention as it is common and
somewhat more challenging to estimate the selectivity for. The
distinct values in a bucket are assumed to be uniformly distributed
across the bucket. For example, let's consider a histogram on `a`
where the bucket containing the value `1` has a `count` of `100` and a
`distinct_count` of `2`. This means there were `100` rows in the table
with values for `a` that fell in the bucket, but only `2` distinct
values of `a`. We estimate that `a = 1` matches `100 / 2 = 50` rows
(an additional division by the table cardinality converts this row
count into a selectivity fraction). In general, the estimated row
count for an equality predicate is `bucket_count /
bucket_distinct_count`.
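
To make the arithmetic concrete, here is a minimal Go sketch of both
estimates. The `histBucket` fields (explicit lower/upper bounds and a
per-bucket distinct count) are illustrative assumptions and do not
match the `HistogramData` proto above exactly.

```
package stats

// histBucket is an illustrative bucket with explicit bounds, a row
// count and a distinct count.
type histBucket struct {
	lower, upper float64 // value range covered by the bucket
	numRows      float64 // rows falling in the bucket
	numDistinct  float64 // distinct values in the bucket
}

// selectivityGreaterThan estimates the selectivity of "col > v",
// assuming values are uniformly distributed within each bucket.
func selectivityGreaterThan(h []histBucket, v, totalRows float64) float64 {
	var matching float64
	for _, b := range h {
		switch {
		case v < b.lower:
			matching += b.numRows // the entire bucket passes
		case v < b.upper:
			// Interpolate within the bucket.
			matching += b.numRows * (b.upper - v) / (b.upper - b.lower)
		}
	}
	return matching / totalRows
}

// rowsEqual estimates the number of rows matching "col = v": the
// bucket's row count divided by its distinct count, assuming each
// distinct value appears equally often. For the example above,
// rowsEqual returns 100 / 2 = 50.
func rowsEqual(h []histBucket, v float64) float64 {
	for _, b := range h {
		if v >= b.lower && v <= b.upper && b.numDistinct > 0 {
			return b.numRows / b.numDistinct
		}
	}
	return 0
}
```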

### Collecting statistics

Consider the table:

```
CREATE TABLE test (
  k INT PRIMARY KEY,
  v STRING,
  INDEX (v, k)
)
```

At a high level, we will maintain statistics about each column and
each tuple of columns composing an index. For the above table, that
translates into statistics about `k`, `v` and the tuple `(v, k)`. The
per-column statistics allow estimation of the selectivity of a
predicate on that column. Similarly, the per-index statistics allow
estimation of the selectivity of a predicate on the columns in the
index. Note that in general there are vastly more possible
permutations and combinations of columns that we don't maintain
statistics on than indexes. The intuition behind collecting statistics
on indexes is that the existence of the index is a hint that the table
will be queried in such a way that the columns in the index will be
used.

The collection of statistics for a table span will be provided by a
`Sampler` processor. This processor receives rows from a TableReader
and:
 - calculates sketches for estimating the density vector, and
 - selects a random sample of rows of a given size; it does this by
   generating a random number for each row and choosing the rows with
   the top K values. The random values are returned with each row,
   allowing multiple sample sets to be combined into a single set (see
   the sketch below).
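
A minimal Go sketch of this top-K sampling idea, using the "keep 2K
candidates, then trim" strategy discussed later in this RFC; the type
and field names are illustrative only:

```
package sampler

import (
	"math/rand"
	"sort"
)

// sampledRow pairs an opaque row with the random rank used for
// top-K sampling.
type sampledRow struct {
	rank uint64 // random value; the K largest ranks win
	row  []byte // opaque row data
}

// topKSample keeps at most 2*k candidate rows and trims back down to
// the k rows with the largest ranks whenever it fills up.
type topKSample struct {
	k    int
	rows []sampledRow
}

// add assigns a random rank to the row and adds it as a candidate.
func (s *topKSample) add(row []byte) {
	s.rows = append(s.rows, sampledRow{rank: rand.Uint64(), row: row})
	if len(s.rows) >= 2*s.k {
		s.trim()
	}
}

// trim keeps only the k rows with the largest random ranks.
func (s *topKSample) trim() {
	sort.Slice(s.rows, func(i, j int) bool {
		return s.rows[i].rank > s.rows[j].rank
	})
	if len(s.rows) > s.k {
		s.rows = s.rows[:s.k]
	}
}
```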

```
enum SketchType {
  HLL_PLUS_PLUS_v1 = 0;
}

// SamplerSpec is the specification of a "sampler" processor which
// returns a sample (random subset) of the input columns and computes
// cardinality estimation sketches on sets of columns.
//
// The sampler is configured with a sample size and sets of columns
// for the sketches. It produces one row with global statistics, one
// row with sketch information for each sketch, plus at most
// sample_size sampled rows.
//
// The internal schema of the processor is formed of two column
// groups:
//   1. sampled row columns:
//       - columns that map 1-1 to the columns in the input (same
//         schema as the input).
//       - an INT column with the random value associated with the row
//         (this is necessary for combining sample sets).
//   2. sketch columns:
//       - an INT column indicating the sketch index
//         (0 to len(sketches) - 1).
//       - an INT column indicating the number of rows processed.
//       - an INT column indicating the number of NULL values
//         on the first column of the sketch.
//       - a BYTES column with the binary sketch data (format
//         dependent on the sketch type).
// Rows have NULLs on either all the sampled row columns or on all the
// sketch columns.
message SamplerSpec {
  optional uint32 sample_size = 1;

  message SketchSpec {
    optional SketchType sketch_type = 1;

    // Each value is an index identifying a column in the input stream.
    repeated uint32 columns = 2;
  }
  repeated SketchSpec sketches = 2;
}
```

A different `SampleAggregator` processor aggregates the results from
multiple Samplers and generates the histogram and other statistics
data. The processor is configured to populate the relevant rows in
`system.table_statistics`.


```
message SampleAggregator {
  optional sqlbase.TableDescriptor table = 1;

  // The processor merges reservoir sample sets into a single
  // sample set of this size. This must match the sample size
  // used for each Sampler.
  optional uint32 sample_size = 2;

  message SketchSpec {
    optional SketchType sketch_type = 1;

    // Each value is a sqlbase.ColumnID.
    repeated uint32 column_ids = 2;

    // If set, we generate a histogram for the first column in the sketch.
    optional bool generate_histogram = 3;
  }
  repeated SketchSpec sketches = 3;

  // The i-th value indicates the column ID of the i-th sampled row
  // column.
  repeated uint32 sampled_column_ids = 4;

  // Indicates the columns for which we want to generate histograms.
  // Must be a subset of the sampled column IDs.
  repeated uint32 histogram_column_ids = 5;
}
```

The SampleAggregator combines the samples into a single sample set (by
choosing the rows with the top K random values across all the sets);
the histogram is constructed from this sample set. In terms of
implementation, both the Sampler and SampleAggregator can use the same
method of obtaining the top K rows. Some efficient methods:
 - maintain the top K rows in a heap. Updating the heap is a
   logarithmic operation, but it only needs to happen when we find a
   new top element (for the initial sampling, we have only O(K log K)
   operations).
 - maintain the top 2K (or 1.5K) rows; when the set is full, select
   the Kth element (which takes O(K)) and trim the set down to K, as
   in the earlier Sampler sketch and the merge sketch below.
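
Reusing the illustrative types from the Sampler sketch above, merging
the per-Sampler sample sets reduces to the same top-K selection:
because every row carries its random rank, taking the top K ranks
across all sets is equivalent to sampling the union directly.

```
// mergeSamples combines the sample sets produced by multiple Samplers
// into a single sample of size k.
func mergeSamples(k int, sets ...[]sampledRow) []sampledRow {
	merged := topKSample{k: k}
	for _, set := range sets {
		for _, r := range set {
			merged.rows = append(merged.rows, r)
			if len(merged.rows) >= 2*k {
				merged.trim()
			}
		}
	}
	merged.trim()
	return merged.rows
}
```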

This RFC is not proposing a specific algorithm for constructing the
histogram. The literature indicates that a very good choice is the
Max-Diff histogram with the "area" diff metric, which has a fairly
simple construction algorithm. We may implement something simpler
(e.g. an equi-depth histogram) as a first iteration. Note that how we
generate the histogram does not affect the histogram data format, so
this can be changed arbitrarily.

[TBD: How many buckets to use for the histogram? SQL Server uses at
most 200.]

Good choices for the sketch algorithm are
[HyperLogLog](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf)
and its improved version
[HLL++](http://research.google.com/pubs/pub40671.html). There are a
few Go implementations of both on GitHub with permissive licenses that
we can use. The infrastructure allows adding better sketch algorithms
later.
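
For illustration, here is how per-range sketches could be computed and
merged using one such permissively licensed library
(github.com/axiomhq/hyperloglog, used here as an example, not a
decision of this RFC):

```
package main

import (
	"fmt"

	"github.com/axiomhq/hyperloglog"
)

func main() {
	// One sketch per range, each fed the values from that range.
	range1 := hyperloglog.New14()
	range2 := hyperloglog.New14()
	for i := 0; i < 1000; i++ {
		range1.Insert([]byte(fmt.Sprintf("user-%d", i)))
	}
	for i := 500; i < 1500; i++ {
		range2.Insert([]byte(fmt.Sprintf("user-%d", i)))
	}

	// Merging the per-range sketches yields a table-level sketch;
	// the estimate approximates the 1500 distinct values in the union.
	if err := range1.Merge(range2); err != nil {
		panic(err)
	}
	fmt.Println("estimated distinct values:", range1.Estimate())
}
```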

A DistSQL plan for this operation will look like this:
![Sample plan](images/stats-distsql-plan.png?raw=true "Logical Plan")

### Stats cache

To cache statistics, we employ two caches:
  1. *statistics cache*: keyed by table; stores information
     for all statistics available for each table. The information
     excludes the histogram data.
  2. *histogram cache*: each entry is a `(table_id, statistic_id)`
     pair and stores the full statistic information, including the
     histogram data.

The statistics cache is populated whenever we need the statistics for
a table. It gets populated by reading the relevant rows from
`table_statistics`; whenever this happens, we can choose to add the
histograms to the histogram cache as well (since we're scanning that
data anyway). The statistics cache can hold a large number of tables,
as the metadata it stores is fairly small.

The histogram cache is more limited because histograms are expected to
be 5-10KB in size. Whenever a histogram is required, the cache is
populated by reading the one row from `table_statistics`.

[TBD: how do we update the statistics cache when a new statistic is
available? One option is to expire entries from the statistics cache
periodically to force a rescan; another option is to gossip a key
whenever a new table statistic is available.]

The API for accessing the cache mirrors the two levels of caching:
there will be a function to obtain information (metadata) for all
statistics of a table, and a function to obtain the full information
for a given statistic. We can also have a function that returns all
statistics that include certain columns (which can potentially be more
efficient than retrieving each histogram separately).
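
A minimal Go sketch of what that API could look like; all type and
method names here are illustrative, not final:

```
package stats

import "time"

// HistogramData stands in for the proto-encoded histogram.
type HistogramData struct {
	Buckets []byte // placeholder for the encoded bucket data
}

// TableStatistic is the metadata held in the statistics cache; the
// histogram itself is only fetched through the histogram cache.
type TableStatistic struct {
	TableID     int64
	StatisticID int64
	ColumnIDs   []int64
	CreatedAt   time.Time
	RowCount    int64
	Cardinality int64
	NullValues  int64
}

// StatsCache mirrors the two levels of caching described above.
type StatsCache interface {
	// GetTableStats returns metadata for all statistics of a table,
	// reading from system.table_statistics on a cache miss.
	GetTableStats(tableID int64) ([]TableStatistic, error)

	// GetHistogram returns the histogram for one statistic, populating
	// the histogram cache from the corresponding row on a miss.
	GetHistogram(tableID, statisticID int64) (*HistogramData, error)

	// GetStatsForColumns returns all statistics that include the given
	// columns, potentially more efficiently than fetching each
	// histogram separately.
	GetStatsForColumns(tableID int64, columnIDs []int64) ([]TableStatistic, error)
}
```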

## Drawbacks

* There are a number of items left for Future Work. It is possible
  some of these should be addressed sooner.

* Are there any additional simplifications that could be made for the
  initial implementation which wouldn't constrain Future Work?

## Rationale and Alternatives

* Range-level statistics could be computed by a new KV operation.
  While this may yield some performance benefits, it would require
  pushing down more knowledge of row encoding, as well as the sketch
  algorithms themselves. The Sampler processor model is quite a bit
  cleaner in terms of separating the implementation.

* Range-level statistics could be computed by a new storage-level
  `StatsQueue` that periodically refreshes the stats for a
  range. Pacing of the queue could be driven by the number of
  adds/deletes to the range since the last refresh. We'd want the
  range-level stats to generate histograms because
  back-of-the-envelope calculations indicate that storing samples for
  cluster-level aggregation would be too large. Storing histograms at
  the range level might also be too large as we'd have one histogram
  per column, which could get expensive for rows with lots of columns.

  - How to combine range-level histograms into cluster-level
    histograms?

  - Is it problematic for the range-level stats for different ranges of
    a table to be gathered at different times? How does that affect
    later cluster-level aggregation?

  - We split ranges at table boundaries, but not at index boundaries. So
    a range may have to provide more than one set of statistics. The
    range-level statistics collection will probably need to know about
    the format of keys to identify table/index boundaries.

* Range-level statistics could be gathered during RocksDB
  compactions. The upside of doing so is that the I/O is essentially
  free. The downside is that we'd need to perform the range-level
  statistics gathering in C++. This decision seems independent of much
  of the rest of the machinery. Nothing precludes us from gathering
  range-level statistics during RocksDB compactions in the future.

* The centralized stats table provides a new failure mode. If the
  stats table is unavailable we would proceed without the stats, but
  doing so might make a query run orders of magnitude slower. An
  alternative is to store a table's stats next to the table data
  (e.g. in a separate range). We could then require that the stats are
  present for complex queries (e.g. queries for which the best plan,
  without stats, has very high cost); this is similar to the existing
  failure mode where if you lose some ranges from a table, you might
  not be able to run queries on that table. The concern about the new
  failure mode is not new to the stats table: most of the system
  metadata tables have the same concern. Storing the stats for a table
  near the table data could be done using a special key
  (e.g. `/Table/TableID/MaxIndexID`). This would help the
  backup/restore scenario as well.

* There are other ways to obtain and combine reservoir sets, e.g. the
  approach in [this blog post](https://ballsandbins.wordpress.com/2014/04/13/distributedparallel-reservoir-sampling/).
  Note though that we would still need to pass the sampling rate with
  each row; otherwise the SampleAggregator can't tell which rows are
  coming from which set. The "top K random numbers" idea is cleaner.

## Future work

* Determining when and what stats to collect. We will probably start
  with an explicit command for calculating statistics. Later we can
  work on automatic stat generation, and automatically choosing which
  statistics to gather. Columns that form indexes are a good signal
  that we may see constraints on those columns, so we can start by
  choosing those columns (or column groups).

* We don't need stats for every column in a table. For example,
  consider the "second line" of an address, the "address complement".
  This is the kind of column that mostly gets printed, and seldom gets
  queried. Determining which columns to collect stats on is a tricky
  problem. Perhaps there should be feedback from query execution to
  indicate which columns and groups of columns are being used in
  predicates. Perhaps there should be DBA control to indicate that
  stats on certain columns should not be collected.

* For columns that are already sorted (i.e. the first column in an
  index), we can build the histogram directly instead of using a
  sample. Doing so can generate a more accurate histogram. We're
  leaving this to future work as it can be considered purely an
  optimization.

* Provide support for historical stats in order to better optimize `AS
  OF SYSTEM TIME` queries. The relative rarity of such queries
  motivates not supporting them initially and just using the current
  stats.

* While executing a query we might discover that the expectations we
  inferred from the statistics are inaccurate. At a minimum, a
  significant discrepancy should trigger a stats refresh. The actual
  data retrieved by the query could also be used to update the
  statistics, though this complicates normal query execution. See also
  [self-tuning
  histograms](https://ashraf.aboulnaga.me/pubs/sigmod99sthist.pdf).

* If the stats haven't been updated in a long time and we happen to be
  doing a full table-range scan for some query, we might as well
  recalculate the stats too in the same pass.

* Implement a strategy for detecting "heavy hitters" (a.k.a. the N
  most frequent values). The literature says this can be done by
  performing a second pass of sampling. The first pass performs a
  normal reservoir sample. The N most frequent values are very likely
  to appear in this sample, so sort the sample by frequency and
  perform a second pass where we compute the exact counts for the N
  most frequent values in the first-pass sample (see the sketch after
  this list). The "heavy hitters" would be subtracted from the
  histogram and maintained in a separate list. Note that a second pass
  over the table data would also allow a much better estimation of the
  distinct values per histogram bucket. The downside is that we'd be
  performing a second pass over the table, which would likely double
  the cost of stats collection. Perhaps we can find a query-driven
  means of determining whether this second pass is worthwhile.

* For interleaved tables, we can optimize the stats work so that we
  collect the stats for both the parent and interleaved table at the
  same time. It is possible that this will not fall into future work
  but be required in the initial implementation in order to support
  interleaved tables.
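
For the "heavy hitters" item above, selecting the candidates from the
first-pass sample might look like the following Go sketch (the names
and the string value type are illustrative; the exact counts would
then be computed in the second pass):

```
package stats

import "sort"

// topNFromSample returns the n values that appear most frequently in
// a sample. These are the candidate "heavy hitters" whose exact
// counts would be computed in a second pass over the table.
func topNFromSample(sample []string, n int) []string {
	counts := make(map[string]int)
	for _, v := range sample {
		counts[v]++
	}
	values := make([]string, 0, len(counts))
	for v := range counts {
		values = append(values, v)
	}
	// Sort by descending frequency in the sample.
	sort.Slice(values, func(i, j int) bool {
		return counts[values[i]] > counts[values[j]]
	})
	if len(values) > n {
		values = values[:n]
	}
	return values
}
```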

## Unresolved questions

* How is the central coordinator selected? Should there be a different
  coordinator per table? How is a request to analyze a table directed
  to the central coordinator?

* Should stats be segregated by partition? Doing so could help if we
  expect significant partition-specific data distributions. Would
  doing so make cross-partition queries more difficult to optimize?
  See the [table partitioning](20170208_sql_partitioning.md)
  RFC. Segregating stats by partition might make repartitioning more
  difficult. Or perhaps it makes stats inaccurate for a period of time
  after repartitioning. If partitions do not have a partition ID (one
  of the proposals) we'll need to find some way to identify the
  partition for stats storage.

* What if a column has very large values? Perhaps we want to adjust
  the number of buckets depending on that.