github.com/dolthub/go-mysql-server@v0.18.0/sql/stats/doc.go

package stats

// The `stats` package includes a generic implementation of the sql.Statistic
// interface and mutator methods for applying filter predicates to statistics.
//
// Statistics are used for optimizer transformations whose optimal plans
// depend on the contents and shape of rows in a database, as opposed to
// normalization transformations that always simplify and improve a plan's
// runtime.
//
// The process of exploring the search space of query plans based on table
// data shape is called "costed exploration", or just "costing". Costing
// includes:
//   - indexed range scans given a set of filters and indexes
//   - join ordering and join execution operators
//   - sort enforcement via indexing or sort nodes
//   - aggregating before or after a join
//   - executing scalar subqueries as lateral joins or as nested relations
//
// Statistics are the source of truth for comparing the costs of different
// query plans. The key metric for comparison is "cardinality", which is the
// number of rows that a relation is expected to return. Relational operators
// mutate child statistics to accumulate the effect of that operator on the
// output cardinality.
//
// For example, the query below applies a range filter to a tablescan:
//
// SELECT * from mytable where i > 0
//
// The cardinality of the total output depends on 1) the number of rows in
// mytable, and 2) the fraction of mytable rows whose |i| value is positive.
// The cardinality changes as the values in mytable change. If mytable has
// 10 rows with positive |i| values, the cardinality of the query is 10. If
// we run the query below:
//
// UPDATE mytable set i = i-5;
//
// the cardinality of the query will now depend on whether we've made any
// previously positive |i| values zero or negative.
//
// Histogram
//
// Histograms are the key data structure we use to maintain statistics in
// response to real DML mutations and theoretical mutations by relational
// operators.
//
// A histogram is an ordered set of partitions of a table index. The ideal
// histogram has evenly spaced partitions, enough partitions to provide
// useful metric accuracy, and few enough partitions that updating
// statistics is fast.
//
// A histogram might look something like this:
//
// -10: ******
// 0: **********
// 10: ****
//
// This histogram has 3 buckets. Upper bound values are explicit. The
// contents of each bucket are visualized by the number of asterisks. We
// interpret this visualization as: there are 6 rows less than or equal to -10,
// 10 rows between -10 and 0, and 4 rows between 0 and 10.
//
// We can store this information in a compact form:
//
// type Histogram struct {
//   Buckets []struct{
//     UpperBound int
//     RowCount   int
//   }
// }
//
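// As a concrete and purely illustrative sketch, the three-bucket histogram
// above could be encoded with a named bucket type; the Bucket name and the
// literal below are for exposition only, not this package's actual types:
//
// type Bucket struct {
//   UpperBound int // inclusive upper bound of the bucket's key range
//   RowCount   int // number of rows whose keys fall in this bucket
// }
//
// hist := []Bucket{
//   {UpperBound: -10, RowCount: 6}, // rows with i <= -10
//   {UpperBound: 0, RowCount: 10},  // rows with -10 < i <= 0
//   {UpperBound: 10, RowCount: 4},  // rows with 0 < i <= 10
// }
//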
// Going back to our query:
//
// SELECT * from mytable where i > 0
//
// Our estimate for the output cardinality without a histogram would be
// anywhere between 0 and 20 rows (the table's total row count). The histogram
// gives a more accurate estimate, 4. If the histogram is completely up-to-date
// with the contents of the database, this histogram metric will be the exact
// value.
//
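// Producing that estimate amounts to summing the rows in buckets whose key
// range can satisfy the predicate. A minimal sketch using the illustrative
// Bucket type above (estimateGreaterThan is a made-up helper, not part of
// this package; a real estimator also interpolates within partially covered
// buckets):
//
// func estimateGreaterThan(hist []Bucket, lo int) int {
//   total := 0
//   for _, b := range hist {
//     // a bucket can hold matching rows only if its upper bound exceeds lo
//     if b.UpperBound > lo {
//       total += b.RowCount
//     }
//   }
//   return total
// }
//
// estimateGreaterThan(hist, 0) // 4: only the (0,10] bucket can match i > 0
//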
// Index Costing
//
// We use histograms to compare index options for range scans. For example,
// we might have a query with several filters:
//
// CREATE TABLE ab (
//   a int,
//   b int,
//   PRIMARY KEY(a,b),
//   KEY(b,a)
// );
//
// SELECT * FROM ab WHERE a > 2 and b < 4;
//
// We have a choice whether to read index (ab) from (2,∞), or index (ba) from
// (-∞,4). We consult the histograms to find which plan is the cheapest:
//
// (a,b)
//  0,8:   ********** (10)
//  5,12:  **** (4)
//  10,5:  ******** (8)
//  15,20: ***** (5)
//
// (b,a)
//  0,0:  * (1)
//  2,1:  *** (3)
//  6,5:  ****** (6)
//  10,2: ***************** (17)
//
// We notice that (a) values are evenly distributed from 0 to 15+, but (b)
// values are clustered between 6 and 10+. Applying the predicates to the
// appropriate indexes yields range scan estimates:
//  - (ab)->(2,∞)  = 17 (4+8+5)
//  - (ba)->(-∞,4) = 10 (1+3+6)
//
// The output cardinality after applying both filters might not be 10, but the
// cheapest plan will use (ba) to read the 10 rows from disk before applying
// the (a > 2) filter.
//
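// Picking the access path can then be sketched as truncating each candidate
// index's histogram by the predicate it can serve and comparing the estimates.
// The code below is illustrative only: it reuses the made-up Bucket and
// estimateGreaterThan sketches from above, adds a made-up estimateLessThan,
// and keeps only the leading key column, which is all these single-column
// predicates need:
//
// // estimateLessThan estimates rows matching (col < hi); a bucket qualifies
// // if it starts below hi, and the first bucket's lower bound is open (-inf).
// func estimateLessThan(hist []Bucket, hi int) int {
//   total := 0
//   for i, b := range hist {
//     if i == 0 || hist[i-1].UpperBound < hi {
//       total += b.RowCount
//     }
//   }
//   return total
// }
//
// // leading-column view of the (a,b) and (b,a) histograms: {UpperBound, RowCount}
// abHist := []Bucket{{0, 10}, {5, 4}, {10, 8}, {15, 5}} // keyed on a
// baHist := []Bucket{{0, 1}, {2, 3}, {6, 6}, {10, 17}}  // keyed on b
//
// estimateGreaterThan(abHist, 2) // 17 = 4+8+5 rows read via (a,b)
// estimateLessThan(baHist, 4)    // 10 = 1+3+6 rows read via (b,a), the cheaper scan
//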
// Similar logic applies to multiple and overlapping filters. We accumulate the
// effect of statistics truncation to estimate which index and combination of
// filters reads the fewest rows from disk.

// TODO MCVs
// TODO unique count
// TODO joins
//