github.com/dolthub/go-mysql-server@v0.18.0/sql/stats/doc.go

package stats

// The `stats` package includes a generic implementation of the sql.Statistic
// interface and mutator methods for applying filter predicates to statistics.
//
// Statistics are used for optimizer transformations whose optimal plans
// depend on the contents and shape of rows in a database, as opposed to
// normalization transformations that always simplify and improve a plan's
// runtime.
//
// The process of exploring the search space of query plans based on table
// data shape is called "costed exploration", or just "costing". Costing
// includes:
//  - indexed range scans given a set of filters and indexes
//  - join ordering and join execution operators
//  - sort enforcement via indexing or sort nodes
//  - aggregating before or after a join
//  - executing scalar subqueries as lateral joins or as nested relations
//
// Statistics are the source of truth for comparing the costs of different
// query plans. The key metric for comparison is "cardinality", which is the
// number of rows that a relation is expected to return. Relational operators
// mutate child statistics to accumulate the effect of each operator on the
// output cardinality.
//
// For example, consider a query that applies an inequality filter to a
// tablescan:
//
//	SELECT * from mytable where i > 0
//
// The cardinality of the total output depends on 1) the number of rows in
// mytable, and 2) the fraction of mytable that is positive. The cardinality
// changes as the values in mytable change. If mytable has 10 positive
// rows, the cardinality of the query is 10. If we run the query below:
//
//	UPDATE mytable set i = i-5;
//
// the cardinality of the first query will now depend on whether we've made
// any previously positive |i| values zero or negative.
//
// Histogram
//
// Histograms are the key data structure we use to maintain statistics in
// response to real DML mutations and theoretical mutations by relational
// operators.
//
// A histogram is an ordered set of partitions of a table index. The ideal
// histogram has evenly spaced partitions, enough partitions to provide
// useful metric accuracy, and few enough partitions that updating
// statistics is fast.
//
// A histogram might look something like this:
//
//	-10: ******
//	  0: **********
//	 10: ****
//
// This histogram has 3 buckets. Upper bound values are explicit. The
// contents of each bucket are visualized by the number of asterisks. We
// interpret this visualization as: there are 6 rows less than or equal to
// -10, 10 rows between -10 and 0, and 4 rows between 0 and 10.
//
// We can store this information in a compact form:
//
//	type Histogram struct {
//		Buckets []struct {
//			UpperBound int
//			RowCount   int
//		}
//	}
//
// Going back to our query:
//
//	SELECT * from mytable where i > 0
//
// Our estimate for the output cardinality without a histogram would be
// anywhere between 0 and 20 rows. The histogram gives a more accurate
// estimate, 4. If the histogram is completely up-to-date with the contents
// of the database, this histogram estimate will be the exact value.
//
// Index Costing
//
// We use histograms to compare index options for range scans. For example,
// we might have a query with several filters:
//
//	CREATE TABLE ab (
//	  a int,
//	  b int,
//	  PRIMARY KEY(a,b),
//	  KEY(b,a)
//	);
//
//	SELECT * FROM ab WHERE a > 2 and b < 4;
//
// We have a choice whether to read index (ab) from (2,∞), or index (ba) from
// (-∞,4).
// We consult the histogram to find which plan is the cheapest:
//
//	(a,b)
//	 0,8:  ********** (10)
//	 5,12: **** (4)
//	10,5:  ******** (8)
//	15,20: ***** (5)
//
//	(b,a)
//	 0,0: * (1)
//	 2,1: *** (3)
//	 6,5: ****** (6)
//	10,2: ***************** (17)
//
// We notice that (a) values are consistently distributed from 0-15+, but (b)
// values are clustered between 6-10+. Applying the predicates to the
// appropriate indexes yields range scan estimates:
//  - (ab)->(2,∞) = 17 (4+8+5)
//  - (ba)->(-∞,4) = 10 (1+3+6)
//
// The output cardinality after applying both filters might not be 10, but the
// cheapest plan will use (ba) to read the 10 rows from disk before applying
// the (a > 2) filter.
//
// Similar logic applies to multiple and overlapping filters. We accumulate the
// effect of statistics truncation to estimate which index and combination of
// filters reads the fewest rows from disk.

// TODO MCVs
// TODO unique count
// TODO joins