github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/docs/RFCS/20200507_partial_indexes.md (about)

     1  - Feature Name: Partial Indexes
     2  - Status: in-progress
     3  - Start Date: 2020-05-07
     4  - Authors: mgartner
     5  - RFC PR: [#48557](https://github.com/cockroachdb/cockroach/pull/48557)
     6  - Cockroach Issue: [#9683](https://github.com/cockroachdb/cockroach/issues/9683)
     7  
     8  # Summary
     9  
    10  This RFC proposes the addition of partial indexes to CockroachDB. A partial
    11  index is an index with a boolean predicate expression that only indexes rows in
    12  which the predicate evaluates to true.
    13  
    14  Partial indexes are a common feature in RDBMSs. They can be beneficial in
    15  multiple use-cases. Partial indexes can:
    16  
    17  - Improve read performance, like normal indexes, without overhead for writes
    18    that don't match the predicate expression.
    19  - Scan fewer rows than normal indexes on the same columns due to the partial
    20    index indexing a subset of rows.
    21  - Reduce the total size of the set of indexes required to satisfy queries, by
    22    both reducing the number of rows indexed, and reducing the number of columns
    23    indexed.
    24  - Ensure uniqueness on a subset of rows in a table via `CREATE UNIQUE INDEX
    25    ...`.
    26  
    27  # Guide-level Explanation
    28  
    29  ## Usage
    30  
    31  Partial indexes are created by including a _predicate expression_ via `WHERE
    32  <predicate>` in a `CREATE INDEX` statement. For example:
    33  
    34  ```sql
    35  CREATE INDEX popular_products ON products (price) WHERE units_sold > 1000
    36  ```
    37  
    38  The `popular_products` index only indexes rows where the `units_sold` column has
    39  a value greater than `1000`.
    40  
    41  Partial indexes can only be used to satisfy a query that has a _filter
    42  expression_, `WHERE <filter>`, that implies the predicate expression. For
    43  example, consider the following queries:
    44  
    45  ```sql
    46  SELECT max(price) FROM products
    47  
    48  SELECT max(price) FROM products WHERE review_count > 100
    49  
    50  SELECT max(price) FROM products WHERE units_sold > 500
    51  
    52  SELECT max(price) FROM products WHERE units_sold > 1500
    53  ```
    54  
    55  Only the last query can utilize the partial index `popular_products`. Its filter
    56  expression, `units_sold > 1500`, _implies_ the predicate expression, `units_sold
    57  > 1000`. Every value for `units_sold` that is greater than `1500` is also
    58  greater than `1000`. Stated differently, the predicate expression _contains_ the
    59  filter expression.
    60  
    61  Note that CRDB, like Postgres, will perform a best-effort attempt to prove that
    62  a query filter expression implies a partial index predicate. It is not
    63  guaranteed to prove implication of arbitrarily complex expressions.
    64  
    65  ## Valid Predicate Expressions
    66  
    67  There are some notable restrictions that are enforced on partial index
    68  predicates.
    69  
    70  1. They must result in a boolean.
    71  2. They can only refer to columns in the table being indexed.
    72  3. Functions used within predicates must be immutable. For example, `now()` is
    73     not allowed because its result depends on more than its arguments.
    74  
    75  ## Index Hints
    76  
    77  Unlike full indexes, a partial index cannot be used to satisfy all queries.
    78  Queries that don't imply the partial index predicate may need to return rows
    79  that do not exist in the partial index.
    80  
    81  Therefore, hinting a partial index, with `table@index`, will behave as follows:
    82  
    83  * If the query filter expression can be proved to imply the partial index
    84    predicate, the partial index will be used in the query plan.
    85  * If not, an error will be returned.
    86  
    87  # Reference-level Explanation
    88  
    89  This design covers 5 major aspects of implementing partial indexes: parsing,
    90  testing predicate implication, generating partial index scans, statistics, and
    91  mutation.
    92  
    93  ## Parsing
    94  
    95  The parser will be updated to support new syntax for creating partial indexes.
    96  Below are examples of statements that will be supported.
    97  
    98  ```sql
    99  -- CREATE INDEX
   100  CREATE INDEX ON a (b) WHERE c > 3
   101  CREATE UNIQUE INDEX ON a (b) WHERE c > 3
   102  CREATE INVERTED INDEX ON a (b) WHERE c > 3
   103  
   104  -- CREATE TABLE ... INDEX (not supported by Postgres)
   105  CREATE TABLE a (b INT, INDEX (b) WHERE b > 3)
   106  CREATE TABLE a (b INT, UNIQUE INDEX (b) WHERE b > 3)
   107  CREATE TABLE a (b INT, INVERTED INDEX (b) WHERE b > 3)
   108  
   109  --- CREATE TABLE ... CONSTRAINT (not supported by Postgres)
   110  CREATE TABLE a (b INT, CONSTRAINT c UNIQUE (b) WHERE b > 3)
   111  ```
   112  
   113  In general, the partial index predicate will be the last optional term for
   114  statements that create indexes. For example, below is an overview of the syntax
   115  supported for `CREATE INDEX`.
   116  
   117  ```
   118  CREATE [UNIQUE] [INVERTED] INDEX [name]
   119     ON tbl (cols...)
   120     [STORING ( ... )]
   121     [INTERLEAVE ...]
   122     [PARTITION BY ...]
   123     [WHERE ...]
   124  ```
   125  
   126  Note that the `WHERE predicate` modifier will not be allowed in column
   127  qualifiers. For example, the statement below will **NOT** be supported:
   128  
   129  ```sql
   130  CREATE TABLE a (k INT PRIMARY KEY, b INT UNIQUE WHERE k = 0)
   131  ```
   132  
   133  In order to ensure that predicates are valid (e.g., they result in booleans and
   134  contain no impure functions), we will use the same logic that validates `CHECK`
   135  constraints, `sqlbase.SanitizeVarFreeExpr`. The restrictions for `CHECK`
   136  constraints and partial index predicates are the same.
   137  
   138  ## Testing Predicate Implication
   139  
   140  In order to use a partial index to satisfy a query, the filter expression of the
   141  query must _imply_ that the partial index predicate is true. If the predicate is
   142  not provably true, the rows to be returned may not exist in the partial index,
   143  and it cannot be used. Note that other indexes, partial or not, could still be
   144  used to satisfy the query.
   145  
   146  ### Exact matches
   147  
   148  First, we will check if any conjuncted-expression in the filter is an exact
   149  match to the predicate.
   150  
   151  For example, consider the filter expression `a > 10 AND b < 100` and the partial
   152  index predicate `b < 100`. The second conjuncted expression in the filter, `b <
   153  100`, is an exact match to the predicate `b < 100`. Therefore this filter
   154  implies this predicate.
   155  
   156  We can test for pointer equality to check if the conjuncted-expressions are an
   157  exact match. The `interner` ensures that identical expressions have the same
   158  memory address.
   159  
   160  ### Non-exact matches
   161  
   162  There are cases when an expression implies a predicate, but is not an exact
   163  match.
   164  
   165  For example, `a > 10` implies `a > 0` because all values for `a` that satisfy
   166  `a > 10` also satisfy `a > 0`.
   167  
   168  Constraints and constraint sets can be leveraged to help perform implication
   169  checks. However, they are not a full solution. Constraint sets cannot represent
   170  a disjunction with different columns on each side.
   171  
   172  Consider the following example:
   173  
   174  ```sql
   175  CREATE TABLE products (id INT PRIMARY KEY, price INT, units_sold INT, review_count INT)
   176  CREATE INDEX popular_prds ON t (price) WHERE units_sold > 1000 OR review_count > 100
   177  ```
   178  
   179  No constraint can be created for the top-level predicate expression of
   180  `popular_prds`.
   181  
   182  Therefore, constraints alone cannot help us determine that `popular_prds` can be
   183  scanned to satisfy any of the below queries:
   184  
   185  ```sql
   186  SELECT COUNT(id) FROM products WHERE units_sold > 1500 AND price > 100
   187  
   188  SELECT COUNT(id) FROM products WHERE review_count > 200 AND price < 100
   189  
   190  SELECT COUNT(id) FROM products WHERE (units_sold > 1000 OR review_count > 200) AND price < 100
   191  ```
   192  
   193  In order to accommodate for such expressions, we must walk the filter and
   194  expression trees. At each predicate expression node, we will check if it is
   195  implied by the filter expression node.
   196  
   197  Postgres's [predtest library](https://github.com/postgres/postgres/blob/c9d29775195922136c09cc980bb1b7091bf3d859/src/backend/optimizer/util/predtest.c#L251-L287)
   198  uses this method to determine if a partial index can be used to satisfy a query.
   199  The logic Postgres uses for testing implication of conjunctions, disjunctions,
   200  and "atoms" (anything that is not an `AND` or `OR`) is as follows:
   201  
   202      ("=>" means "implies")
   203  
   204      atom A => atom B if:          A contains B
   205      atom A => AND-expr B if:      A => each of B's children
   206      atom A => OR-expr B if:       A => any of B's children
   207  
   208      AND-expr A => atom B if:      any of A's children => B
   209      AND-expr A => AND-expr B if:  A => each of B's children
   210      AND-expr A => OR-expr B if:   A => any of B's children OR
   211                                      any of A's children => B
   212  
   213      OR-expr A => atom B if:       each of A's children => B
   214      OR-expr A => AND-expr B if:   A => each of B's children
   215      OR-expr A => OR-expr B if:    each of A's children => any of B's children
   216  
   217  Because atoms will not contain any `AND` or `OR` expressions, we can generate a
   218  `constraint.Span` for each of them in order to check for containment. There may
   219  be edge-cases which cannot be handled by `constraint.Span`, such as `IS NULL`
   220  expressions or multi-column values, like tuples.
   221  
   222  At a high-level, to test whether or not `atom A => atom B`, we can perform the
   223  following tests, in order:
   224  
   225  1. If the atoms are equal (pointer equality), then `A => B`.
   226  2. If the column referenced in `A` is not the column referenced in `B`, then `A`
   227     does not imply `B`.
   228  3. If the `constraint.Span` of `A` is contained by the `constraint.Span` of `B`,
   229     then `A => B`.
   230  
   231  There may be special considerations required for handling multi-column atoms,
   232  such as `(a, b) > (1, 2)`. Multi-column spans should be helpful in proving
   233  containment, though it should be noted that Postgres only supports simple
   234  multiple column implications.
   235  
   236  The time complexity of this check is `O(P * F)`, where `P` is the number of
   237  nodes in the predicate expression and `F` is the number of nodes in the filter
   238  expression.
   239  
   240  ## Generating Partial Index Scans
   241  
   242  We will consider utilizing partial indexes for both unconstrained and
   243  constrained scans. Therefore, we'll need to modify both the `GenerateIndexScans`
   244  and `GenerateConstrainedScans` exploration rules (or make new, similar rules).
   245  
   246  In addition, we'll need to update exploration rules for zig-zag joins and
   247  inverted index scans.
   248  
   249  We'll remove redundant filters from the expression when generating a scan over a
   250  partial index. For example:
   251  
   252  ```sql
   253  CREATE TABLE products (id INT PRIMARY KEY, price INT, units_sold INT, units_in_stock INT)
   254  CREATE INDEX idx1 ON products (price) WHERE units_sold > 1000
   255  
   256  SELECT * FROM products WHERE price > 20 AND units_sold > 1000 AND units_in_stock > 0
   257  ```
   258  
   259  When generating the constrained scan over `idx1`, the `units_sold > 1000` filter
   260  can be removed from the outer `Select`, such that only the `units_in_stock > 0`
   261  filter remains.
   262  
   263  Only conjuncted filter expressions that _exactly_ match the
   264  predicate expression can be removed. For example, a filter expression
   265  `units_sold > 1200` could not be removed. This filter would remain and be
   266  applied after the scan in order to remove any rows returned by the scan with
   267  `units_sold` between `1000` and `1200`.
   268  
   269  ## Statistics
   270  
   271  The statistics builder must take into account the predicate expression, in
   272  addition to the filter expression, when generating statistics for a partial
   273  index scan. This is because the number of rows examined via a partial index scan
   274  is dependent on the predicate expression.
   275  
   276  For example, consider the following table, indexes, and query:
   277  
   278  ```sql
   279  CREATE TABLE products (id INT PRIMARY KEY, price INT, units_sold INT, type TEXT)
   280  CREATE INDEX idx1 ON products (price) WHERE units_sold > 1000
   281  CREATE INDEX idx2 ON products (price) WHERE units_sold > 1000 AND type = 'toy'
   282  
   283  SELECT COUNT(*) FROM products WHERE units_sold > 1000 AND type = 'toy' AND price > 20
   284  ```
   285  
   286  A scan on `idx1` will scan `[/1001 - ]`. A scan on on `idx2` will have the same
   287  scan, `[/1001 - ]`, but will examine fewer rows - only those where `type = 'toy'`.
   288  Therefore, the optimizer cannot rely solely on the scan constraints to determine
   289  the number of rows returned from scanning a partial index. It must also take
   290  into account the selectivity of the predicate to correctly determine that
   291  scanning `idx2` is a lower-cost plan than scanning `idx1`.
   292  
   293  
   294  ["The Case For Partial Indexes"](https://dsf.berkeley.edu/papers/ERL-M89-17.pdf)
   295  details how to estimate the number of rows examined by a partial index scan,
   296  based on the selectivity of the filter and predicate expressions. This can be
   297  used as a starting point for making adjustments to the statistics builder.
   298  
   299  The statistics builder must account for the special case when predicate
   300  expressions include columns that are indexed. For example, consider the
   301  following table, index, and query.
   302  
   303  ```sql
   304  CREATE TABLE products (id INT PRIMARY KEY, units_sold INT)
   305  CREATE INDEX idx ON products (units_sold) WHERE units_sold > 100
   306  
   307  SELECT COUNT(*) FROM products WHERE units_sold > 200
   308  ```
   309  
   310  In this case, the constraint spans of the predicate and query filter are
   311  `[/101 - ]` and `[/201 - ]`, respectively. Because the constraints apply to the
   312  same column, the selectivity of each is _not_ independent. The number of rows
   313  examined is approximately the total number of rows in the table, multiplied by
   314  the selectivity of `[/201 - ]`. It would be inaccurate to instead estimate the
   315  number of rows examined as the total number of rows in the table, multiplied by
   316  the selectivity of `[/201 - ]` and the selectivity of `[/101 - ]`.
   317  
   318  ## Mutation
   319  
   320  Partial indexes only index rows that satisfy the partial index's predicate
   321  expression. In order to maintain this property, `INSERT`s, `UPDATE`s, and
   322  `DELETE`s to a table must update the partial index in the event that they change
   323  the candidacy of a row. Partial indexes must also be updated if an `UPDATE`d row
   324  matches the predicate both before and after the update, and the value of the
   325  indexed columns change.
   326  
   327  In order for the execution engine to determine when a partial index needs to be
   328  updated, the optimizer will project boolean columns that represent whether or not
   329  partial indexes will be updated. This will operate similarly to `CHECK`
   330  constraint verification.
   331  
   332  ### Insert
   333  
   334  If the row being inserted satisfies the predicate, write to the partial index.
   335  
   336  ### Delete
   337  
   338  If the row being deleted satisfies the predicate, delete it from the partial
   339  index.
   340  
   341  ### Updates
   342  
   343  Updates will require two columns to be projected for each partial index. The
   344  first is true if the old version of the row is in the index and needs to be
   345  deleted. The second is true if the new version of the row needs to be written to
   346  the index.
   347  
   348  Consider the following table of possibilities, where:
   349  
   350   * `r` is the version of the row before the update
   351   * `r'` is the version of the row after the update
   352   * `pred_match(r)` is `true` when `r` matches the partial index predicate
   353  
   354  |                Case                  | Delete `r` from index | Insert `r'` to index |
   355  | ------------------------------------ | --------------------- | -------------------- |
   356  | `pred_match(r) AND !pred_match(r')`  | `True`                | `False`              |
   357  | `!pred_match(r) AND pred_match(r')`  | `False`               | `True`               |
   358  | `pred_match(r) AND pred_match(r')`*  | `True`                | `True`               |
   359  | `!pred_match(r) AND !pred_match(r')` | `False`               | `False`              |
   360  
   361  
   362  *Note that in the case that the row was already in the partial index and will
   363  remain in the partial index after the update, the index only needs to be updated
   364  (delete `r` and insert `r'`) if the value of the indexed columns changes. If the
   365  value of the indexed columns is not changing, there is no need to update the
   366  index.
   367  
   368  ### Primary Key Changes
   369  
   370  If a primary key change occurs, the partial index will need to be rewritten so
   371  that the values in the index store the new primary key. This is similar to other
   372  secondary indexes, only that the added complexity of checking if rows belong in
   373  the partial index must be considered.
   374  
   375  # Alternatives considered
   376  
   377  ## Disallow `OR` operators in partial index predicates
   378  
   379  **This alternative is not being considered because it would make CRDB partial
   380  indexes incompatible with Postgres's partial indexes.**
   381  
   382  Testing for predicate implication could be simplified by disallowing `OR`
   383  operators in partial index predicates. A predicate expression without `OR` can
   384  always be represented by a constraint. Therefore, to test if a filter implies
   385  the predicate, we simply check if any of the filter's constraints contain the
   386  predicate constraint. Walking the expression trees would not be required.
   387  
   388  [SQL Server imposes this limitation for its form of partial
   389  indexes](https://docs.microsoft.com/en-us/sql/t-sql/statements/create-index-transact-sql?view=sql-server-ver15).
   390  Such an expression could always be represented by a constraint. Therefore, to
   391  test if a filter implies the predicate, we simply check if any of the filter's
   392  constraints contain the predicate constraint.
   393  
   394  Note that the `IN` operator would still be allowed, which provides a form of
   395  disjunction. The `IN` operator can easily be supported because it represents a
   396  disjunction on only one column, which a constraint _can_ represent.
   397  
   398  # Work Items
   399  
   400  Below is a list of the steps (PRs) to implement partial indexes, roughly
   401  ordered.
   402  
   403  - [ ] Add partial index predicate to internal index data structures, add parser
   404    support for `WHERE <predicate>`, add a cluster flag for gating this
   405    defaulted to "off"
   406  - [ ] Add simple equality implication check to optimizer when generating index
   407    scans, in GenerateIndexScans.
   408  - [ ] Same, for GenerateConstrainedScans.
   409  - [ ] Add support for updating partial indexes on inserts.
   410  - [ ] Add support for updating partial indexes on deletes.
   411  - [ ] Add support for updating partial indexes on updates and upserts.
   412  - [ ] Add support for backfilling partial indexes.
   413  - [ ] Update the statistics builder to account for the selectivity of the partial index
   414    predicate.
   415  - [ ] Add more advanced implication logic for filter and predicate expressions.
   416  - [ ] Add support in other index exploration rules:
   417    - [ ] GenerateInvertedIndexScans
   418    - [ ] GenerateZigZagJoin
   419    - [ ] GenerateInvertedIndexZigZagJoin
   420  - [ ] Add support for partitioned partial indexes
   421  - [ ] Add support for using partial indexes in Lookup Joins
   422  - [ ] Consider using partial indexes for auto-generated indexes used for foreign
   423    keys.
   424  - [ ] [Stretch goal] Add support for `ON CONFLICT WHERE [index_predicate] DO
   425    ...` for identifying conflict behavior for uniquer partial indexes.
   426    - More info in the [Postgres
   427      docs](https://www.postgresql.org/docs/9.5/sql-insert.html#SQL-ON-CONFLICT)
   428      and [this blog
   429      post](https://medium.com/@betakuang/why-postgresqls-on-conflict-cannot-find-my-partial-unique-index-552327b85e1)
   430  
   431  # Resources
   432  
   433  - [Postgres partial indexes documentation](https://www.postgresql.org/docs/current/indexes-partial.html)
   434  - [Postgres CREATE INDEX documentation](https://www.postgresql.org/docs/12/sql-createindex.html)
   435  - [Postgres predicate test source code](https://github.com/postgres/postgres/blob/master/src/backend/optimizer/util/predtest.c)
   436  - ["The Case For Partial Indexes", Michael Stonebraker](https://dsf.berkeley.edu/papers/ERL-M89-17.pdf)
   437  - [Use the Index Luke - Partial Indexes](https://use-the-index-luke.com/sql/where-clause/partial-and-filtered-indexes)