github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/docs/RFCS/20200507_partial_indexes.md (about) 1 - Feature Name: Partial Indexes 2 - Status: in-progress 3 - Start Date: 2020-05-07 4 - Authors: mgartner 5 - RFC PR: [#48557](https://github.com/cockroachdb/cockroach/pull/48557) 6 - Cockroach Issue: [#9683](https://github.com/cockroachdb/cockroach/issues/9683) 7 8 # Summary 9 10 This RFC proposes the addition of partial indexes to CockroachDB. A partial 11 index is an index with a boolean predicate expression that only indexes rows in 12 which the predicate evaluates to true. 13 14 Partial indexes are a common feature in RDBMSs. They can be beneficial in 15 multiple use-cases. Partial indexes can: 16 17 - Improve read performance, like normal indexes, without overhead for writes 18 that don't match the predicate expression. 19 - Scan fewer rows than normal indexes on the same columns due to the partial 20 index indexing a subset of rows. 21 - Reduce the total size of the set of indexes required to satisfy queries, by 22 both reducing the number of rows indexed, and reducing the number of columns 23 indexed. 24 - Ensure uniqueness on a subset of rows in a table via `CREATE UNIQUE INDEX 25 ...`. 26 27 # Guide-level Explanation 28 29 ## Usage 30 31 Partial indexes are created by including a _predicate expression_ via `WHERE 32 <predicate>` in a `CREATE INDEX` statement. For example: 33 34 ```sql 35 CREATE INDEX popular_products ON products (price) WHERE units_sold > 1000 36 ``` 37 38 The `popular_products` index only indexes rows where the `units_sold` column has 39 a value greater than `1000`. 40 41 Partial indexes can only be used to satisfy a query that has a _filter 42 expression_, `WHERE <filter>`, that implies the predicate expression. For 43 example, consider the following queries: 44 45 ```sql 46 SELECT max(price) FROM products 47 48 SELECT max(price) FROM products WHERE review_count > 100 49 50 SELECT max(price) FROM products WHERE units_sold > 500 51 52 SELECT max(price) FROM products WHERE units_sold > 1500 53 ``` 54 55 Only the last query can utilize the partial index `popular_products`. Its filter 56 expression, `units_sold > 1500`, _implies_ the predicate expression, `units_sold 57 > 1000`. Every value for `units_sold` that is greater than `1500` is also 58 greater than `1000`. Stated differently, the predicate expression _contains_ the 59 filter expression. 60 61 Note that CRDB, like Postgres, will perform a best-effort attempt to prove that 62 a query filter expression implies a partial index predicate. It is not 63 guaranteed to prove implication of arbitrarily complex expressions. 64 65 ## Valid Predicate Expressions 66 67 There are some notable restrictions that are enforced on partial index 68 predicates. 69 70 1. They must result in a boolean. 71 2. They can only refer to columns in the table being indexed. 72 3. Functions used within predicates must be immutable. For example, `now()` is 73 not allowed because its result depends on more than its arguments. 74 75 ## Index Hints 76 77 Unlike full indexes, a partial index cannot be used to satisfy all queries. 78 Queries that don't imply the partial index predicate may need to return rows 79 that do not exist in the partial index. 80 81 Therefore, hinting a partial index, with `table@index`, will behave as follows: 82 83 * If the query filter expression can be proved to imply the partial index 84 predicate, the partial index will be used in the query plan. 85 * If not, an error will be returned. 86 87 # Reference-level Explanation 88 89 This design covers 5 major aspects of implementing partial indexes: parsing, 90 testing predicate implication, generating partial index scans, statistics, and 91 mutation. 92 93 ## Parsing 94 95 The parser will be updated to support new syntax for creating partial indexes. 96 Below are examples of statements that will be supported. 97 98 ```sql 99 -- CREATE INDEX 100 CREATE INDEX ON a (b) WHERE c > 3 101 CREATE UNIQUE INDEX ON a (b) WHERE c > 3 102 CREATE INVERTED INDEX ON a (b) WHERE c > 3 103 104 -- CREATE TABLE ... INDEX (not supported by Postgres) 105 CREATE TABLE a (b INT, INDEX (b) WHERE b > 3) 106 CREATE TABLE a (b INT, UNIQUE INDEX (b) WHERE b > 3) 107 CREATE TABLE a (b INT, INVERTED INDEX (b) WHERE b > 3) 108 109 --- CREATE TABLE ... CONSTRAINT (not supported by Postgres) 110 CREATE TABLE a (b INT, CONSTRAINT c UNIQUE (b) WHERE b > 3) 111 ``` 112 113 In general, the partial index predicate will be the last optional term for 114 statements that create indexes. For example, below is an overview of the syntax 115 supported for `CREATE INDEX`. 116 117 ``` 118 CREATE [UNIQUE] [INVERTED] INDEX [name] 119 ON tbl (cols...) 120 [STORING ( ... )] 121 [INTERLEAVE ...] 122 [PARTITION BY ...] 123 [WHERE ...] 124 ``` 125 126 Note that the `WHERE predicate` modifier will not be allowed in column 127 qualifiers. For example, the statement below will **NOT** be supported: 128 129 ```sql 130 CREATE TABLE a (k INT PRIMARY KEY, b INT UNIQUE WHERE k = 0) 131 ``` 132 133 In order to ensure that predicates are valid (e.g., they result in booleans and 134 contain no impure functions), we will use the same logic that validates `CHECK` 135 constraints, `sqlbase.SanitizeVarFreeExpr`. The restrictions for `CHECK` 136 constraints and partial index predicates are the same. 137 138 ## Testing Predicate Implication 139 140 In order to use a partial index to satisfy a query, the filter expression of the 141 query must _imply_ that the partial index predicate is true. If the predicate is 142 not provably true, the rows to be returned may not exist in the partial index, 143 and it cannot be used. Note that other indexes, partial or not, could still be 144 used to satisfy the query. 145 146 ### Exact matches 147 148 First, we will check if any conjuncted-expression in the filter is an exact 149 match to the predicate. 150 151 For example, consider the filter expression `a > 10 AND b < 100` and the partial 152 index predicate `b < 100`. The second conjuncted expression in the filter, `b < 153 100`, is an exact match to the predicate `b < 100`. Therefore this filter 154 implies this predicate. 155 156 We can test for pointer equality to check if the conjuncted-expressions are an 157 exact match. The `interner` ensures that identical expressions have the same 158 memory address. 159 160 ### Non-exact matches 161 162 There are cases when an expression implies a predicate, but is not an exact 163 match. 164 165 For example, `a > 10` implies `a > 0` because all values for `a` that satisfy 166 `a > 10` also satisfy `a > 0`. 167 168 Constraints and constraint sets can be leveraged to help perform implication 169 checks. However, they are not a full solution. Constraint sets cannot represent 170 a disjunction with different columns on each side. 171 172 Consider the following example: 173 174 ```sql 175 CREATE TABLE products (id INT PRIMARY KEY, price INT, units_sold INT, review_count INT) 176 CREATE INDEX popular_prds ON t (price) WHERE units_sold > 1000 OR review_count > 100 177 ``` 178 179 No constraint can be created for the top-level predicate expression of 180 `popular_prds`. 181 182 Therefore, constraints alone cannot help us determine that `popular_prds` can be 183 scanned to satisfy any of the below queries: 184 185 ```sql 186 SELECT COUNT(id) FROM products WHERE units_sold > 1500 AND price > 100 187 188 SELECT COUNT(id) FROM products WHERE review_count > 200 AND price < 100 189 190 SELECT COUNT(id) FROM products WHERE (units_sold > 1000 OR review_count > 200) AND price < 100 191 ``` 192 193 In order to accommodate for such expressions, we must walk the filter and 194 expression trees. At each predicate expression node, we will check if it is 195 implied by the filter expression node. 196 197 Postgres's [predtest library](https://github.com/postgres/postgres/blob/c9d29775195922136c09cc980bb1b7091bf3d859/src/backend/optimizer/util/predtest.c#L251-L287) 198 uses this method to determine if a partial index can be used to satisfy a query. 199 The logic Postgres uses for testing implication of conjunctions, disjunctions, 200 and "atoms" (anything that is not an `AND` or `OR`) is as follows: 201 202 ("=>" means "implies") 203 204 atom A => atom B if: A contains B 205 atom A => AND-expr B if: A => each of B's children 206 atom A => OR-expr B if: A => any of B's children 207 208 AND-expr A => atom B if: any of A's children => B 209 AND-expr A => AND-expr B if: A => each of B's children 210 AND-expr A => OR-expr B if: A => any of B's children OR 211 any of A's children => B 212 213 OR-expr A => atom B if: each of A's children => B 214 OR-expr A => AND-expr B if: A => each of B's children 215 OR-expr A => OR-expr B if: each of A's children => any of B's children 216 217 Because atoms will not contain any `AND` or `OR` expressions, we can generate a 218 `constraint.Span` for each of them in order to check for containment. There may 219 be edge-cases which cannot be handled by `constraint.Span`, such as `IS NULL` 220 expressions or multi-column values, like tuples. 221 222 At a high-level, to test whether or not `atom A => atom B`, we can perform the 223 following tests, in order: 224 225 1. If the atoms are equal (pointer equality), then `A => B`. 226 2. If the column referenced in `A` is not the column referenced in `B`, then `A` 227 does not imply `B`. 228 3. If the `constraint.Span` of `A` is contained by the `constraint.Span` of `B`, 229 then `A => B`. 230 231 There may be special considerations required for handling multi-column atoms, 232 such as `(a, b) > (1, 2)`. Multi-column spans should be helpful in proving 233 containment, though it should be noted that Postgres only supports simple 234 multiple column implications. 235 236 The time complexity of this check is `O(P * F)`, where `P` is the number of 237 nodes in the predicate expression and `F` is the number of nodes in the filter 238 expression. 239 240 ## Generating Partial Index Scans 241 242 We will consider utilizing partial indexes for both unconstrained and 243 constrained scans. Therefore, we'll need to modify both the `GenerateIndexScans` 244 and `GenerateConstrainedScans` exploration rules (or make new, similar rules). 245 246 In addition, we'll need to update exploration rules for zig-zag joins and 247 inverted index scans. 248 249 We'll remove redundant filters from the expression when generating a scan over a 250 partial index. For example: 251 252 ```sql 253 CREATE TABLE products (id INT PRIMARY KEY, price INT, units_sold INT, units_in_stock INT) 254 CREATE INDEX idx1 ON products (price) WHERE units_sold > 1000 255 256 SELECT * FROM products WHERE price > 20 AND units_sold > 1000 AND units_in_stock > 0 257 ``` 258 259 When generating the constrained scan over `idx1`, the `units_sold > 1000` filter 260 can be removed from the outer `Select`, such that only the `units_in_stock > 0` 261 filter remains. 262 263 Only conjuncted filter expressions that _exactly_ match the 264 predicate expression can be removed. For example, a filter expression 265 `units_sold > 1200` could not be removed. This filter would remain and be 266 applied after the scan in order to remove any rows returned by the scan with 267 `units_sold` between `1000` and `1200`. 268 269 ## Statistics 270 271 The statistics builder must take into account the predicate expression, in 272 addition to the filter expression, when generating statistics for a partial 273 index scan. This is because the number of rows examined via a partial index scan 274 is dependent on the predicate expression. 275 276 For example, consider the following table, indexes, and query: 277 278 ```sql 279 CREATE TABLE products (id INT PRIMARY KEY, price INT, units_sold INT, type TEXT) 280 CREATE INDEX idx1 ON products (price) WHERE units_sold > 1000 281 CREATE INDEX idx2 ON products (price) WHERE units_sold > 1000 AND type = 'toy' 282 283 SELECT COUNT(*) FROM products WHERE units_sold > 1000 AND type = 'toy' AND price > 20 284 ``` 285 286 A scan on `idx1` will scan `[/1001 - ]`. A scan on on `idx2` will have the same 287 scan, `[/1001 - ]`, but will examine fewer rows - only those where `type = 'toy'`. 288 Therefore, the optimizer cannot rely solely on the scan constraints to determine 289 the number of rows returned from scanning a partial index. It must also take 290 into account the selectivity of the predicate to correctly determine that 291 scanning `idx2` is a lower-cost plan than scanning `idx1`. 292 293 294 ["The Case For Partial Indexes"](https://dsf.berkeley.edu/papers/ERL-M89-17.pdf) 295 details how to estimate the number of rows examined by a partial index scan, 296 based on the selectivity of the filter and predicate expressions. This can be 297 used as a starting point for making adjustments to the statistics builder. 298 299 The statistics builder must account for the special case when predicate 300 expressions include columns that are indexed. For example, consider the 301 following table, index, and query. 302 303 ```sql 304 CREATE TABLE products (id INT PRIMARY KEY, units_sold INT) 305 CREATE INDEX idx ON products (units_sold) WHERE units_sold > 100 306 307 SELECT COUNT(*) FROM products WHERE units_sold > 200 308 ``` 309 310 In this case, the constraint spans of the predicate and query filter are 311 `[/101 - ]` and `[/201 - ]`, respectively. Because the constraints apply to the 312 same column, the selectivity of each is _not_ independent. The number of rows 313 examined is approximately the total number of rows in the table, multiplied by 314 the selectivity of `[/201 - ]`. It would be inaccurate to instead estimate the 315 number of rows examined as the total number of rows in the table, multiplied by 316 the selectivity of `[/201 - ]` and the selectivity of `[/101 - ]`. 317 318 ## Mutation 319 320 Partial indexes only index rows that satisfy the partial index's predicate 321 expression. In order to maintain this property, `INSERT`s, `UPDATE`s, and 322 `DELETE`s to a table must update the partial index in the event that they change 323 the candidacy of a row. Partial indexes must also be updated if an `UPDATE`d row 324 matches the predicate both before and after the update, and the value of the 325 indexed columns change. 326 327 In order for the execution engine to determine when a partial index needs to be 328 updated, the optimizer will project boolean columns that represent whether or not 329 partial indexes will be updated. This will operate similarly to `CHECK` 330 constraint verification. 331 332 ### Insert 333 334 If the row being inserted satisfies the predicate, write to the partial index. 335 336 ### Delete 337 338 If the row being deleted satisfies the predicate, delete it from the partial 339 index. 340 341 ### Updates 342 343 Updates will require two columns to be projected for each partial index. The 344 first is true if the old version of the row is in the index and needs to be 345 deleted. The second is true if the new version of the row needs to be written to 346 the index. 347 348 Consider the following table of possibilities, where: 349 350 * `r` is the version of the row before the update 351 * `r'` is the version of the row after the update 352 * `pred_match(r)` is `true` when `r` matches the partial index predicate 353 354 | Case | Delete `r` from index | Insert `r'` to index | 355 | ------------------------------------ | --------------------- | -------------------- | 356 | `pred_match(r) AND !pred_match(r')` | `True` | `False` | 357 | `!pred_match(r) AND pred_match(r')` | `False` | `True` | 358 | `pred_match(r) AND pred_match(r')`* | `True` | `True` | 359 | `!pred_match(r) AND !pred_match(r')` | `False` | `False` | 360 361 362 *Note that in the case that the row was already in the partial index and will 363 remain in the partial index after the update, the index only needs to be updated 364 (delete `r` and insert `r'`) if the value of the indexed columns changes. If the 365 value of the indexed columns is not changing, there is no need to update the 366 index. 367 368 ### Primary Key Changes 369 370 If a primary key change occurs, the partial index will need to be rewritten so 371 that the values in the index store the new primary key. This is similar to other 372 secondary indexes, only that the added complexity of checking if rows belong in 373 the partial index must be considered. 374 375 # Alternatives considered 376 377 ## Disallow `OR` operators in partial index predicates 378 379 **This alternative is not being considered because it would make CRDB partial 380 indexes incompatible with Postgres's partial indexes.** 381 382 Testing for predicate implication could be simplified by disallowing `OR` 383 operators in partial index predicates. A predicate expression without `OR` can 384 always be represented by a constraint. Therefore, to test if a filter implies 385 the predicate, we simply check if any of the filter's constraints contain the 386 predicate constraint. Walking the expression trees would not be required. 387 388 [SQL Server imposes this limitation for its form of partial 389 indexes](https://docs.microsoft.com/en-us/sql/t-sql/statements/create-index-transact-sql?view=sql-server-ver15). 390 Such an expression could always be represented by a constraint. Therefore, to 391 test if a filter implies the predicate, we simply check if any of the filter's 392 constraints contain the predicate constraint. 393 394 Note that the `IN` operator would still be allowed, which provides a form of 395 disjunction. The `IN` operator can easily be supported because it represents a 396 disjunction on only one column, which a constraint _can_ represent. 397 398 # Work Items 399 400 Below is a list of the steps (PRs) to implement partial indexes, roughly 401 ordered. 402 403 - [ ] Add partial index predicate to internal index data structures, add parser 404 support for `WHERE <predicate>`, add a cluster flag for gating this 405 defaulted to "off" 406 - [ ] Add simple equality implication check to optimizer when generating index 407 scans, in GenerateIndexScans. 408 - [ ] Same, for GenerateConstrainedScans. 409 - [ ] Add support for updating partial indexes on inserts. 410 - [ ] Add support for updating partial indexes on deletes. 411 - [ ] Add support for updating partial indexes on updates and upserts. 412 - [ ] Add support for backfilling partial indexes. 413 - [ ] Update the statistics builder to account for the selectivity of the partial index 414 predicate. 415 - [ ] Add more advanced implication logic for filter and predicate expressions. 416 - [ ] Add support in other index exploration rules: 417 - [ ] GenerateInvertedIndexScans 418 - [ ] GenerateZigZagJoin 419 - [ ] GenerateInvertedIndexZigZagJoin 420 - [ ] Add support for partitioned partial indexes 421 - [ ] Add support for using partial indexes in Lookup Joins 422 - [ ] Consider using partial indexes for auto-generated indexes used for foreign 423 keys. 424 - [ ] [Stretch goal] Add support for `ON CONFLICT WHERE [index_predicate] DO 425 ...` for identifying conflict behavior for uniquer partial indexes. 426 - More info in the [Postgres 427 docs](https://www.postgresql.org/docs/9.5/sql-insert.html#SQL-ON-CONFLICT) 428 and [this blog 429 post](https://medium.com/@betakuang/why-postgresqls-on-conflict-cannot-find-my-partial-unique-index-552327b85e1) 430 431 # Resources 432 433 - [Postgres partial indexes documentation](https://www.postgresql.org/docs/current/indexes-partial.html) 434 - [Postgres CREATE INDEX documentation](https://www.postgresql.org/docs/12/sql-createindex.html) 435 - [Postgres predicate test source code](https://github.com/postgres/postgres/blob/master/src/backend/optimizer/util/predtest.c) 436 - ["The Case For Partial Indexes", Michael Stonebraker](https://dsf.berkeley.edu/papers/ERL-M89-17.pdf) 437 - [Use the Index Luke - Partial Indexes](https://use-the-index-luke.com/sql/where-clause/partial-and-filtered-indexes)