- Feature Name: SCRUB index and physical check implementation
- Status: completed
- Start Date: 2017-10-11
- Authors: Joey Pereira
- RFC PR: [#19327](https://github.com/cockroachdb/cockroach/pull/19327)
- Cockroach Issue: [#10425](https://github.com/cockroachdb/cockroach/issues/10425) (Consistency checking), [#19221](https://github.com/cockroachdb/cockroach/issues/19221) (Using distSQL to do index checks)

# Summary

This RFC presents a method of running a check on a secondary index in an
efficient manner, as well as other physical checks which are closely
related in implementation.

These checks can be run through the SQL statement `SCRUB`, using the
options to specifically check indexes or check the physical data of the
database.


# Motivation

See the [SCRUB command RFC] for motivation.

In addition to the motivations for the check, this work may also benefit
schema changes. The last unresolved question of the
[online schema change RFC] mentions a method of distributing backfill
work. While this is currently done through distSQL, it is currently
unable to backfill an index at a specific timestamp. This has become a
prevalent problem, as it may be the source of issues such as [#18533].

The methods for collecting index data introduced in this RFC are similar
to those needed to create an alternative backfilling mechanism.

[#18533]: https://github.com/cockroachdb/cockroach/issues/18533
[SCRUB command RFC]: https://github.com/cockroachdb/cockroach/issues/18675
[online schema change RFC]: https://github.com/cockroachdb/cockroach/blob/94b3f764ead019c529f14c7aab64724991f3ac67/docs/RFCS/20151014_online_schema_change.md


# Guide-level explanation

After implementing the methods proposed by this RFC, the `SCRUB` command
will have the means to check for physical data errors and for index
errors. In particular, the errors we are checking for are:
- `missing_secondary_index`
- `dangling_secondary_index`
- `invalid_encoding`
- `invalid_composite_encoding`
- `noncanonical_encoding`

The first two are index errors. Missing secondary index entries occur
when an index entry is not correctly created or updated along with the
primary k/v. Conversely, dangling index entries occur when the primary
key referenced in the index k/v is not valid. This happens if the
primary k/v was incorrectly deleted while the index k/v remains, in
addition to the reasons for missing index entries.

To detect missing index entries, we need to do a full scan of the
primary k/v. Then for each primary k/v, we expect to find a
corresponding index entry. If no index entry is found, we have a missing
index entry. This does not capture any dangling index entries though, as
we are only looking up index entries which have a valid primary key
reference. In order to find dangling index entries, we need to do a
similar process where we do a full scan of the secondary index k/v.

As a result, a full outer join is a natural solution for this problem.
Taking this approach, we limit the scans done to 2, as opposed to
potentially scanning both the primary k/v and the secondary index k/v
and doing expensive point-lookups to the other set, twice.
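
To make the classification of the join output concrete, here is a small
Go sketch of how unmatched rows from such a full outer join map to the
two index error types. The row type and helper below are illustrative
assumptions for this RFC, not actual CockroachDB code.

```go
package main

import "fmt"

// indexCheckRow is a hypothetical view of one output row of the full
// outer join: the columns read from the primary index on one side and
// the columns read from the secondary index on the other. A nil slice
// means that side had no matching entry.
type indexCheckRow struct {
	primary   []interface{}
	secondary []interface{}
}

// classify maps an unmatched join row to a SCRUB error type. Rows where
// both sides are present are consistent and produce no error.
func classify(r indexCheckRow) (errorType string, ok bool) {
	switch {
	case r.primary != nil && r.secondary == nil:
		// A primary row with no corresponding index entry.
		return "missing_secondary_index", true
	case r.primary == nil && r.secondary != nil:
		// An index entry that references no valid primary row.
		return "dangling_secondary_index", true
	default:
		return "", false
	}
}

func main() {
	rows := []indexCheckRow{
		{primary: []interface{}{1, "a"}},                                   // missing index entry
		{secondary: []interface{}{2, "b"}},                                 // dangling index entry
		{primary: []interface{}{3, "c"}, secondary: []interface{}{3, "c"}}, // consistent
	}
	for _, r := range rows {
		if errType, ok := classify(r); ok {
			fmt.Println(errType, r)
		}
	}
}
```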

Lastly, we want to propagate the errors as the SQL results of `SCRUB`,
including the following information:
- Database and table involved
- Error type
- Primary key and secondary index value involved
- Potentially the raw key value

The following changes are introduced in order to do the entirety of the
index check:

- Add logic during plan creation (`scrubNode`[0] initialization) to
  construct a logical plan for checking an index, then plan and run it
  through the distSQL execution engine. This plan will accomplish the
  full scan and join in one go.

  The solution instead constructs multiple logical plans upon execution
  of our original plan, and proceeds to run them.

The latter three errors, which relate to encoding, happen due to errors
in the row data or our functions manipulating the row data. These can be
tested just by doing a full scan of the data to check (primary k/v and
index entry k/v). Similarly to the above, we will also want the error
results to be propagated up as the SQL results.

The following changes are introduced for running physical checks:

- Transform the `scrubNode`[0] into a logical plan that fully
  encapsulates the logic to scan relevant tables. This plan will run
  through distSQL, similar to any regular plan.

- Add an `IsCheck` enum to `scanNode` and plumb it through the
  `TableReader` for `RowFetcher` initialization during distSQL planning.
  This enum would only be meaningfully set during planning of `SCRUB`
  when doing a physical check.

- Modify the `RowFetcher` to conditionally run check code in existing
  row fetching code paths, using the `IsCheck` indicator. For example,
  we want to return an error instead of panicking when NULL is
  encountered in a non-NULL column.

- Add a new iterator function to `RowFetcher`, `NextRowWithErrors`, for
  just scanning for errors. Here we can execute additional checking
  code on a separate code path from the critical fetching one. This lets
  us consume check errors instead of row data.

- Add logic to `TableReader` to consume `NextRowWithErrors` and handle
  the difference between wanted rows and returned rows (check error row
  type). This may potentially require a new `TableChecker`. With this
  change, we are able to retrieve the error results desired.

[0]: `scrubNode` is the root planning node when using the `SCRUB`
command.


# Reference-level explanation

The details of implementation are broken down into a few parts. NB:
the optimizations section is just ideation. Its goal is to discuss the
feasibility and time-effort of the possibilities.

1. Index checks and how they will be implemented.

2. Physical checks and how they will be implemented.

3. Generating and executing the logical plan in the distSQL execution engine.

4. Optimizations

   4.a. Multiple simultaneous index checks.

   4.b. Executing index and physical checks simultaneously.


## 1. Index checks and how they will be implemented.

Checking an index can be expressed through a rather simple logical SQL
plan. The final result will return all erroneous index entries -- both
dangling references and missing entries. In short, this can be broken
down into a few stages of processing:

1) Scan through the primary k/v to fetch relevant index data[1] and
   filter rows where the secondary index value is `NULL`.
2) Scan through the secondary index k/v to fetch all data stored in the
   index.
3) Execute a "natural" full outer join on both streams to pair primary
   k/v with their secondary index k/v, then filter for entries from
   either stream that don't have a corresponding entry in the other
   stream.

For example, if we have the following schema:

```sql
CREATE TABLE t1 (
    pkey int primary key,
    name int,
    INDEX name_idx (name ASC)
)
```

Then checking the indexes can be done through the logical plan
represented by:

```sql
SELECT p.pkey, p.name, s.pkey, s.name
FROM
  (SELECT pkey, name FROM t1@{FORCE_INDEX=primary,NO_INDEX_JOIN} ORDER BY pkey, name) AS p
FULL OUTER JOIN
  (SELECT pkey, name FROM t1@{FORCE_INDEX=name_idx,NO_INDEX_JOIN} ORDER BY pkey, name) AS s
ON
  (p.pkey = s.pkey OR (p.pkey IS NULL AND s.pkey IS NULL)) AND
  (p.name = s.name OR (p.name IS NULL AND s.name IS NULL))
WHERE
  (p.pkey IS NULL AND p.name IS NULL) OR
  (s.pkey IS NULL AND s.name IS NULL)
```

In short, this query is:
1) Scanning the primary index and the secondary index.
2) Ordering both of them by the primary then secondary index columns.
   This is done to force a distSQL merge join.
3) Joining both sides on all of the columns contained in the secondary
   index key-value pairs.
4) Filtering to achieve an anti-join. It looks for when the primary
   index values are null, as they are non-nullable columns. The first
   line of the predicate takes rows on the right for the anti-join.
   The second line of the predicate takes rows on the left for the
   anti-join.

Because this is an anti-join, the results are as follows:
- If any primary index column on the left is NULL, that means the
  right columns are present. This is because of the invariant that
  primary index columns are never null.
- Otherwise, the left columns are present.

Now, a problem comes up because of the join predicate. The query above
will be run using a hash joiner, with only one bucket. This will be very
slow. As a workaround, the `p.col = s.col OR (p.col IS NULL AND s.col IS NULL)`
predicates are replaced with just `p.col = s.col` so that the query can
be run using a merge joiner, hence the explicit orderings and step 2 in
the query. The distSQL plan generated from this query has
`MergeJoinerSpec.NullEquality` set to true, to make NULL = NULL evaluate
to true, making the two predicates equivalent while allowing for a far
more efficient execution.

[1]: Relevant index data includes the columns, extra columns, and stored
columns of the index.

## 2. Physical checks and their implementation

The **physical table** is the underlying k/v behind tables. This is
where we find primary and secondary indexes (as distinct keys),
groupings of certain data (column families), how index data is organized
(index storing conditions), and the encoding of the data.

Most of the checks that involve the physical table concern encoding and
related data validity. This data is already partially checked during
scanning in `RowFetcher`. Because of this, adding checks is a matter of
a few additive changes, plus handling the errors so that they are
returned as row results.
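
As a rough illustration of the kind of additive change this implies, the
sketch below shows a decoding step that branches on a check-mode flag,
reporting a constraint violation as a check error rather than failing
the scan. The types, names, and error code here are illustrative
assumptions, not the actual `RowFetcher` code.

```go
package main

import (
	"errors"
	"fmt"
)

// column is a simplified stand-in for a table column descriptor.
type column struct {
	name     string
	nullable bool
}

// checkError is a simplified stand-in for the error rows SCRUB reports.
type checkError struct {
	errorType string // hypothetical error code for a NULL-constraint violation
	column    string
}

// decodeValue is a hypothetical decoding step. When isCheck is set,
// constraint violations are returned as check errors for the caller to
// accumulate; otherwise they are returned as ordinary errors, mirroring
// the existing scanning behavior.
func decodeValue(col column, raw []byte, isCheck bool) (*checkError, error) {
	if raw == nil && !col.nullable {
		if isCheck {
			return &checkError{errorType: "null_constraint_violation", column: col.name}, nil
		}
		return nil, errors.New("NULL value in non-NULL column " + col.name)
	}
	// ... normal decoding of raw would continue here ...
	return nil, nil
}

func main() {
	col := column{name: "name", nullable: false}
	if cerr, err := decodeValue(col, nil, true /* isCheck */); err == nil && cerr != nil {
		fmt.Printf("check error: %s on column %q\n", cerr.errorType, cerr.column)
	}
}
```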

The checks we can implement in `RowFetcher` include:
- Tables using [FAMILY][] have the data organized into families
  correctly
- Invalid data encodings
- Non-canonical encodings (i.e. data round-trips)
- [Composite encodings][] are valid
- Key encoding correctly reflects the ordering of the decoded values
- `NOT NULL` constraint violations (not part of the physical table, but
  checked at this layer)

The underlying scanning is in the `RowFetcher`. Inside the current
`RowFetcher`, the method `processKv` is responsible for
encoding-specific work as it iterates over the k/v, which is where the
checks will be added, including:
- Check composite encodings are correct and round-trip
- Check key encodings correctly reflect the ordering of decoded values
- Check the KV pair is well-formed (everything is present, no extra
  data, checksum matches)
- For primary keys, verify that column family 0 (the sentinel) is
  present

We can also throttle the speed of the check by modifying the `limitHint`
provided to the `KVFetcher`, and sleeping between batches.

Thus, to run the check we just need to do a scan over the whole table,
and the returned values will be the `CheckError` as found in the
[SCRUB interface RFC]. Ideally, this is executed through distSQL, as
there will be very few results relative to the data scanned.

Invalid data encodings (and other problems) already return an `error`
while scanning. We want to capture errors in `RowFetcher` that come from
any external function calls and wrap them in a `scrub.Error` struct with
an error code annotated.

Those error structs will be captured further upstream inside the
`TableReader` and translated into the row results. In the existing
scanning code path, this same error struct will simply be unwrapped.

For example, consider this excerpt from `RowFetcher.processValueSingle`:

```go
value, err := UnmarshalColumnValue(rf.alloc, typ, kv.Value)
if err != nil {
	return "", "", err
}

// Becomes

value, err := UnmarshalColumnValue(rf.alloc, typ, kv.Value)
if err != nil {
	return "", "", scrub.NewError(
		scrub.InvalidEncodingError,
		err,
	)
}
```

Another chunk of the work will then be plumbing the check error results
out, as the table schema we are scanning will differ from the result
rows representing errors. To get around this we must manually construct
the distSQL physical plan. This will also take some extra code to wrap
the call from `TableReader` to `RowFetcher`, where it will transform any
caught errors.

[SCRUB interface RFC]: https://github.com/cockroachdb/cockroach/pull/18675

## 3. Generating and executing the logical plan in the distSQL execution engine

During local engine execution, we will generate the necessary check
plans, then execute the plans in sequence on the distSQL execution
engine. These plans will scan the data in a distributed fashion and the
only data leaving the nodes is the error results. This strategy is
similar to schema change index backfilling.

The advantage of this over the alternative method [3], creating the full
logical plan during query planning, is that we can easily run the checks
sequentially. We can block waiting for the distSQL results of each check
before running the next check.
This also pulls the work of aggregating
results from the distSQL execution engine into simpler Go code that
accumulates the results.


## 4. Optimizations

Note that this section is largely just the ideas behind possible
optimizations and is not required to be resolved. The goal of this
section is to gauge feasibility, and whether or not it's worth going
down those routes.


### 4.a. Multiple simultaneous index checks

A challenge here is producing both check errors and the necessary data
to run index checks simultaneously. It is not currently known whether
multiple different schemas can be streamed out of one table reader.

In the `RowFetcher`, this may look like a `NextRowOrCheckError` iterator
that may return either row data or check error data, and we emit those
to the corresponding outputs.

A workaround may be to have an intermediary processor for each mirrored
stream which keeps only the columns desired by each consumer -- either a
specific index check (a joiner), or the final result errors (the final
aggregation).

Alternatively, a crazier idea is to modify the `RowFetcher` or
`TableReader` to produce two separate schemas, branching off the
adjustments made to `RowFetcher` for interleaved tables.


### 4.b. Executing index and physical checks simultaneously

Checking multiple indexes sequentially is discussed above, but a problem
is that a naive solution involves doing `N` scans of the primary k/v for
checking `N` indexes. The following approaches are both non-trivial
changes, so this idea is largely suggested as an optimization for future
work.

The extra scans can be avoided if we scan all the necessary data for all
indexes being checked in one table reader, then use a mirror router to
mirror the data to multiple join consumers. Unfortunately, this would
require making one of the following changes:

- Manual surgery of a distSQL physical plan to independently plan each
  index check, then combine them by:

  - Slicing out the primary k/v scan in each, replacing it with a
    single mirrored scan that fetches all the data.

  - Removing each plan's final aggregator, and having them all stream
    to a single new final aggregator.

- Or, adding the mechanisms to be able to do a `UNION` of all the
  results and then elide redundant table readers based on processors'
  column requirements. This is the same final aggregation `UNION`
  mentioned in alternative [3], but we also need to solve the problem of
  scanning the same k/v data multiple times.


# Rationale and Alternatives

## [1] Use a `JOIN READER` aggregator for index checking

Instead of using a `JOINER` aggregator, we could use a `JOIN READER`.
How this would work is that instead of the 3 stages introduced in
section 1, the process for checking would involve 2 steps:
1) Scan the primary k/v for a table.
2) Use a `JOIN READER` aggregator that does point-lookups against the
   secondary index.

This has two major problems:
- It produces lots of RPCs for the secondary index entries, which is
  potentially a performance problem.
- It does not check the secondary index for dangling entries. That
  requires a second scan which does the reverse point-lookup.


## [2] Do point-lookups directly during RowFetching for index checking

This is by far the easiest method, as it only involves:
1) Collecting the required secondary index data. The first code path
   that accesses the required data is in `RowFetcher`.
2) Doing point-lookups for the secondary index entry, and comparing.

Step 1 is ideally done in the `RowFetcher`, as this is also where we can
avoid passing decoded data further upstream when we only need check
errors.

Step 2 can be optimized by batching the k/v lookups, which would be a
trivial mechanism to add inside `RowFetcher`. In this case, having an
alternative to `NextRow` called `NextRowWithErrors` makes more sense, so
as not to add this code to the critical code path.

A [proof of concept] has been done; it does not batch lookups or
propagate errors to the client. Errors encountered are just logged.

This is very similar to alternative proposal [1] and has the same
problems.

[proof of concept]: https://github.com/lego/cockroach/blob/0f022a8e0ef97d8233ccc8aa31466090ef3dee7e/pkg/sql/sqlbase/rowfetcher.go#L651-L673

## [3] Creating the full logical plan during query planning -- transforming SCRUB planNodes into the check logic

As an alternative to section 3, during query planning we can transform
the scrub planNode into a plan that would execute all of the check
logic. The goal of this would be to use existing planNodes to express
the check logic, as some of the checks, such as index checks, can be
expressed completely through SQL as seen in section 1. This was not
picked as the strategy to implement because there were more challenges
with this approach.

For this option, during the planning phase we will expand a `Scrub`
planning node to include plans that encompass all of the checking work.
Because of this, the plan will not need to run additional plan and
execution phases.

While this may be easier to implement, as we don't need to be concerned
with creating a new distSQLPlanner, it will be limited in a few ways. If
any part of the `SCRUB` plan involves plan nodes that cannot be executed
in the distSQL execution engine, we have no choice but to implement
those features or come up with a workaround.

One example of that problem is the need to aggregate the error results
for multiple index checks. To do this, we will need to add some form of
`UNION` or a special-case workaround to do the aggregation and
sequential execution step.

A special case where we have new plan nodes specifically to union the
errors might look like:

```go
type ScrubCheckNode struct {
	// Sub-plans which each check a particular aspect.
	Checks []ScrubCheck
}

type ScrubCheck struct {
	Check CheckType
	// Plan for running the check.
	plan *plan
}
```

Then during distSQL physical plan creation, we will plan each
`ScrubCheck.plan` and route their streams into a final aggregator. This
may be a problem though, as it will run all checks in parallel and not
in sequence, which will adversely affect foreground traffic.

Lastly, we may need an alternative to `TableReader`, call it
`TableChecker`, where we call the iterator `NextRowWithErrors` instead
of `NextRow` to produce all physical errors.

# Unresolved questions