- Feature Name: SCRUB index and physical check implementation
- Status: completed
- Start Date: 2017-10-11
- Authors: Joey Pereira
- RFC PR: [#19327](https://github.com/cockroachdb/cockroach/pull/19327)
- Cockroach Issue: [#10425](https://github.com/cockroachdb/cockroach/issues/10425) (Consistency checking), [#19221](https://github.com/cockroachdb/cockroach/issues/19221) (Using distSQL to do index checks)

# Summary

This RFC presents a method of running a check on a secondary index in an
efficient manner, as well as other physical checks that are closely
related in implementation.

These checks can be run through the SQL statement `SCRUB`, using its
options to specifically check indexes or to check the physical data of
the database.


# Motivation

See the [SCRUB command RFC] for motivation.

In addition to the motivations for the check, this work may also benefit
schema changes. The last unresolved question of the
[online schema change RFC] mentions a method of distributing backfill
work. While backfilling is currently done through distSQL, it is not
currently able to backfill an index at a specific timestamp. This
limitation has become prominent, as it may be the source of issues such
as [#18533].

The methods for collecting index data introduced in this RFC are similar
to that which is needed to create an alternative backfilling mechanism.

[#18533]: https://github.com/cockroachdb/cockroach/issues/18533
[SCRUB command RFC]: https://github.com/cockroachdb/cockroach/issues/18675
[online schema change RFC]: https://github.com/cockroachdb/cockroach/blob/94b3f764ead019c529f14c7aab64724991f3ac67/docs/RFCS/20151014_online_schema_change.md


# Guide-level explanation

After implementing the methods proposed by this RFC, the `SCRUB` command
will have the means to check for physical data errors and for index
errors. In particular, the errors we are checking for are:
- `missing_secondary_index`
- `dangling_secondary_index`
- `invalid_encoding`
- `invalid_composite_encoding`
- `noncanonical_encoding`

The first two are index errors. A missing secondary index entry occurs
when an index entry is not correctly created or updated along with the
primary k/v. Conversely, a dangling index entry occurs when the primary
key referenced in the index k/v is not valid. This happens if the
primary k/v is incorrectly deleted while the index k/v remains, in
addition to the same reasons that cause missing index entries.

To detect missing index entries, we need to do a full scan of the
primary k/v. Then for each primary k/v, we expect to find a
corresponding index entry. If no index entry is found, we have a missing
index entry. This does not capture any dangling index entries though, as
we are only looking up index entries which have a valid primary key
reference. In order to find dangling indexes, we need to do a similar
process where we do a full scan of the secondary index k/v.

As a result, a full outer join is a natural solution for this problem.
Taking this approach limits the work to two scans, as opposed to
scanning both the primary k/v and the secondary index k/v and then doing
expensive point-lookups against the other set, twice.

Lastly, we want to propagate the errors as the SQL results of `SCRUB`,
including the following information (a sketch of one possible result
shape follows this list):
- Database and table involved
- Error type
- Primary key and secondary index value involved
- Potentially the raw key value
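
For illustration, these error results could be carried in a struct along
the following lines; this is only a sketch, and none of the names are
those of the actual implementation.

```go
// Illustrative only: one possible shape for a single SCRUB error result,
// mirroring the fields listed above.
type checkError struct {
  Database    string // database containing the table
  Table       string // table in which the error was found
  ErrorType   string // e.g. "missing_secondary_index" or "dangling_secondary_index"
  PrimaryKey  string // pretty-printed primary key of the row involved
  IndexValues string // secondary index values involved, if any
  RawKeyValue []byte // optionally, the raw key/value that triggered the error
}
```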

The following changes are introduced in order to do the entirety of the
index check:

- Add logical plan creation (`scrubNode`[0] initialization) to construct
  a logical plan for checking an index, then plan and run it through the
  distSQL execution engine. This plan will accomplish the full scan and
  join in one go.

  Rather than being built up front during planning of the `SCRUB`
  statement, these logical plans are constructed upon execution of the
  original plan, which then proceeds to run them.

The latter three errors, which relate to encoding, happen due to errors
in the row data or in our functions manipulating the row data. These can
be tested just by doing a full scan of the data to check (primary k/v
and index entry k/v). Similar to the above, we will also want the error
results to be propagated up as the SQL results.

The following changes are introduced for running physical checks:

- Transform the `scrubNode`[0] into a logical plan that fully
  encapsulates the logic to scan relevant tables. This plan will run
  through distSQL, similar to any regular plan.

- Add an `IsCheck` enum to `scanNode` and plumb it through the
  `TableReader` for `RowFetcher` initialization during distSQL planning.
  This enum would only be meaningfully set during planning of `SCRUB`
  when doing a physical check.

- Modify the `RowFetcher` to conditionally run check code in existing
  row fetching code paths, using the `IsCheck` indicator. For example,
  we want to return an error instead of panicking when a NULL is
  encountered in a non-NULL column.

- Add a new iterator function to `RowFetcher`, `NextRowWithErrors`, for
  just scanning for errors. In here we can execute additional checking
  code on a separate code path from the critical fetching one. This lets
  us consume check errors instead of row data.

- Add logic to `TableReader` to consume `NextRowWithErrors` and handle
  the difference between wanted rows and returned rows (check error row
  type). This may potentially require a new `TableChecker`. With this
  change, we are able to retrieve the desired error results (a rough
  sketch of this consumption loop follows below).

[0]: `scrubNode` is the root planning node when using the `SCRUB`
     command.
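
Below is a rough sketch of what that consumption loop could look like.
The iterator signature, the `checkErr` fields, and `emitErrorRow` are
illustrative assumptions, not the final interfaces.

```go
// Rough sketch of a check-scanning loop in a TableReader/TableChecker-style
// processor: it drains NextRowWithErrors and emits one result row per check
// error instead of emitting table rows.
for {
  checkErr, done, err := fetcher.NextRowWithErrors(ctx)
  if err != nil {
    return err // a hard error aborts the scan entirely
  }
  if done {
    break
  }
  // The output schema is the fixed error-row schema (database, table,
  // error type, primary key, raw key value), not the table's schema.
  if err := emitErrorRow(checkErr); err != nil {
    return err
  }
}
```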


# Reference-level explanation

The details of implementation are broken down into a few parts. NB:
the optimizations section is just ideation. The goal of it is to discuss
the feasibility and time-effort of the possibilities.

1. Index checks and how they will be implemented.

2. Physical checks and how they will be implemented.

3. Generating and executing the logical plan in the distSQL execution engine

4. Optimizations

   4.a. Multiple simultaneous index checks.

   4.b. Executing index and physical checks simultaneously.


## 1. Index checks and how they will be implemented.

Checking the index can be expressed through a rather simple logical SQL
plan. The final result will return all erroneous index entries -- both
dangling references and missing entries. In short, this can be broken
down into a few stages of processing:

1) Scan through the primary k/v to fetch relevant index data[1] and
   filter rows where the secondary index value is `NULL`.
2) Scan through the secondary k/v to fetch all data stored in the index.
3) Execute a natural full outer join on both streams to pair primary k/v
   with their secondary index k/v, then filter for entries from either
   stream that don't have a corresponding entry in the other stream.

For example, if we have the following schema:

```sql
CREATE TABLE t1 (
   pkey int primary key,
   name int,
   INDEX name_idx (name ASC)
)
```

Then checking the indexes can be done through the logical plan
represented by:

```sql
SELECT p.pkey, p.name, s.pkey, s.name
FROM
  (SELECT pkey, name FROM t1@{FORCE_INDEX=primary,NO_INDEX_JOIN} ORDER BY pkey, name) AS p
FULL OUTER JOIN
  (SELECT pkey, name FROM t1@{FORCE_INDEX=name_idx,NO_INDEX_JOIN} ORDER BY pkey, name) AS s
ON
  (p.pkey = s.pkey OR (p.pkey IS NULL AND s.pkey IS NULL)) AND
  (p.name = s.name OR (p.name IS NULL AND s.name IS NULL))
WHERE
  (p.pkey IS NULL AND p.name IS NULL) OR
  (s.pkey IS NULL AND s.name IS NULL)
```

In short, this query is:
1) Scanning the primary index and the secondary index.
2) Ordering both of them by the primary then secondary index columns.
   This is done to force a distSQL merge join.
3) Joining both sides on all of the columns contained by the secondary
   index key-value pairs.
4) Filtering to achieve an anti-join. It looks for when the primary
   index values are null, as they are non-nullable columns. The first
   line of the predicate takes rows on the right for the anti-join.
   The second line of the predicate takes rows on the left for the
   anti-join.

Because this is an anti-join, the results are as follows:
- If any primary index column on the left is NULL, then only the
  right-hand columns are present. This is because of the invariant that
  primary index columns are never null.
- Otherwise, only the left-hand columns are present.

Now, a problem comes up because of the join predicate. The query above
will be run as a hash joiner, with only one bucket. This will be very
slow. As a workaround, the `p.col = s.col OR (p.col IS NULL AND s.col IS NULL)`
predicates are replaced with just `p.col = s.col` so that the query can
be run using a merge joiner, hence the explicit orderings and step 2 in
the query. The distSQL plan generated from this query has
`MergeJoinerSpec.NullEquality` set to true, to make NULL = NULL evaluate
to true, making the two predicates equivalent while allowing for a far
more efficient execution.
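
To make the equivalence concrete, the following is a small conceptual
helper (plain Go, not planner code) showing the comparison the merge
joiner effectively applies to each equality column once NULL equality is
enabled; it behaves the same as the original NULL-aware predicate.

```go
// Conceptual illustration only: with NULL equality enabled, comparing an
// equality column behaves like
// `p.col = s.col OR (p.col IS NULL AND s.col IS NULL)`.
// A nil pointer stands in for SQL NULL here.
func colsEqualWithNullEquality(p, s *string) bool {
  if p == nil || s == nil {
    return p == nil && s == nil // NULL = NULL evaluates to true
  }
  return *p == *s
}
```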

[1]: Relevant index data includes the columns, extra columns, and store
     columns of the index.

## 2. Physical checks and their implementation

The **physical table** is the underlying k/v behind tables. This is
where we find primary and secondary indexes (as distinct keys),
groupings of certain data (column families), how index data is organized
(index storing conditions), and the encoding of the data.

Most of the checks that involve the physical table concern encodings and
related data validity. This data is already partially checked during
scanning in `RowFetcher`. Because of this, adding checks is a matter of
a few additive changes, plus handling the errors so that they are
returned as row results instead.

The checks we can implement in `RowFetcher` include:
- Tables using [FAMILY][] have the data organized into families
  correctly
- Invalid data encodings
- Non-canonical encodings (i.e. data round-trips)
- [Composite encodings][] are valid
- Key encoding correctly reflects the ordering of the decoded values
- `NOT NULL` constraint violation (not physical table, but checked at
  this layer)

The underlying scanning is in the `RowFetcher`. Inside the current
`RowFetcher`, the method `processKv` is responsible for encoding-specific
work as it iterates over the k/v, and is where the checks will be added
(a sketch of one such check follows this list), including:
- Check composite encodings are correct and round-trip
- Check key encodings correctly reflect the ordering of decoded values
- Check the KV pair is well-formed (everything is present, no extra
  data, checksum matches)
- For primary keys, verify that column family 0 (the sentinel) is
  present
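
As an example of what one of these checks could look like, here is a
minimal sketch of the round-trip (non-canonical encoding) check. The
`decode` and `encode` parameters stand in for the encoding routines the
`RowFetcher` already uses, and the error wrapping is simplified; none of
these names are the actual implementation's.

```go
import (
  "bytes"
  "fmt"
)

// checkValueRoundTrips decodes the stored bytes, re-encodes the resulting
// datum, and requires the output to match the input exactly; a mismatch
// indicates a non-canonical encoding.
func checkValueRoundTrips(
  orig []byte,
  decode func([]byte) (interface{}, error),
  encode func(interface{}) ([]byte, error),
) error {
  datum, err := decode(orig)
  if err != nil {
    // Would be wrapped as an invalid_encoding error in the real implementation.
    return fmt.Errorf("invalid_encoding: %v", err)
  }
  reencoded, err := encode(datum)
  if err != nil {
    return err
  }
  if !bytes.Equal(orig, reencoded) {
    return fmt.Errorf("noncanonical_encoding: %x re-encodes to %x", orig, reencoded)
  }
  return nil
}
```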

We can also throttle the speed of the check by modifying the `limitHint`
provided to `KVFetcher`, and sleeping between batches.
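
For instance, the fetch loop could be throttled roughly as follows; the
batch helper and the constants are illustrative placeholders, not the
actual `KVFetcher` API.

```go
// Illustrative throttling loop: hand the fetcher a small limit hint per batch
// and pause between batches to bound the impact on foreground traffic.
const (
  scrubLimitHint = 1000                  // rows requested per batch
  scrubBatchWait = 50 * time.Millisecond // pause between batches
)
for {
  done, err := fetchNextBatch(ctx, scrubLimitHint) // placeholder for the batched fetch
  if err != nil {
    return err
  }
  if done {
    break
  }
  time.Sleep(scrubBatchWait)
}
```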

Thus, to run the check we just need to do a scan over the whole table,
and the returned values will be the `CheckError` as found in the
[SCRUB interface RFC]. Ideally, this is executed through distSQL, as
there will be very few results relative to the data scanned.

Invalid data encodings (among other problems) already return an `error`
while scanning. We want to capture errors in `RowFetcher` that come from
any external function calls and wrap them in a `scrub.Error` struct
annotated with an error code.

Those error structs will be captured further upstream inside the
`TableReader` and translated into the row results. In the existing
scanning code path, this same error struct will simply be unwrapped.

For example, consider this excerpt from `RowFetcher.processValueSingle`:

```go
value, err := UnmarshalColumnValue(rf.alloc, typ, kv.Value)
if err != nil {
  return "", "", err
}
// Becomes
value, err := UnmarshalColumnValue(rf.alloc, typ, kv.Value)
if err != nil {
  return "", "", scrub.NewError(
    scrub.InvalidEncodingError,
    err,
  )
}
```

Another chunk of the work will be plumbing the check error results out,
since the schema of the table we are scanning differs from that of the
result rows representing errors. To get around this we must manually
construct the distSQL physical plan. This will also take some extra code
to wrap the call from `TableReader` to `RowFetcher` so that it can
transform any caught errors.
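
A minimal sketch of that wrapping follows. The `Code` and `Underlying`
fields on `scrub.Error` and the `emitErrorRow` helper are assumptions
for illustration, not the final API.

```go
// Sketch of how a TableReader-level wrapper could treat errors coming out of
// the RowFetcher: during a physical check, a scrub.Error becomes a result row
// and scanning continues; in the normal path the original error is returned.
func handleFetchError(err error, isCheck bool) error {
  scrubErr, ok := err.(*scrub.Error)
  if !ok {
    return err // not a check-related error; propagate unchanged
  }
  if isCheck {
    emitErrorRow(scrubErr.Code, scrubErr.Underlying) // record as a SCRUB result row
    return nil
  }
  return scrubErr.Underlying // unwrap for the regular scanning code path
}
```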

[SCRUB interface RFC]: https://github.com/cockroachdb/cockroach/pull/18675

## 3. Generating and executing the logical plan in the distSQL execution engine

During local engine execution, we will generate the necessary check
plans, then execute the plans in sequence on the distSQL execution
engine. These plans will scan the data in a distributed fashion and the
only data leaving the nodes are any error results. This strategy is
similar to schema change index backfilling.

The advantage of this over the alternative method [3], creating the full
logical plan during query planning, is that we can easily run the checks
sequentially. We can block waiting for the distSQL results of each check
before running the next check. This also pulls the work of aggregating
results from the distSQL execution engine to simpler Go code that
accumulates the results.
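
A sketch of this control flow follows; `checkOperation`, `buildCheckPlan`,
and `runPlanViaDistSQL` are illustrative names rather than existing
functions.

```go
// Each check's logical plan is built and run to completion on the distSQL
// engine before the next check starts; only the (few) error rows come back to
// be accumulated locally.
func runChecks(ctx context.Context, checks []checkOperation) ([]checkError, error) {
  var results []checkError
  for _, check := range checks {
    plan, err := buildCheckPlan(ctx, check) // logical plan for this one check
    if err != nil {
      return nil, err
    }
    // Blocks until the distributed flow drains for this check.
    rows, err := runPlanViaDistSQL(ctx, plan)
    if err != nil {
      return nil, err
    }
    results = append(results, rows...)
  }
  return results, nil
}
```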


## 4. Optimizations

Note that this section largely just sketches the ideas behind possible
optimizations and is not required to be resolved. The goal of this
section is to gauge feasibility, and whether or not it's worth it to go
down those routes.


### 4.a. Multiple simultaneous index checks

A challenge here is producing both check errors and the necessary data
to run index checks simultaneously. It is not currently known if
multiple different schemas can be streamed out of one table reader.

In the `RowFetcher`, this may look like a `NextRowOrCheckError` iterator
that may return either row data or check error data, and we emit those
to the corresponding outputs.

A workaround may be to have an intermediary processor for each mirrored
stream which keeps only the columns desired by each consumer -- either a
specific index check (a joiner), or the final result errors (in final
aggregation).

Alternatively, a more radical idea is to modify the RowFetcher or
TableReader to produce two separate schemas, building on the RowFetcher
adjustments made for interleaved tables.


### 4.b. Executing index and physical checks simultaneously

Checking multiple indexes sequentially is discussed above, but a problem
is that a naive solution involves doing `N` scans of the primary k/v for
checking `N` indexes. The following approaches are both non-trivial
changes, so this idea is largely suggested as an optimization for future
work.

The extra scans can be avoided if we scan all the necessary data for all
indexes being checked in one table reader, then use a mirror router to
mirror the data to multiple join consumers. Unfortunately, this would
require one of the following changes:

- Manual surgery of a distSQL physical plan to independently plan each
  index check, then combine them with:

   - Slicing out the primary k/v scan in each, replacing it with a
     single mirrored scan that fetches all the data.

   - Removing each plan's final aggregator, and having them all stream
     to a single new final aggregator.

- Or, adding the mechanisms to be able to do a `UNION` of all the
  results and then elide redundant table readers based on processors'
  column requirements. This is the same final aggregation `UNION`
  mentioned in alternative [3], but we also need to solve the problem of
  scanning the same k/v data multiple times.



# Rationale and Alternatives

## [1] Use a `JOIN READER` aggregator for index checking

Instead of using a `JOINER` aggregator, we could use a `JOIN READER`.
How this would work is that instead of the 3 stages introduced in
section 1, the process for checking would involve 2 steps:
1) Scan the primary k/v for a table.
2) Use a `JOIN READER` aggregator that does point-lookups against the
   secondary index.

This has two major problems:
- It produces lots of RPCs for the secondary index entries, which is
  potentially a performance problem.
- It does not check the secondary index for dangling entries; this
  requires a second scan which does the reverse point-lookup.


## [2] Do point-lookups directly during RowFetching for index checking

This is by far the easiest method to do as it only involves:
1) Collect the required secondary index data. The first code path that
   accesses required data is in `RowFetcher`.
2) Do point-lookups for the secondary index entry, and compare.

Step 1 is ideally done in the `RowFetcher` as this is also where we can
avoid passing decoded data further upstream, when we only need check
errors.

Step 2 can be optimized by batching the k/v lookups, which would be a
trivial mechanism to add inside `RowFetcher`. In this case, having an
alternative to `NextRow` called `NextRowWithErrors` makes more sense, to
avoid adding this code to the critical code path.
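
A minimal sketch of such batching, assuming illustrative helpers
`lookupIndexEntries` and `reportMissingIndexEntry`:

```go
// Check expected secondary index keys in batches instead of issuing one
// point-lookup per row.
func checkIndexEntriesBatched(ctx context.Context, expectedKeys [][]byte) error {
  const batchSize = 100
  for len(expectedKeys) > 0 {
    n := batchSize
    if n > len(expectedKeys) {
      n = len(expectedKeys)
    }
    // One batched KV read covers up to batchSize index entries.
    missing, err := lookupIndexEntries(ctx, expectedKeys[:n])
    if err != nil {
      return err
    }
    for _, key := range missing {
      reportMissingIndexEntry(key) // surfaces a missing_secondary_index error
    }
    expectedKeys = expectedKeys[n:]
  }
  return nil
}
```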

A [proof of concept] has been done; it does not batch lookups or
propagate errors to the client, and errors encountered are just logged.

This is very similar to the alternative proposal [1]. This approach has
the same problems.

[proof of concept]: https://github.com/lego/cockroach/blob/0f022a8e0ef97d8233ccc8aa31466090ef3dee7e/pkg/sql/sqlbase/rowfetcher.go#L651-L673

## [3] Creating the full logical plan during query planning -- transforming SCRUB planNodes into the check logic.

As an alternative to section 3, during query planning we can transform
the scrub planNode into a plan that would execute all of the check
logic. The goal of this would be to use existing planNodes to express
the check logic, as some of the checks such as index checks can be
expressed completely through SQL as seen in section 1. This was not
picked as the strategy to implement because there were more challenges
with this approach.

For this option, during the planning phase, we will expand a `Scrub`
planning node to include plans that encompass all of the checking work.
Because of this, the plan will not need to run additional plan and
execution phases.

While this may be easier to implement, as we don't need to be concerned
with creating a new distSQLPlanner, it will be limited in a few ways. If
any part of the `SCRUB` involves plan nodes that cannot be executed in
the distSQL execution engine, we have no choice but to implement those
features or come up with a workaround.

One example of that problem is aggregating the error results for
multiple index checks: we would need to add some form of `UNION`, or a
special-case workaround, to do the aggregation and the sequential
execution step.

A special case where we have new plan nodes specifically to union the
errors might look like:

```go
type ScrubCheckNode struct {
  // Sub-plans, each of which checks a particular aspect.
  Checks []ScrubCheck
}

type ScrubCheck struct {
  Check CheckType
  // Plan for running the check.
  plan *plan
}
```

Then during distSQL physical plan creation, we will plan each
`ScrubCheck.plan` and route their streams into a final aggregator. This
may be a problem though, as this will run all checks in parallel and not
in sequence, which will adversely affect foreground traffic.

Lastly, we may need to have an alternative to a `TableReader`, call it
`TableChecker`, where we call the iterator `NextRowWithErrors` instead of
`NextRow` to produce all physical errors.

# Unresolved questions