- Feature Name: SCRUB index and physical check implementation
- Status: completed
- Start Date: 2017-10-11
- Authors: Joey Pereira
- RFC PR: [#19327](https://github.com/cockroachdb/cockroach/pull/19327)
- Cockroach Issue: [#10425](https://github.com/cockroachdb/cockroach/issues/10425) (Consistency checking), [#19221](https://github.com/cockroachdb/cockroach/issues/19221) (Using distSQL to do index checks)

# Summary

This RFC presents a method of running a check on a secondary index in an
efficient manner, as well as other physical checks that are closely
related in implementation.

These checks can be run through the SQL statement `SCRUB`, using its
options to specifically check indexes or to check the physical data of
the database.


# Motivation

See the [SCRUB command RFC] for motivation.

In addition to the motivations for the check, this work may also benefit
schema changes. The last unresolved question of the
[online schema change RFC] mentions a method of distributing backfill
work. While backfilling is currently done through distSQL, it is not
currently able to backfill an index at a specific timestamp. This
limitation has become prominent, as it may be the source of issues such
as [#18533].

The methods for collecting index data introduced in this RFC are similar
to that which is needed to create an alternative backfilling mechanism.

[#18533]: https://github.com/cockroachdb/cockroach/issues/18533
[SCRUB command RFC]: https://github.com/cockroachdb/cockroach/issues/18675
[online schema change RFC]: https://github.com/cockroachdb/cockroach/blob/94b3f764ead019c529f14c7aab64724991f3ac67/docs/RFCS/20151014_online_schema_change.md


# Guide-level explanation

After implementing the methods proposed by this RFC, the `SCRUB` command
will have the means to check for physical data errors and for index
errors. In particular, the errors we are checking for are:
- `missing_secondary_index`
- `dangling_secondary_index`
- `invalid_encoding`
- `invalid_composite_encoding`
- `noncanonical_encoding`

The first two are index errors. A missing secondary index entry occurs
when an index entry is not correctly created or updated along with the
primary k/v. Conversely, a dangling index entry occurs when the primary
key referenced in the index k/v is not valid. This happens if the
primary k/v is incorrectly deleted while the index k/v remains, in
addition to the same reasons that cause missing index entries.

To detect missing index entries, we need to do a full scan of the
primary k/v. Then for each primary k/v, we expect to find a
corresponding index entry. If no index entry is found, we have a missing
index entry. This does not capture any dangling index entries though, as
we are only looking up index entries which have a valid primary key
reference. In order to find dangling indexes, we need to do a similar
process where we do a full scan of the secondary index k/v.

As a result, a full outer join is a natural solution for this problem.
Taking this approach limits the work to two scans, as opposed to
scanning both the primary k/v and the secondary index k/v and then doing
expensive point-lookups against the other set, twice.

Lastly, we want to propagate the errors as the SQL results of `SCRUB`,
including the following information (a sketch of one possible result
shape follows this list):
- Database and table involved
- Error type
- Primary key and secondary index value involved
- Potentially the raw key value
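
For illustration, these error results could be carried in a struct along
the following lines; this is only a sketch, and none of the names are
those of the actual implementation.

```go
// Illustrative only: one possible shape for a single SCRUB error result,
// mirroring the fields listed above.
type checkError struct {
  Database    string // database containing the table
  Table       string // table in which the error was found
  ErrorType   string // e.g. "missing_secondary_index" or "dangling_secondary_index"
  PrimaryKey  string // pretty-printed primary key of the row involved
  IndexValues string // secondary index values involved, if any
  RawKeyValue []byte // optionally, the raw key/value that triggered the error
}
```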

The following changes are introduced in order to do the entirety of the
index check:

- Add logical plan creation (`scrubNode`[0] initialization) to construct
  a logical plan for checking an index, then plan and run it through the
  distSQL execution engine. This plan will accomplish the full scan and
  join in one go.

  Rather than being built up front during planning of the `SCRUB`
  statement, these logical plans are constructed upon execution of the
  original plan, which then proceeds to run them.

The latter three errors, which relate to encoding, happen due to errors
in the row data or in our functions manipulating the row data. These can
be tested just by doing a full scan of the data to check (primary k/v
and index entry k/v). Similar to the above, we will also want the error
results to be propagated up as the SQL results.

The following changes are introduced for running physical checks:

- Transform the `scrubNode`[0] into a logical plan that fully
  encapsulates the logic to scan relevant tables. This plan will run
  through distSQL, similar to any regular plan.

- Add an `IsCheck` enum to `scanNode` and plumb it through the
  `TableReader` for `RowFetcher` initialization during distSQL planning.
  This enum would only be meaningfully set during planning of `SCRUB`
  when doing a physical check.

- Modify the `RowFetcher` to conditionally run check code in existing
  row fetching code paths, using the `IsCheck` indicator. For example,
  we want to return an error instead of panicking when a NULL is
  encountered in a non-NULL column.

- Add a new iterator function to `RowFetcher`, `NextRowWithErrors`, for
  just scanning for errors. In here we can execute additional checking
  code on a separate code path from the critical fetching one. This lets
  us consume check errors instead of row data.

- Add logic to `TableReader` to consume `NextRowWithErrors` and handle
  the difference between wanted rows and returned rows (check error row
  type). This may potentially require a new `TableChecker`. With this
  change, we are able to retrieve the desired error results (a rough
  sketch of this consumption loop follows below).

[0]: `scrubNode` is the root planning node when using the `SCRUB`
     command.
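
Below is a rough sketch of what that consumption loop could look like.
The iterator signature, the `checkErr` fields, and `emitErrorRow` are
illustrative assumptions, not the final interfaces.

```go
// Rough sketch of a check-scanning loop in a TableReader/TableChecker-style
// processor: it drains NextRowWithErrors and emits one result row per check
// error instead of emitting table rows.
for {
  checkErr, done, err := fetcher.NextRowWithErrors(ctx)
  if err != nil {
    return err // a hard error aborts the scan entirely
  }
  if done {
    break
  }
  // The output schema is the fixed error-row schema (database, table,
  // error type, primary key, raw key value), not the table's schema.
  if err := emitErrorRow(checkErr); err != nil {
    return err
  }
}
```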


# Reference-level explanation

The details of implementation are broken down into a few parts. NB:
the optimizations section is just ideation. The goal of it is to discuss
the feasibility and time-effort of the possibilities.

1. Index checks and how they will be implemented.

2. Physical checks and how they will be implemented.

3. Generating and executing the logical plan in the distSQL execution engine

4. Optimizations

   4.a. Multiple simultaneous index checks.

   4.b. Executing index and physical checks simultaneously.


## 1. Index checks and how they will be implemented.

Checking the index can be expressed through a rather simple logical SQL
plan. The final result will return all erroneous index entries -- both
dangling references and missing entries. In short, this can be broken
down into a few stages of processing:

1) Scan through the primary k/v to fetch relevant index data[1] and
   filter rows where the secondary index value is `NULL`.
2) Scan through the secondary k/v to fetch all data stored in the index.
3) Execute a natural full outer join on both streams to pair primary k/v
   with their secondary index k/v, then filter for entries from either
   stream that don't have a corresponding entry in the other stream.

For example, if we have the following schema:

```sql
CREATE TABLE t1 (
   pkey int primary key,
   name int,
   INDEX name_idx (name ASC)
)
```

Then checking the indexes can be done through the logical plan
represented by:

```sql
SELECT p.pkey, p.name, s.pkey, s.name
FROM
  (SELECT pkey, name FROM t1@{FORCE_INDEX=primary,NO_INDEX_JOIN} ORDER BY pkey, name) AS p
FULL OUTER JOIN
  (SELECT pkey, name FROM t1@{FORCE_INDEX=name_idx,NO_INDEX_JOIN} ORDER BY pkey, name) AS s
ON
  (p.pkey = s.pkey OR (p.pkey IS NULL AND s.pkey IS NULL)) AND
  (p.name = s.name OR (p.name IS NULL AND s.name IS NULL))
WHERE
  (p.pkey IS NULL AND p.name IS NULL) OR
  (s.pkey IS NULL AND s.name IS NULL)
```

In short, this query is:
1) Scanning the primary index and the secondary index.
2) Ordering both of them by the primary then secondary index columns.
   This is done to force a distSQL merge join.
3) Joining both sides on all of the columns contained by the secondary
   index key-value pairs.
4) Filtering to achieve an anti-join. It looks for when the primary
   index values are null, as they are non-nullable columns. The first
   line of the predicate takes rows on the right for the anti-join.
   The second line of the predicate takes rows on the left for the
   anti-join.

Because this is an anti-join, the results are as follows:
- If any primary index column on the left is NULL, then only the
  right-hand columns are present. This is because of the invariant that
  primary index columns are never null.
- Otherwise, only the left-hand columns are present.

Now, a problem comes up because of the join predicate. The query above
will be run as a hash joiner, with only one bucket. This will be very
slow. As a workaround, the `p.col = s.col OR (p.col IS NULL AND s.col IS NULL)`
predicates are replaced with just `p.col = s.col` so that the query can
be run using a merge joiner, hence the explicit orderings and step 2 in
the query. The distSQL plan generated from this query has
`MergeJoinerSpec.NullEquality` set to true, to make NULL = NULL evaluate
to true, making the two predicates equivalent while allowing for a far
more efficient execution.
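
To make the equivalence concrete, the following is a small conceptual
helper (plain Go, not planner code) showing the comparison the merge
joiner effectively applies to each equality column once NULL equality is
enabled; it behaves the same as the original NULL-aware predicate.

```go
// Conceptual illustration only: with NULL equality enabled, comparing an
// equality column behaves like
// `p.col = s.col OR (p.col IS NULL AND s.col IS NULL)`.
// A nil pointer stands in for SQL NULL here.
func colsEqualWithNullEquality(p, s *string) bool {
  if p == nil || s == nil {
    return p == nil && s == nil // NULL = NULL evaluates to true
  }
  return *p == *s
}
```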

[1]: Relevant index data includes the columns, extra columns, and store
     columns of the index.

## 2. Physical checks and their implementation

The **physical table** is the underlying k/v behind tables. This is
where we find primary and secondary indexes (as distinct keys),
groupings of certain data (column families), how index data is organized
(index storing conditions), and the encoding of the data.

Most of the checks that involve the physical table concern encodings and
related data validity. This data is already partially checked during
scanning in `RowFetcher`. Because of this, adding checks is a matter of
a few additive changes, plus handling the errors so that they are
returned as row results instead.

The checks we can implement in `RowFetcher` include:
- Tables using [FAMILY][] have the data organized into families
  correctly
- Invalid data encodings
- Non-canonical encodings (i.e. data round-trips)
- [Composite encodings][] are valid
- Key encoding correctly reflects the ordering of the decoded values
- `NOT NULL` constraint violation (not physical table, but checked at
  this layer)

The underlying scanning is in the `RowFetcher`. Inside the current
`RowFetcher`, the method `processKv` is responsible for encoding-specific
work as it iterates over the k/v, and is where the checks will be added
(a sketch of one such check follows this list), including:
- Check composite encodings are correct and round-trip
- Check key encodings correctly reflect the ordering of decoded values
- Check the KV pair is well-formed (everything is present, no extra
  data, checksum matches)
- For primary keys, verify that column family 0 (the sentinel) is
  present
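
As an example of what one of these checks could look like, here is a
minimal sketch of the round-trip (non-canonical encoding) check. The
`decode` and `encode` parameters stand in for the encoding routines the
`RowFetcher` already uses, and the error wrapping is simplified; none of
these names are the actual implementation's.

```go
import (
  "bytes"
  "fmt"
)

// checkValueRoundTrips decodes the stored bytes, re-encodes the resulting
// datum, and requires the output to match the input exactly; a mismatch
// indicates a non-canonical encoding.
func checkValueRoundTrips(
  orig []byte,
  decode func([]byte) (interface{}, error),
  encode func(interface{}) ([]byte, error),
) error {
  datum, err := decode(orig)
  if err != nil {
    // Would be wrapped as an invalid_encoding error in the real implementation.
    return fmt.Errorf("invalid_encoding: %v", err)
  }
  reencoded, err := encode(datum)
  if err != nil {
    return err
  }
  if !bytes.Equal(orig, reencoded) {
    return fmt.Errorf("noncanonical_encoding: %x re-encodes to %x", orig, reencoded)
  }
  return nil
}
```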

We can also throttle the speed of the check by modifying the `limitHint`
provided to `KVFetcher`, and sleeping between batches.
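
For instance, the fetch loop could be throttled roughly as follows; the
batch helper and the constants are illustrative placeholders, not the
actual `KVFetcher` API.

```go
// Illustrative throttling loop: hand the fetcher a small limit hint per batch
// and pause between batches to bound the impact on foreground traffic.
const (
  scrubLimitHint = 1000                  // rows requested per batch
  scrubBatchWait = 50 * time.Millisecond // pause between batches
)
for {
  done, err := fetchNextBatch(ctx, scrubLimitHint) // placeholder for the batched fetch
  if err != nil {
    return err
  }
  if done {
    break
  }
  time.Sleep(scrubBatchWait)
}
```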

Thus, to run the check we just need to do a scan over the whole table,
and the returned values will be the `CheckError` as found in the
[SCRUB interface RFC]. Ideally, this is executed through distSQL, as
there will be very few results relative to the data scanned.

Invalid data encodings (among other problems) already return an `error`
while scanning. We want to capture errors in `RowFetcher` that come from
any external function calls and wrap them in a `scrub.Error` struct
annotated with an error code.

Those error structs will be captured further upstream inside the
`TableReader` and translated into the row results. In the existing
scanning code path, this same error struct will simply be unwrapped.

For example, consider this excerpt from `RowFetcher.processValueSingle`:

```go
value, err := UnmarshalColumnValue(rf.alloc, typ, kv.Value)
if err != nil {
  return "", "", err
}
// Becomes
value, err := UnmarshalColumnValue(rf.alloc, typ, kv.Value)
if err != nil {
  return "", "", scrub.NewError(
    scrub.InvalidEncodingError,
    err,
  )
}
```

Another chunk of the work will be plumbing the check error results out,
since the schema of the table we are scanning differs from that of the
result rows representing errors. To get around this we must manually
construct the distSQL physical plan. This will also take some extra code
to wrap the call from `TableReader` to `RowFetcher` so that it can
transform any caught errors.
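
A minimal sketch of that wrapping follows. The `Code` and `Underlying`
fields on `scrub.Error` and the `emitErrorRow` helper are assumptions
for illustration, not the final API.

```go
// Sketch of how a TableReader-level wrapper could treat errors coming out of
// the RowFetcher: during a physical check, a scrub.Error becomes a result row
// and scanning continues; in the normal path the original error is returned.
func handleFetchError(err error, isCheck bool) error {
  scrubErr, ok := err.(*scrub.Error)
  if !ok {
    return err // not a check-related error; propagate unchanged
  }
  if isCheck {
    emitErrorRow(scrubErr.Code, scrubErr.Underlying) // record as a SCRUB result row
    return nil
  }
  return scrubErr.Underlying // unwrap for the regular scanning code path
}
```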

[SCRUB interface RFC]: https://github.com/cockroachdb/cockroach/pull/18675

## 3. Generating and executing the logical plan in the distSQL execution engine

During local engine execution, we will generate the necessary check
plans, then execute the plans in sequence on the distSQL execution
engine. These plans will scan the data in a distributed fashion and the
only data leaving the nodes are any error results. This strategy is
similar to schema change index backfilling.

The advantage of this over the alternative method [3], creating the full
logical plan during query planning, is that we can easily run the checks
sequentially. We can block waiting for the distSQL results of each check
before running the next check. This also pulls the work of aggregating
results from the distSQL execution engine to simpler Go code that
accumulates the results.
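
A sketch of this control flow follows; `checkOperation`, `buildCheckPlan`,
and `runPlanViaDistSQL` are illustrative names rather than existing
functions.

```go
// Each check's logical plan is built and run to completion on the distSQL
// engine before the next check starts; only the (few) error rows come back to
// be accumulated locally.
func runChecks(ctx context.Context, checks []checkOperation) ([]checkError, error) {
  var results []checkError
  for _, check := range checks {
    plan, err := buildCheckPlan(ctx, check) // logical plan for this one check
    if err != nil {
      return nil, err
    }
    // Blocks until the distributed flow drains for this check.
    rows, err := runPlanViaDistSQL(ctx, plan)
    if err != nil {
      return nil, err
    }
    results = append(results, rows...)
  }
  return results, nil
}
```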


## 4. Optimizations

Note that this section largely just sketches the ideas behind possible
optimizations and is not required to be resolved. The goal of this
section is to gauge feasibility, and whether or not it's worth it to go
down those routes.


### 4.a. Multiple simultaneous index checks

A challenge here is producing both check errors and the necessary data
to run index checks simultaneously. It is not currently known if
multiple different schemas can be streamed out of one table reader.

In the `RowFetcher`, this may look like a `NextRowOrCheckError` iterator
that may return either row data or check error data, and we emit those
to the corresponding outputs.

A workaround may be to have an intermediary processor for each mirrored
stream which keeps only the columns desired by each consumer -- either a
specific index check (a joiner), or the final result errors (in final
aggregation).

Alternatively, a more radical idea is to modify the RowFetcher or
TableReader to produce two separate schemas, building on the RowFetcher
adjustments made for interleaved tables.


### 4.b. Executing index and physical checks simultaneously

Checking multiple indexes sequentially is discussed above, but a problem
is that a naive solution involves doing `N` scans of the primary k/v for
checking `N` indexes. The following approaches are both non-trivial
changes, so this idea is largely suggested as an optimization for future
work.

The extra scans can be avoided if we scan all the necessary data for all
indexes being checked in one table reader, then use a mirror router to
mirror the data to multiple join consumers. Unfortunately, this would
require one of the following changes:

- Manual surgery of a distSQL physical plan to independently plan each
  index check, then combine them with:

   - Slicing out the primary k/v scan in each, replacing it with a
     single mirrored scan that fetches all the data.

   - Removing each plan's final aggregator, and having them all stream
     to a single new final aggregator.

- Or, adding the mechanisms to be able to do a `UNION` of all the
  results and then elide redundant table readers based on processors'
  column requirements. This is the same final aggregation `UNION`
  mentioned in alternative [3], but we also need to solve the problem of
  scanning the same k/v data multiple times.



# Rationale and Alternatives

## [1] Use a `JOIN READER` aggregator for index checking

Instead of using a `JOINER` aggregator, we could use a `JOIN READER`.
How this would work is that instead of the 3 stages introduced in
section 1, the process for checking would involve 2 steps:
1) Scan the primary k/v for a table.
2) Use a `JOIN READER` aggregator that does point-lookups against the
   secondary index.

This has two major problems:
- It produces lots of RPCs for the secondary index entries, which is
  potentially a performance problem.
- It does not check the secondary index for dangling entries; this
  requires a second scan which does the reverse point-lookup.


## [2] Do point-lookups directly during RowFetching for index checking

This is by far the easiest method to do as it only involves:
1) Collect the required secondary index data. The first code path that
   accesses required data is in `RowFetcher`.
2) Do point-lookups for the secondary index entry, and compare.

Step 1 is ideally done in the `RowFetcher` as this is also where we can
avoid passing decoded data further upstream, when we only need check
errors.

Step 2 can be optimized by batching the k/v lookups, which would be a
trivial mechanism to add inside `RowFetcher`. In this case, having an
alternative to `NextRow` called `NextRowWithErrors` makes more sense, to
avoid adding this code to the critical code path.
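
A minimal sketch of such batching, assuming illustrative helpers
`lookupIndexEntries` and `reportMissingIndexEntry`:

```go
// Check expected secondary index keys in batches instead of issuing one
// point-lookup per row.
func checkIndexEntriesBatched(ctx context.Context, expectedKeys [][]byte) error {
  const batchSize = 100
  for len(expectedKeys) > 0 {
    n := batchSize
    if n > len(expectedKeys) {
      n = len(expectedKeys)
    }
    // One batched KV read covers up to batchSize index entries.
    missing, err := lookupIndexEntries(ctx, expectedKeys[:n])
    if err != nil {
      return err
    }
    for _, key := range missing {
      reportMissingIndexEntry(key) // surfaces a missing_secondary_index error
    }
    expectedKeys = expectedKeys[n:]
  }
  return nil
}
```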

A [proof of concept] has been done; it does not batch lookups or
propagate errors to the client, and errors encountered are just logged.

This is very similar to the alternative proposal [1]. This approach has
the same problems.

[proof of concept]: https://github.com/lego/cockroach/blob/0f022a8e0ef97d8233ccc8aa31466090ef3dee7e/pkg/sql/sqlbase/rowfetcher.go#L651-L673

## [3] Creating the full logical plan during query planning -- transforming SCRUB planNodes into the check logic.

As an alternative to section 3, during query planning we can transform
the scrub planNode into a plan that would execute all of the check
logic. The goal of this would be to use existing planNodes to express
the check logic, as some of the checks such as index checks can be
expressed completely through SQL as seen in section 1. This was not
picked as the strategy to implement because there were more challenges
with this approach.

For this option, during the planning phase, we will expand a `Scrub`
planning node to include plans that encompass all of the checking work.
Because of this, the plan will not need to run additional plan and
execution phases.

While this may be easier to implement, as we don't need to be concerned
with creating a new distSQLPlanner, it will be limited in a few ways. If
any part of the `SCRUB` involves plan nodes that cannot be executed in
the distSQL execution engine, we have no choice but to implement those
features or come up with a workaround.

One example of that problem is aggregating the error results for
multiple index checks: we would need to add some form of `UNION`, or a
special-case workaround, to do the aggregation and the sequential
execution step.

A special case where we have new plan nodes specifically to union the
errors might look like:

```go
type ScrubCheckNode struct {
  // Sub-plans, each of which checks a particular aspect.
  Checks []ScrubCheck
}

type ScrubCheck struct {
  Check CheckType
  // Plan for running the check.
  plan *plan
}
```

Then during distSQL physical plan creation, we will plan each
`ScrubCheck.plan` and route their streams into a final aggregator. This
may be a problem though, as this will run all checks in parallel and not
in sequence, which will adversely affect foreground traffic.

Lastly, we may need to have an alternative to a `TableReader`, call it
`TableChecker`, where we call the iterator `NextRowWithErrors` instead of
`NextRow` to produce all physical errors.

# Unresolved questions