
-   Feature Name: Raft consistency checker
-   Status: completed
-   Start Date: 2016-02-10
-   Authors: Ben Darnell, David Eisenstat, Bram Gruneir, Vivek Menezes
-   RFC PR: [#4317](https://github.com/cockroachdb/cockroach/pull/4317),
            [#8032](https://github.com/cockroachdb/cockroach/pull/8032)
-   Cockroach Issues: [#837](https://github.com/cockroachdb/cockroach/issues/837),
                      [#7739](https://github.com/cockroachdb/cockroach/issues/7739)

Summary
=======

An online consistency checker that periodically compares snapshots of
range replicas at a specific point in the Raft log. These snapshots
should be the same.

An API for direct invocation of the checker, to be used in tests and the
CLI.

Motivation
==========

Consistency! Correctness at scale.

Design
======

Each node scans continuously through its local range replicas,
periodically initiating a consistency check on ranges for which it is
currently the lease holder. A check proceeds in three steps; a code
sketch of the whole flow follows the list.

1.  The initiator of the check invokes the Raft command
    `ComputeChecksum` (in `roachpb.RequestUnion`), marking the point at
    which all replicas take a snapshot and compute its checksum.

2.  Outside of Raft, the initiator invokes `CollectChecksum` (in
    `service internal`) on the other replicas. The request message
    includes the initiator's checksum so that whenever a replica's
    checksum is inconsistent, both parties can log that fact.

3.  If the initiator discovers an inconsistency, it immediately retries
    the check with the `snapshot` option set to true. In this mode,
    inconsistent replicas include their full snapshot in their
    `CollectChecksum` response. The initiator retains its own snapshot
    long enough to log the diffs and panic (so that someone
    will notice).
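
To make the flow concrete, here is a rough Go sketch of the initiator's
side. All identifiers (`runCheck`, `proposeComputeChecksum`,
`collectChecksum`, the `replica` type) are invented stand-ins for the
Raft command and RPC plumbing described above, not the actual
CockroachDB code:

```go
package main

import (
	"bytes"
	"context"
	"crypto/rand"
	"fmt"
	"log"
)

// Hypothetical stand-ins for the RFC's machinery; none of these names are
// the real CockroachDB types or RPCs.
type replica struct{ id int }

type collectResponse struct {
	checksum []byte
	snapshot []byte // only populated when the check runs in snapshot mode
}

// newCheckID produces the UUID-like identifier (the RFC's checksum_id) that
// ties CollectChecksum requests to a ComputeChecksum command.
func newCheckID() string {
	var b [16]byte
	rand.Read(b[:])
	return fmt.Sprintf("%x", b)
}

// proposeComputeChecksum stands in for step 1: proposing ComputeChecksum
// through Raft and waiting for the local replica's checksum.
func proposeComputeChecksum(ctx context.Context, id string, snapshot bool) ([]byte, error) {
	return []byte("local-checksum"), nil // placeholder
}

// collectChecksum stands in for step 2: the out-of-band CollectChecksum RPC.
func collectChecksum(ctx context.Context, r replica, id string, mine []byte) (collectResponse, error) {
	return collectResponse{checksum: []byte("local-checksum")}, nil // placeholder
}

// runCheck sketches the initiator's side of steps 1-3.
func runCheck(ctx context.Context, replicas []replica, withSnapshot bool) {
	id := newCheckID() // a retried check uses a fresh ID
	mine, err := proposeComputeChecksum(ctx, id, withSnapshot)
	if err != nil {
		log.Printf("ComputeChecksum failed: %v", err)
		return
	}
	for _, r := range replicas {
		resp, err := collectChecksum(ctx, r, id, mine)
		if err != nil {
			log.Printf("replica %d: %v", r.id, err) // e.g. a dead replica timing out
			continue
		}
		if !bytes.Equal(resp.checksum, mine) {
			if !withSnapshot {
				// Step 3: retry once in snapshot mode to capture diffs.
				runCheck(ctx, replicas, true)
				return
			}
			log.Panicf("replica %d is inconsistent (%d-byte snapshot attached)",
				r.id, len(resp.snapshot))
		}
	}
}

func main() {
	runCheck(context.Background(), []replica{{id: 1}, {id: 2}}, false)
}
```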

Details
-------

The initiator of a consistency check chooses a UUID that relates its
`CollectChecksum` requests to its `ComputeChecksum` request
(`checksum_id`). Retried checks use a different UUID.

Replicas store information about ongoing consistency checks in a map
keyed by UUID. The entries of this map expire after some time so that
failures don't cause memory leaks.

To avoid blocking Raft, replicas handle `ComputeChecksum` requests
asynchronously via MVCC. `CollectChecksum` calls are outside of Raft and
block until the response checksum is ready. Because the channels are
separate, replicas may receive related requests out of order.
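
A minimal sketch of this replica-side bookkeeping, assuming a
mutex-guarded map keyed by the check's UUID and a channel per entry (the
`checkMap` type and its methods are illustrative, not the real replica
code). Whichever of the two requests arrives first creates the entry;
`CollectChecksum` then blocks on the channel until the checksum is ready
or the caller's context expires:

```go
package consistency // hypothetical package; not the real replica code

import (
	"context"
	"sync"
	"time"
)

// check tracks one in-flight consistency check on a replica.
type check struct {
	ready    chan struct{} // closed once the checksum has been computed
	checksum []byte
	deadline time.Time // entries past this point are garbage collected
}

// checkMap is keyed by the check's UUID (the RFC's checksum_id).
type checkMap struct {
	mu     sync.Mutex
	checks map[string]*check
}

// getOrCreate tolerates out-of-order arrival: either ComputeChecksum (via
// Raft) or CollectChecksum (via RPC) may be the first to see a given UUID.
func (m *checkMap) getOrCreate(id string, ttl time.Duration) *check {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.checks == nil {
		m.checks = map[string]*check{}
	}
	c, ok := m.checks[id]
	if !ok {
		c = &check{ready: make(chan struct{}), deadline: time.Now().Add(ttl)}
		m.checks[id] = c
	}
	return c
}

// gc drops expired entries so abandoned checks do not leak memory.
func (m *checkMap) gc() {
	m.mu.Lock()
	defer m.mu.Unlock()
	now := time.Now()
	for id, c := range m.checks {
		if now.After(c.deadline) {
			delete(m.checks, id)
		}
	}
}

// applyComputeChecksum runs when the Raft command applies; the hash is
// computed in the background so the Raft apply loop is not blocked.
func (m *checkMap) applyComputeChecksum(id string, computeSHA func() []byte) {
	c := m.getOrCreate(id, time.Hour)
	go func() {
		c.checksum = computeSHA()
		close(c.ready) // wake any CollectChecksum caller already waiting
	}()
}

// collectChecksum serves the blocking RPC; it may run before the Raft
// command applies, in which case it waits until the checksum is ready.
func (m *checkMap) collectChecksum(ctx context.Context, id string) ([]byte, error) {
	c := m.getOrCreate(id, time.Hour)
	select {
	case <-c.ready:
		return c.checksum, nil
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}
```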

`ComputeChecksum` requests have a `version` field, which specifies the
checksum algorithm. This allows us to switch algorithms without
downtime. The current algorithm is to apply SHA-512 to all of the KV
pairs returned from `replicaDataIterator`.
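
For concreteness, the per-replica checksum might be computed roughly as
below. Only the use of SHA-512 over the iterator's KV pairs comes from
this RFC; the `kv` type, the length-prefixed encoding, and the version
constant are assumptions made for the sketch:

```go
package main

import (
	"crypto/sha512"
	"encoding/binary"
	"fmt"
)

// kv is a stand-in for the key/value pairs yielded by replicaDataIterator.
type kv struct{ key, value []byte }

// checksumVersionSHA512 plays the role of the version field in
// ComputeChecksum; a new algorithm would get a new constant.
const checksumVersionSHA512 = 1

// replicaChecksum hashes every KV pair in iteration order. Length prefixes
// keep (key="ab", value="c") distinct from (key="a", value="bc").
func replicaChecksum(version int, kvs []kv) ([]byte, error) {
	if version != checksumVersionSHA512 {
		return nil, fmt.Errorf("unknown checksum version %d", version)
	}
	h := sha512.New()
	var lenBuf [8]byte
	for _, p := range kvs {
		binary.BigEndian.PutUint64(lenBuf[:], uint64(len(p.key)))
		h.Write(lenBuf[:])
		h.Write(p.key)
		binary.BigEndian.PutUint64(lenBuf[:], uint64(len(p.value)))
		h.Write(lenBuf[:])
		h.Write(p.value)
	}
	return h.Sum(nil), nil
}

func main() {
	sum, _ := replicaChecksum(checksumVersionSHA512, []kv{{[]byte("a"), []byte("1")}})
	fmt.Printf("%x\n", sum)
}
```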

If the initiator needs to retry a consistency check but finds that the
range has been split or merged, it logs an error instead.

API
---

A cockroach node will support a command through which an admin or a test
can check the consistency of all ranges for which it is a lease holder,
using the same mechanism provided for the periodic consistency checker.
This will be used in all acceptance tests.

Later, if needed, it will be useful to support a CLI command for an
admin to run consistency checks over a section of the KV map, e.g.
\[roachpb.KeyMin, roachpb.KeyMax). Since the underlying ranges within
the specified section of the KV map can change while consistency is
being checked, this command will be implemented through `kv.DistSender`
to allow command retries in the event of range splits/merges.
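
A rough client-side sketch of that command, assuming a DistSender-like
`Sender` interface that routes the request to every range overlapping
the span and retries when range boundaries move (the request and
interface names are invented for illustration):

```go
package consistency // hypothetical client-side sketch, not the real API

import "context"

// Span is a half-open key interval [Key, EndKey).
type Span struct {
	Key, EndKey []byte
}

// CheckConsistencyRequest asks every range overlapping Span to run the
// same checksum comparison used by the periodic checker.
type CheckConsistencyRequest struct {
	Span Span
}

// Sender abstracts something DistSender-like: it routes a request to each
// overlapping range and retries when range boundaries move underneath it.
type Sender interface {
	Send(ctx context.Context, req CheckConsistencyRequest) error
}

// CheckSpan issues one consistency check over an arbitrary key span, e.g.
// the whole keyspace from KeyMin to KeyMax.
func CheckSpan(ctx context.Context, s Sender, start, end []byte) error {
	return s.Send(ctx, CheckConsistencyRequest{Span: Span{Key: start, EndKey: end}})
}
```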

Failure scenarios
-----------------

If the initiator of a consistency check dies, the check dies with it.
This is acceptable because future range lease holders will initiate new
checks. Replicas that compute a checksum anyway store it until it
expires.

It doesn't matter whether the initiator remains the range lease holder.
The reason that the lease holder initiates is to avoid concurrent
consistency checks on the same range, but there is no correctness issue.

Replicas that die cause their `CollectChecksum` call to time out. The
initiator logs the error and moves on. Replicas that restart without
replaying the `ComputeChecksum` command also cause `CollectChecksum` to
time out, since they have no record of the consistency check. Replicas
that do replay the command are fine.
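
On the initiator's side, this amounts to bounding each `CollectChecksum`
call with a deadline and treating expiry as an ordinary, logged failure.
A minimal sketch, with an illustrative five-second deadline and invented
identifiers:

```go
package consistency // hypothetical sketch of the initiator's timeout handling

import (
	"context"
	"errors"
	"log"
	"time"
)

type replica struct{ id int }

// collectChecksum stands in for the blocking CollectChecksum RPC; a replica
// that died, or restarted without replaying ComputeChecksum, never responds.
func collectChecksum(ctx context.Context, r replica, id string) ([]byte, error) {
	<-ctx.Done() // simulate a replica that never answers
	return nil, ctx.Err()
}

// collectWithTimeout bounds the call and logs the failure rather than
// failing the whole check; the initiator then moves on to the next replica.
func collectWithTimeout(parent context.Context, r replica, id string) []byte {
	ctx, cancel := context.WithTimeout(parent, 5*time.Second)
	defer cancel()
	sum, err := collectChecksum(ctx, r, id)
	if err != nil {
		if errors.Is(err, context.DeadlineExceeded) {
			log.Printf("replica %d: CollectChecksum timed out", r.id)
		} else {
			log.Printf("replica %d: %v", r.id, err)
		}
		return nil
	}
	return sum
}
```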

Drawbacks
=========

Periodically computing checksums has some performance cost. We keep this
cost small by running the consistency checks infrequently (once a day)
and by spacing them out in time for different ranges.

A bug in the consistency checker can raise false alarms.

Alternatives
============

1.  A consistency checker that runs offline, or only in tests.

2.  An online consistency checker that collects checksums from all the
    replicas, computes the majority-agreed-upon checksum, and supplies
    it back to the replicas. While this could be a better solution, we
    feel that we cannot depend on a majority vote: new replicas brought
    up from a snapshot supplied by a bad lease holder would agree with
    that lease holder, resulting in a bad majority vote. This method is
    also slightly more complex and does not necessarily improve upon
    the current design.

3.  A protocol where the initiator gets the diff of an inconsistent
    replica on the first pass. The performance cost of retaining
    snapshot engines is unknown, so we would rather accept the extra
    complexity of the two-pass consistency checker.

Unresolved questions
====================

None.