- Feature Name: Raft consistency checker
- Status: completed
- Start Date: 2016-02-10
- Authors: Ben Darnell, David Eisenstat, Bram Gruneir, Vivek Menezes
- RFC PR: [#4317](https://github.com/cockroachdb/cockroach/pull/4317),
  [#8032](https://github.com/cockroachdb/cockroach/pull/8032)
- Cockroach Issues: [#837](https://github.com/cockroachdb/cockroach/issues/837),
  [#7739](https://github.com/cockroachdb/cockroach/issues/7739)

Summary
=======

An online consistency checker that periodically compares snapshots of
range replicas at a specific point in the Raft log. These snapshots
should be the same.

An API for direct invocation of the checker, to be used in tests and the
CLI.

Motivation
==========

Consistency! Correctness at scale.

Design
======

Each node scans continuously through its local range replicas,
periodically initiating a consistency check on ranges for which it is
currently the lease holder.

1. The initiator of the check invokes the Raft command
   `ComputeChecksum` (in `roachpb.RequestUnion`), marking the point at
   which all replicas take a snapshot and compute its checksum.

2. Outside of Raft, the initiator invokes `CollectChecksum` (in
   `service internal`) on the other replicas. The request message
   includes the initiator's checksum so that whenever a replica's
   checksum is inconsistent, both parties can log that fact.

3. If the initiator discovers an inconsistency, it immediately retries
   the check with the `snapshot` option set to true. In this mode,
   inconsistent replicas include their full snapshot in their
   `CollectChecksum` response. The initiator retains its own snapshot
   long enough to log the diffs and panic (so that someone
   will notice).

Details
-------

The initiator of a consistency check chooses a UUID that relates its
`CollectChecksum` requests to its `ComputeChecksum` request
(`checksum_id`). Retried checks use a different UUID.

Replicas store information about ongoing consistency checks in a map
keyed by UUID. The entries of this map expire after some time so that
failures don't cause memory leaks.

To avoid blocking Raft, replicas handle `ComputeChecksum` requests
asynchronously via MVCC. `CollectChecksum` calls are outside of Raft and
block until the response checksum is ready. Because the channels are
separate, replicas may receive related requests out of order.

`ComputeChecksum` requests have a `version` field, which specifies the
checksum algorithm. This allows us to switch algorithms without
downtime. The current algorithm is to apply SHA-512 to all of the KV
pairs returned from `replicaDataIterator`.

If the initiator needs to retry a consistency check but finds that the
range has been split or merged, it logs an error instead.
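To make the flow above concrete, here is a minimal sketch of the initiator's
side of a check. It is an illustration of steps 1-3 and the details above,
not the actual implementation: `replicaClient`, `rangeSnapshot`,
`takeLocalSnapshot`, and `newUUID` are hypothetical stand-ins (the real types
and RPC signatures live in `roachpb` and the internal gRPC service and differ
from these), and diffing of the full snapshots returned on the retry pass is
omitted.

```go
package consistency

import (
	"bytes"
	"context"
	"crypto/sha512"
	"log"
	"time"
)

// ChecksumVersion stands in for the `version` field of ComputeChecksum;
// bumping it lets the cluster switch checksum algorithms without downtime.
const ChecksumVersion = 1

// collectTimeout bounds how long the initiator waits for each replica.
const collectTimeout = 10 * time.Second

// KVPair and rangeSnapshot are hypothetical stand-ins for the engine types
// behind replicaDataIterator; they exist only to keep the sketch
// self-contained.
type KVPair struct {
	Key, Value []byte
}

type rangeSnapshot interface {
	Iterate(func(KVPair) error) error
}

// checksumOf applies SHA-512 over all KV pairs in a snapshot, the version-1
// algorithm described above.
func checksumOf(snap rangeSnapshot) ([]byte, error) {
	h := sha512.New()
	if err := snap.Iterate(func(kv KVPair) error {
		h.Write(kv.Key)
		h.Write(kv.Value)
		return nil
	}); err != nil {
		return nil, err
	}
	return h.Sum(nil), nil
}

// replicaClient is a hypothetical view of the two operations the design
// relies on; the real RPCs and their signatures differ.
type replicaClient interface {
	// ComputeChecksum is proposed through Raft; every replica that applies it
	// snapshots its data at that log position and computes the checksum.
	ComputeChecksum(ctx context.Context, checksumID string, version int, withSnapshot bool) error
	// CollectChecksum is an ordinary RPC (outside Raft) that blocks until the
	// replica's checksum for checksumID is ready, or times out.
	CollectChecksum(ctx context.Context, replica, checksumID string, ownChecksum []byte) ([]byte, error)
}

// runCheck sketches the initiator's side of steps 1-3. takeLocalSnapshot and
// newUUID are assumed hooks: the former yields the initiator's own snapshot
// at the ComputeChecksum point, the latter generates the checksum_id.
func runCheck(ctx context.Context, c replicaClient, replicas []string,
	takeLocalSnapshot func() rangeSnapshot, newUUID func() string, withSnapshot bool) error {

	id := newUUID() // a retried check uses a fresh UUID
	if err := c.ComputeChecksum(ctx, id, ChecksumVersion, withSnapshot); err != nil {
		return err
	}
	own, err := checksumOf(takeLocalSnapshot())
	if err != nil {
		return err
	}

	inconsistent := false
	for _, r := range replicas {
		rpcCtx, cancel := context.WithTimeout(ctx, collectTimeout)
		theirs, err := c.CollectChecksum(rpcCtx, r, id, own)
		cancel()
		if err != nil {
			// Dead or restarted replicas simply time out; log and move on.
			log.Printf("check %s: collecting from %s failed: %v", id, r, err)
			continue
		}
		if !bytes.Equal(own, theirs) {
			log.Printf("check %s: replica %s is inconsistent", id, r)
			inconsistent = true
		}
	}

	if inconsistent && !withSnapshot {
		// Retry once with the snapshot option so inconsistent replicas return
		// their full snapshots and the diffs can be logged before panicking.
		return runCheck(ctx, c, replicas, takeLocalSnapshot, newUUID, true)
	}
	if inconsistent {
		panic("consistency check failed; see logged diffs")
	}
	return nil
}
```

Note that the retry pass deliberately generates a fresh `checksum_id`,
matching the rule above that retried checks use a different UUID.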
API
---

A cockroach node will support a command through which an admin or a test
can check the consistency of all ranges for which it is a lease holder,
using the same mechanism provided for the periodic consistency checker.
This will be used in all acceptance tests.

Later, if needed, it will be useful to support a CLI command for an admin
to run consistency checks over a section of the KV map, e.g.,
\[roachpb.KeyMin, roachpb.KeyMax). Since the underlying ranges within a
specified KV section of the map can change while consistency is being
checked, this command will be implemented through kv.DistSender to allow
command retries in the event of range splits/merges.

Failure scenarios
-----------------

If the initiator of a consistency check dies, the check dies with it.
This is acceptable because future range lease holders will initiate new
checks. Replicas that compute a checksum anyway store it until it
expires.

It doesn't matter whether the initiator remains the range lease holder.
The reason that the lease holder initiates is to avoid concurrent
consistency checks on the same range, but there is no correctness issue.

Replicas that die cause their `CollectChecksum` call to time out. The
initiator logs the error and moves on. Replicas that restart without
replaying the `ComputeChecksum` command also cause `CollectChecksum` to
time out, since they have no record of the consistency check. Replicas
that do replay the command are fine.

Drawbacks
=========

There could be some performance cost to periodically computing the
checksum. We mitigate it by running the consistency checks infrequently
(once a day) and by spacing them out in time for different ranges.

A bug in the consistency checker can raise false alarms.

Alternatives
============

1. A consistency checker that runs offline, or only in tests.

2. An online consistency checker that collects checksums from all the
   replicas, computes the checksum agreed upon by the majority, and
   supplies it to the replicas. While this could be a better solution,
   we feel that we cannot depend on a majority vote: new replicas
   brought up by a bad lease holder supplying them with a snapshot
   would agree with that lease holder, resulting in a bad majority
   vote. This method is also slightly more complex and does not
   necessarily improve upon the current design.

3. A protocol in which the initiator gets the diff of an inconsistent
   replica on the first pass. This would require retaining snapshot
   engines on every check, and that performance cost is unknown, so we
   would rather complicate the consistency checker with a second,
   snapshot-enabled pass.

Unresolved questions
====================

None.