- Feature Name: segmented_storage
- Status: rejected
- Start Date: 2015-07-29
- RFC PR: [#1866](https://github.com/cockroachdb/cockroach/pull/1866)
- Cockroach Issue: [#1644](https://github.com/cockroachdb/cockroach/issues/1644)

# Rejection notes

This proposal was deemed too complex and expensive for the problem it
solves. Instead, we will drop snapshots whose application would create
a conflict in the `replicasByKey` map. This avoids the race conditions
in issue #1644, but leaves the range in an uninitialized and unusable
state. In the common case, this state will resolve quickly, and in the
uncommon case when it persists, we simply rely on the usual repair and
recovery process to move the replica to a new node.

# Summary

Partition the RocksDB keyspace into segments so that replicas created
by raft replication do not share physical storage with replicas
created by splits.

# Motivation

Currently, keys in the distributed sorted map correspond more or less
directly to keys in RocksDB. This makes splits and merges cheap
(since the bulk of the data does not need to be moved to a new
location), but it introduces ambiguity since the same RocksDB key
may be owned by different ranges at different times.

For a concrete example of the problems this can cause (discussed more
fully in [#1644](https://github.com/cockroachdb/cockroach/issues/1644)),
consider a node `N3` which is temporarily down while a range `R1` is
split (creating `R2`). When the node comes back up, the leaders of
both `R1` and `R2` (`N1` and `N2` respectively) will try to bring it
up to date. If `R2` acts first, it will see that `N3` doesn't have any
knowledge of `R2` and so it sends a snapshot. The snapshot will replace
data in `R2`'s keyspace, which `N3`'s replica of `R1` still covers.
`N3` cannot correctly process any messages relating to `R2` until `R1`
has caught up to the point of the split.

# Detailed design

## Segment IDs

Each replica is associated with a **segment ID**. When a replica is
created in response to a raft message, it gets a newly-generated
segment ID. When a replica is created as part of `splitTrigger`, it
shares the parent replica's segment ID. Segment IDs are unique per
store and are generated from a store-local counter. They are generally
not sent over the wire (except perhaps for debugging info); all
awareness of segment IDs is contained in the `storage` package.

## Key encoding

We introduce a new level of key encoding at the storage level.
For clarity, the existing `roachpb.EncodedKey` type will be renamed to
`roachpb.MVCCKey`, and the three levels of encoding will be as follows:

* `Key`: a raw key in the monolithic sorted map.
* `StorageKey`: a `Key` prefixed with a segment ID.
* `MVCCKey`: a `StorageKey` suffixed with a timestamp.

The functions in `storage/engine` will take `StorageKeys` as input and
use `MVCCKeys` internally. All code outside the `storage` package will
continue to use raw `Keys`, and even inside the `storage` package
conversion to `StorageKey` will usually be done immediately before a
call to an MVCC function.
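
To make the three levels concrete, here is a minimal Go sketch of the
conversions, following the fixed-width big-endian layout shown in the
diagram below. The standalone types stand in for the proposed `roachpb`
types, the helper names and parameter widths are illustrative
assumptions rather than part of the proposal, and details such as the
representation of timestamp-less keys are glossed over.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Standalone stand-ins for the proposed roachpb types.
type Key []byte        // raw key in the monolithic sorted map
type StorageKey []byte // Key prefixed with a segment ID
type MVCCKey []byte    // StorageKey suffixed with a timestamp

// MakeStorageKey (hypothetical) prefixes a raw key with the replica's
// segment ID as a fixed-width big-endian integer.
func MakeStorageKey(segmentID uint32, key Key) StorageKey {
	buf := make([]byte, 4+len(key))
	binary.BigEndian.PutUint32(buf, segmentID)
	copy(buf[4:], key)
	return StorageKey(buf)
}

// MakeMVCCKey (hypothetical) suffixes a storage key with an 8-byte wall
// time and a 4-byte logical counter, also big-endian.
func MakeMVCCKey(sk StorageKey, wallTime uint64, logical uint32) MVCCKey {
	buf := make([]byte, len(sk)+12)
	copy(buf, sk)
	binary.BigEndian.PutUint64(buf[len(sk):], wallTime)
	binary.BigEndian.PutUint32(buf[len(sk)+8:], logical)
	return MVCCKey(buf)
}

func main() {
	sk := MakeStorageKey(7, Key("a")) // 00 00 00 07 61
	fmt.Printf("% x\n", MakeMVCCKey(sk, 12345, 1))
}
```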

The actual encoding will use fixed-width big-endian integers, similar
to the encoding of the timestamp in `MVCCKey`. Thus a fully-encoded key
is:

```
+---------------------------------------------------+
|                  roachpb.MVCCKey                  |
+--------------------------+
|    roachpb.StorageKey    |
             +-------------+
             | roachpb.Key |

 Segment ID  |  Raw key    | Wall time | Logical TS |
 4 bytes     |  (variable) |  8 bytes  |  4 bytes   |
```

All keys not associated with a replica (including the counter used to
generate segment IDs) will use segment ID 0.

## Splitting and snapshotting

Ranges can be created in two ways (ignoring the initial bootstrapping
of the first range): an existing range splits into a new range on the
same store, or a raft leader sends a snapshot to a store that should
have a replica of the range but doesn't.

Each replica-creation path will need to consider whether the replica
has already been created via the other path (comparing replica IDs,
not just range IDs); a rough sketch of this check appears at the end
of this document. In `splitTrigger`, if the replica already exists
under a different segment, then a snapshot occurred before the split.
The left-hand range should delete all data outside the bounds
established by the split. In the `ApplySnapshot` path, a new segment
will need to be created only if the replica has not already been
assigned a segment.

TODO(bdarnell): `ApplySnapshot` happens in the `Store`'s raft
goroutine, but raft may call other (read-only) methods on its own
goroutine. I think this is safe (raft already has to handle the data
changing out from under it in other ways), but we should double-check
that raft behaves sanely in this case.

# Drawbacks

* Adding a segment ID to every key is a non-trivial storage cost.
* Merges will require copying all of the data of at least one range to
  put the two ranges into the same segment.

# Alternatives

* An earlier version of this proposal did not reuse segment IDs on
  splits, so splits required copying the new range's data to a new
  segment (segments were also identified by a (range ID, replica ID)
  tuple instead of a separate ID).

# Unresolved questions

* Whenever a split and a snapshot race, we are wasting work, since the
  snapshot will be ignored if the split completes while the snapshot
  is in flight. It's probably worthwhile to prevent or delay sending
  snapshots when an in-progress split should be able to accomplish the
  same thing more cheaply. This is less of an issue currently, as new
  ranges are started in a leaderless state and so no snapshots will be
  sent until a round of elections, but we intend to kick-start
  elections in this case to minimize unavailability, so we will need
  to be mindful of the cost of premature snapshots.
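
For illustration, the following rough Go sketch shows the
replica-creation bookkeeping described under "Splitting and
snapshotting" above (and referenced there). The `store` and
`replicaKey` types, the map-based segment table, and both function
names are invented for this sketch and are not part of the proposal;
locking, error handling, and the actual deletion of data are omitted.

```go
package main

import "fmt"

type replicaKey struct {
	rangeID   int64
	replicaID int32
}

type store struct {
	nextSegmentID uint32                // store-local counter; segment ID 0 is reserved
	segments      map[replicaKey]uint32 // replica -> segment ID
}

// applySnapshot assigns a new segment only if the replica has not
// already been assigned one (e.g. by an earlier split).
func (s *store) applySnapshot(r replicaKey) uint32 {
	if seg, ok := s.segments[r]; ok {
		return seg // a split already created this replica; keep its segment
	}
	s.nextSegmentID++
	s.segments[r] = s.nextSegmentID
	return s.nextSegmentID
}

// applySplitTrigger places the right-hand replica in the parent's
// segment, unless a snapshot got there first.
func (s *store) applySplitTrigger(lhs, rhs replicaKey) {
	lhsSeg := s.segments[lhs]
	if rhsSeg, ok := s.segments[rhs]; ok && rhsSeg != lhsSeg {
		// A snapshot raced ahead of the split: the right-hand replica
		// already lives in its own segment, so the left-hand range must
		// delete the data outside its post-split bounds (stubbed here).
		fmt.Printf("clear right-hand keys from segment %d\n", lhsSeg)
		return
	}
	s.segments[rhs] = lhsSeg // normal case: the split shares the parent's segment
}

func main() {
	s := &store{
		nextSegmentID: 1,
		segments:      map[replicaKey]uint32{{rangeID: 1, replicaID: 3}: 1},
	}
	// A snapshot for the new range R2 arrives before R1's split is applied...
	s.applySnapshot(replicaKey{rangeID: 2, replicaID: 3})
	// ...so the split finds R2 already assigned to its own segment.
	s.applySplitTrigger(replicaKey{rangeID: 1, replicaID: 3}, replicaKey{rangeID: 2, replicaID: 3})
	fmt.Println(s.segments) // map[{1 3}:1 {2 3}:2]
}
```

The example in `main` reproduces the race from the motivation section:
a snapshot for the new right-hand range arrives before the split is
applied, so the split must clear the right-hand keyspace from the
parent's segment instead of sharing the segment with it.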