- Feature Name: segmented_storage
- Status: rejected
- Start Date: 2015-07-29
- RFC PR: [#1866](https://github.com/cockroachdb/cockroach/pull/1866)
- Cockroach Issue: [#1644](https://github.com/cockroachdb/cockroach/issues/1644)

# Rejection notes

This proposal was deemed too complex and expensive for the problem it
solves. Instead, we will drop snapshots whose application would create
a conflict in the `replicasByKey` map. This avoids the race conditions
in issue #1644, but leaves the range in an uninitialized and unusable
state. In the common case, this state will resolve quickly, and in the
uncommon case when it persists, we simply rely on the usual repair and
recovery process to move the replica to a new node.

# Summary

Partition the RocksDB keyspace into segments so that replicas created
by raft replication do not share physical storage with replicas
created by splits.

# Motivation

Currently, keys in the distributed sorted map correspond more or less
directly to keys in RocksDB. This makes splits and merges cheap
(since the bulk of the data does not need to be moved to a new
location), but it introduces ambiguity since the same RocksDB key
may be owned by different ranges at different times.

For a concrete example of the problems this can cause (discussed more
fully in
[#1644](https://github.com/cockroachdb/cockroach/issues/1644)),
consider a node `N3` which is temporarily down while a range `R1` is
split (creating `R2`). When the node comes back up, the leaders of
both `R1` and `R2` (`N1` and `N2` respectively) will try to bring it
up to date. If `R2` acts first, it will see that `N3` doesn't have any
knowledge of `R2` and so it sends a snapshot. The snapshot will replace
data in `R2`'s keyspace, which `N3`'s replica of `R1` still covers.
`N3` cannot correctly process any messages relating to `R2` until `R1`
has caught up to the point of the split.

# Detailed design

## Segment IDs

Each replica is associated with a **segment ID**. When a replica is
created in response to a raft message, it gets a newly-generated
segment ID. When a replica is created as a part of `splitTrigger`, it
shares the parent replica's segment ID. Segment IDs are unique per
store and are generated from a store-local counter. They are generally
not sent over the wire (except perhaps for debugging info); all
awareness of segment IDs is contained in the storage package.
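
To make the assignment rule concrete, here is a minimal sketch in Go. The
`segmentAllocator` type and the `segmentForNewReplica` function are
hypothetical, illustration-only names (not existing APIs), and the real
counter would be persisted in the store-local keyspace rather than held in
memory.

```go
package storage

import "sync"

// segmentAllocator sketches the store-local counter that generates
// store-unique segment IDs.
type segmentAllocator struct {
    mu   sync.Mutex
    next uint32
}

// newID returns a newly generated segment ID.
func (a *segmentAllocator) newID() uint32 {
    a.mu.Lock()
    defer a.mu.Unlock()
    a.next++
    return a.next
}

// segmentForNewReplica applies the assignment rule described above: a
// replica created in response to a raft message gets a fresh segment ID,
// while a replica created by splitTrigger shares its parent's.
func segmentForNewReplica(a *segmentAllocator, createdBySplit bool, parentSegment uint32) uint32 {
    if createdBySplit {
        return parentSegment
    }
    return a.newID()
}
```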

## Key encoding

We introduce a new level of key encoding at the storage level.
For clarity, the existing `roachpb.EncodedKey` type will be renamed to
`roachpb.MVCCKey`, and the three levels of encoding will be as follows:

* `Key`: a raw key in the monolithic sorted map.
* `StorageKey`: a `Key` prefixed with a segment ID.
* `MVCCKey`: a `StorageKey` suffixed with a timestamp.

The functions in `storage/engine` will take `StorageKeys` as input and
use `MVCCKeys` internally. All code outside the `storage` package will
continue to use raw `Keys`, and even inside the `storage` package
conversion to `StorageKey` will usually be done immediately before a
call to an MVCC function.

The actual encoding will use fixed-width big-endian integers, similar
to the encoding of the timestamp in MVCCKey. Thus a fully-encoded key
is:

```
+------------------------------------------------+
|                roachpb.MVCCKey                 |
+-----------------------+
|  roachpb.StorageKey   |
           +------------+
           | roachpb.Key|

Segment ID | Raw key    | Wall time | Logical TS |
4 bytes    | (variable) | 8 bytes   | 4 bytes    |
```

All keys not associated with a replica (including the counter used to
generate segment IDs) will use segment ID 0.
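
To make the byte layout concrete, here is a minimal sketch of the encoding
in Go. The helper names (`encodeStorageKey`, `encodeMVCCKey`) are
hypothetical; the real encoding would live in `storage/engine` and operate
on the renamed key types.

```go
package storage

import "encoding/binary"

// encodeStorageKey prefixes a raw key with its 4-byte big-endian segment
// ID, producing the StorageKey layout shown in the diagram above.
func encodeStorageKey(segmentID uint32, key []byte) []byte {
    buf := make([]byte, 4+len(key))
    binary.BigEndian.PutUint32(buf, segmentID)
    copy(buf[4:], key)
    return buf
}

// encodeMVCCKey appends the fixed-width timestamp suffix (8-byte wall
// time, 4-byte logical) to an encoded StorageKey.
func encodeMVCCKey(storageKey []byte, wallTime int64, logical int32) []byte {
    buf := make([]byte, len(storageKey)+12)
    n := copy(buf, storageKey)
    binary.BigEndian.PutUint64(buf[n:], uint64(wallTime))
    binary.BigEndian.PutUint32(buf[n+8:], uint32(logical))
    return buf
}
```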

## Splitting and snapshotting

Ranges can be created in two ways (ignoring the initial bootstrapping
of the first range): an existing range splits into a new range on the
same store, or a raft leader sends a snapshot to a store that should
have a replica of the same range but doesn't.

Each replica-creation path will need to consider whether the replica
has already been created via the other path (comparing replica IDs,
not just range IDs). In `splitTrigger`, if the replica already exists
under a different segment, then a snapshot occurred before the split.
In that case the left-hand range should delete all data outside the
bounds established by the split. In the `ApplySnapshot` path, a new
segment will need to be created only if the replica has not already
been assigned a segment.
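
A rough sketch of these checks follows; the function names and signatures
are invented for illustration and are not part of the existing `storage`
package.

```go
package storage

// segmentForSplit is called from the splitTrigger path. The new right-hand
// replica normally inherits the parent's segment; if a snapshot already
// created it under a different segment, the left-hand range must instead
// delete its data outside the bounds established by the split.
func segmentForSplit(parentSegment uint32, existing *uint32) (segment uint32, trimLeftHand bool) {
    if existing != nil && *existing != parentSegment {
        // The snapshot won the race; keep its segment and trim the parent.
        return *existing, true
    }
    return parentSegment, false
}

// segmentForSnapshot is called from the ApplySnapshot path. A new segment
// is allocated only if the replica has not already been assigned one (for
// example by an earlier split).
func segmentForSnapshot(existing *uint32, allocate func() uint32) uint32 {
    if existing != nil {
        return *existing
    }
    return allocate()
}
```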

TODO(bdarnell): `ApplySnapshot` happens in the `Store`'s raft
goroutine, but raft may call other (read-only) methods on its own
goroutine. I think this is safe (raft already has to handle the data
changing out from under it in other ways), but we should double-check
that raft behaves sanely in this case.

# Drawbacks

* Adding a segment ID to every key is a non-trivial storage cost.
* Merges will require copying all of the data of at least one range to
  bring both ranges into the same segment.

# Alternatives

* An earlier version of this proposal did not reuse segment IDs on
  splits, so splits required copying the new range's data to a new
  segment (segments were also identified by a (range ID, replica ID)
  tuple instead of a separate ID).

# Unresolved questions

* Whenever a split and a snapshot race, work is wasted: the snapshot
  will be ignored if the split completes while the snapshot is in
  flight. It is probably worthwhile to prevent or delay sending
  snapshots when an in-progress split can accomplish the same thing
  more cheaply. This is less of an issue today because new ranges
  start in a leaderless state, so no snapshots are sent until a round
  of elections has taken place. However, we intend to kick-start
  elections in this case to minimize unavailability, so we will need
  to be mindful of the cost of premature snapshots.