
- Feature Name: Stateless Replica Relocation
- Status: completed
- Start Date: 2015-08-19
- RFC PR: [#2171](https://github.com/cockroachdb/cockroach/pull/2171)
- Cockroach Issue: [#620](https://github.com/cockroachdb/cockroach/issues/620)

# Summary
A relocation is, conceptually, the transfer of a single replica from one store to
another. However, the implementation is necessarily the combination of two operations:

1. Creating a new replica for a range on a new store.
2. Removing a replica of the same range from a different store.

For example, by creating a replica on store Y and then removing a replica from
store X, you have in effect moved the replica from X to Y.

This RFC proposes an overall architectural design goal: the decision to perform
any individual operation (either a create or a remove) should be **stateless**.
In other words, the second operation in a replica relocation should not depend
on a specific invocation of the first operation.

# Motivation
For an assortment of reasons, Cockroach will often need to relocate the replicas
of a range. Most immediately, this is needed for repair (when a store dies and
its replicas are no longer usable) and rebalance (relocating replicas on
overloaded stores to stores with excess capacity).

A relocation must be expressed as a combination of two operations:

1. Creating a new replica for a range on a new store.
2. Removing a replica of the same range from a different store.

These operations can happen in either order as long as quorum is maintained in
the range's raft group after each individual operation, although one ordering
may be preferred over another.

Expressing a specific relocation (i.e. "move replica from store X to store Y")
would require maintaining some persistent state to link the two operations
involved. Storing that state presents a number of issues: where is it stored,
in memory or on disk? If it's in memory, does it have to be replicated through
raft or is it local to one node? If on disk, can the persistent state become
stale? How do you detect conflicts between two relocation operations initiated
from different places?

This RFC suggests that no such relocation state should be persisted. Instead, a
system that wants to initiate a relocation will perform only the first
operation; a different system will later detect the need for a complementary
operation and perform it. A relocation is thus completed without any state
being exchanged between those two systems.

By eliminating the need to persist any data about in-progress relocation
operations, the overall system is dramatically simplified.

# Detailed design
The implementation involves a few pieces:

1. Each range must have a persisted *target replication state*. This does not
   prescribe specific replica locations; it specifies a required count of
   replicas, along with some desired attributes for the stores where they are
   placed.
2. The core mechanic is a stateless function which compares the immediate
   replication state of a range to its target replication state; if the two
   differ, this function either creates or removes a replica in order to move
   the range towards the target state. By running multiple times (adding or
   removing a replica each time), the target replication state will eventually
   be matched (see the sketch after this list).
3. Any operations that wish to *relocate* a replica need only perform the first
   operation of the relocation (either a create or a remove). This will perturb
   the range's replication state away from the target; the core function will
   later detect that mismatch and correct it by performing the complementary
   operation of the relocation (a remove or a create).
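
The following Go sketch illustrates the core mechanic; `rangeState`,
`targetState`, and the selection helpers are hypothetical stand-ins for this
document, not actual CockroachDB types or APIs:

```go
package replication

// Hypothetical stand-ins for a range's current membership and for the
// target state derived from its zone configuration.
type storeID int

type rangeState struct {
	replicas []storeID // stores currently holding a replica of the range
}

type targetState struct {
	numReplicas int      // required replica count
	attrs       []string // desired attributes of the hosting stores
}

// Placeholder helpers; real versions would score candidate stores against
// the desired attributes and issue the actual replica-change commands.
func pickStoreToAdd(cur rangeState, t targetState) storeID      { return 0 }
func pickReplicaToRemove(cur rangeState, t targetState) storeID { return cur.replicas[0] }
func addReplica(s storeID)                                      {}
func removeReplica(s storeID)                                   {}

// reconcile compares the immediate state to the target and performs at
// most one operation, moving the range one step closer to the target.
// Because it consults only these two states, no record of an in-progress
// relocation is needed; repeated invocations converge on the target.
func reconcile(cur rangeState, target targetState) {
	switch {
	case len(cur.replicas) < target.numReplicas:
		addReplica(pickStoreToAdd(cur, target)) // under-replicated: create
	case len(cur.replicas) > target.numReplicas:
		removeReplica(pickReplicaToRemove(cur, target)) // over-replicated: remove
	}
}
```

Under this model, a relocation is just a perturbation followed by
reconciliation: adding a replica leaves the range over-replicated, and the next
`reconcile` invocation removes one.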

The first piece is already present: each range has a zone configuration
which determines the target replication state for the range.

The second piece, the core mechanic, will be performed by the existing
"replicate queue", which will be renamed the "*replication queue*". This queue
is already used to add replicas to ranges which are under-replicated; it can be
enhanced to remove replicas from over-replicated ranges, thus satisfying the
basic requirements of the core mechanic.

The third piece simply informs the design of systems performing relocations; for
example, the upcoming repair and rebalance systems (still being planned). After
identifying a relocation opportunity, these systems will perform the first
operation of the relocation (add or remove) and then insert the corresponding
replica into the local replication queue. The replication queue will then
perform the complementary operation.

The final complication is how to ensure that the replication queue promptly
identifies ranges outside of their ideal target state. As a queue it will be
attached to the replica scanner, but we will also want to enqueue a replica
immediately whenever we knowingly perturb the replication state. Thus,
components initiating a relocation (e.g. rebalance or repair) should immediately
enqueue their target replicas after changing the replication state.
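
As a minimal sketch of such an initiator (reusing the hypothetical types from
the earlier sketch; `replicationQueue` and `maybeAdd` are illustrative names,
not the real queue interface):

```go
// A toy stand-in for the intake side of the replication queue.
type replicationQueue struct {
	pending []*rangeState
}

func (q *replicationQueue) maybeAdd(r *rangeState) {
	q.pending = append(q.pending, r)
}

// rebalanceOnto performs only the first half of a relocation (creating a
// replica on dest), then enqueues the range immediately rather than
// waiting for the periodic replica scanner to notice the perturbation.
// The replication queue later performs the complementary removal.
func rebalanceOnto(rng *rangeState, dest storeID, q *replicationQueue) {
	addReplica(dest)
	rng.replicas = append(rng.replicas, dest)
	q.maybeAdd(rng)
}
```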

# Drawbacks

### Specific Move Operations
A stateless model for relocation precludes the ability to request specific
relocations; only the first operation can be made with specificity.

For example, the verb "Move replica from X to Y" cannot be expressed with
certainty; instead, only "Move a replica from X (to some store)" or "Move a
replica to Y (from some store)" can be expressed. The replication queue will be
responsible for selecting an appropriate store for the complementary operation.

It is assumed that this level of specificity is simply not needed for any
relocation operations; there is no currently apparent use case where a move
between a specific pair of stores is needed.

Even if this were necessary, it might be possible to express it by manipulating
the factors behind the individual stateless decisions.

### Thrashing of complementary operation
Because there is no relocation state, the possibility of "thrashing" is
introduced. For example:

1. A rebalance operation adds a new replica to the range.
2. The replication queue picks up the range and detects the need to remove a
   replica; however, it decides to remove the replica that was just added.

This is possible if the rebalance's criteria for selecting a new replica are
sufficiently different from the replication queue's criteria for selecting a
replica to remove.

To reduce this risk, there must be sufficient agreement between the criteria
used by the two operations; a rebalance operation should avoid adding a new
replica if there is a realistic chance that the replication queue will
immediately remove it.

This can be realized by adding a "buffer" zone between the different criteria;
that is, when selecting a store to add a replica to, the choice should be
*significantly* clear of the criteria for removing a replica, thus reducing the
chances that the new replica will be selected for removal.
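
One way such a buffer might be realized, sketched with hypothetical scores and
thresholds (the real criteria would weigh capacity, load, and store
attributes):

```go
// Hysteresis between the add and remove criteria: a store must clear the
// removal threshold by a margin before a new replica is placed on it, so
// a freshly added replica is unlikely to be the next removal candidate.
const (
	removeThreshold = 0.3 // replicas on stores scoring below this may be removed
	addBuffer       = 0.2 // extra margin required before adding to a store
)

func shouldAddTo(storeScore float64) bool {
	return storeScore > removeThreshold+addBuffer
}

func shouldRemoveFrom(storeScore float64) bool {
	return storeScore < removeThreshold
}
```

With `addBuffer` at zero the two criteria meet, and a small fluctuation in a
store's score could cause the queue to remove exactly the replica that the
rebalancer just added.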

When properly tuned to avoid thrashing, the stateless nature could instead be
considered a positive, because it can respond to changes in the interim:
consider a case where, due to a delay, the relative health of nodes changes and
the originally selected replica is no longer a good option. The stateless
system will respond correctly to this situation.

### Delay in complementary operation
By separating the two operations, we are possibly delaying the application of
the second operation. For example, the replication queue could be very busy, or
an untimely crash could result in the range being under- or over-replicated
without being in the replication queue on any store.

However, many of these concerns are allayed by the existence of repair; if the
node goes down, the repair system will add the range to the replication queue
on another store.

Even in an exotic failure scenario, the stateless design will eventually detect
the replication anomaly through the normal operation of the replica scanner.

### Lack of non-voting replicas
Our raft implementation currently lacks support for non-voting members; as a
result, some types of relocation will temporarily make the affected range more
fragile.

When a replica is initially created, it is very far behind the current state of
the range and thus needs to receive a snapshot. It may take some time before
the new replica fully catches up and can take part in quorum commits.

However, without non-voting replicas we have no choice but to add the new
replica as a full member, thus changing the quorum requirements of the group.
For a group with an odd number of replicas, this increases the quorum count by
one while the new replica is still unable to take part in quorum decisions.
This increases the chance of losing quorum until that replica is caught up,
thus reducing availability.
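
For concreteness, the quorum arithmetic behind this fragility (standard raft
majority sizes, not CockroachDB-specific code):

```go
// quorum returns the majority size for a raft group of n voting members.
func quorum(n int) int { return n/2 + 1 }

// quorum(3) == 2: a 3-replica group tolerates one failure.
// quorum(4) == 3: after a fourth member is added, three votes are needed,
// and until the newcomer catches up they must all come from the original
// three replicas, so a single failure among them stalls the group.
```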

This could be mitigated somewhat without fully implementing non-voting
replicas; in preparation for adding a replica, we could manually generate a
snapshot and send it to the node *before* adding it to the raft configuration.
This would decrease the window of time between adding the replica and having it
fully catch up.
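
A sketch of that mitigation, again reusing the earlier hypothetical types;
`captureSnapshot` and `sendSnapshot` are illustrative placeholders:

```go
type snapshot []byte

// Placeholder snapshot plumbing; real code would serialize the range's
// data and stream it to the destination store.
func captureSnapshot(r *rangeState) snapshot      { return nil }
func sendSnapshot(dest storeID, s snapshot) error { return nil }

// addReplicaPreseeded streams a snapshot to the destination *before* the
// raft configuration change, so the new member joins nearly caught up and
// the window of weakened quorum stays short.
func addReplicaPreseeded(rng *rangeState, dest storeID) error {
	if err := sendSnapshot(dest, captureSnapshot(rng)); err != nil {
		return err // membership unchanged; quorum requirements untouched
	}
	addReplica(dest)
	rng.replicas = append(rng.replicas, dest)
	return nil
}
```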

"Lack of non-voting replicas" is listed as a drawback because going forward
with relocation *without* non-voting replication introduces this fragility,
regardless of how relocation decisions are made. Stateless relocation will
still work correctly when non-voting replicas are implemented; there will
simply be a delay in the case where a replica is added first (e.g. rebalance),
with the removal not taking place until the non-voting replica catches up and
is upgraded to a full group member. This is not trivial, but will still allow
for stateless decisions.

# Alternatives
The main alternative would be some sort of stateful system, where relocation
operations would be expressed explicitly (i.e. "Move replica from X to Y") and
then seen through to completion. For reasons outlined in the "Motivation"
section, this is considered sub-optimal when making distributed decisions.

### Relocation Master
The ultimate expression of this would be the designation of a "relocation
master": a single node that makes all relocation decisions for the entire
cluster.

There is enough information in gossip for a central node to make acceptable
decisions about replica movement. In some ways the individual decisions would
be worse, because they would be unable to consider current raft state; however,
in aggregate the decisions could be much better, because the central master
could consider groups of relocations together. For example, it would be able to
avoid distributed pitfalls such as under-loaded nodes being suddenly inundated
with requests for new replicas. It would be able to more quickly and correctly
move replicas to new nodes introduced to the system.

This system would also have an easier time communicating the individual
relocation decisions that were made. This could be helpful for debugging or
tweaking relocation criteria.

However, it's important to note that a relocation master is not entirely
incompatible with the "stateless" design; the relocation master could simply be
the initiator of the stateless operations. You could thus get much of the
improved decision making and communication of the master without having to move
to a stateful design.

# Unresolved questions

### Raft leader constraint
The biggest unresolved issue is whether relocation decisions must be
constrained to the raft leader.

There is no firm need for a relocation operation to be initiated from the range
leader; a secondary replica can safely initiate the addition or removal of a
replica, with safety guaranteed by a combination of raft and Cockroach's
transactions.

However, if the criteria for an operation need to consider raft state (such as
which nodes are behind in replication), those decisions can only be made from
the raft leader, which is the only node with the full raft state.
Alternatively, an interface could be provided for non-leader members to query
that information from the leader.
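
Such an interface might look like the following hypothetical sketch (nothing
here reflects an actual CockroachDB or etcd/raft API):

```go
// replicaProgress describes how far one group member lags the raft log.
type replicaProgress struct {
	store   storeID
	matched uint64 // highest log index known to be replicated to this member
}

// leaderState exposes raft progress that only the leader tracks; an
// implementation on a follower would forward the request to the current
// leader.
type leaderState interface {
	FollowerProgress() ([]replicaProgress, error)
}
```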