github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/docs/RFCS/20160824_quiesce_ranges.md

     1  - Feature Name: Quiesce Ranges
     2  - Status: obsolete
     3  - Start Date: 2016-06-24
     4  - Authors: David Taylor, Daniel Harrison, Tobias Schottdorf
     5  - RFC PR: #8811
     6  - Cockroach Issue: [#357](https://github.com/cockroachdb/cockroach/issues/357)
     7  
     8  
     9  # Summary
    10  Replicas of inactive ranges could potentially shut down their raft group to save
    11  CPU, memory, and network traffic.
    12  
    13  # Motivation
    14  As clusters grow to potentially very large numbers of ranges, if access patterns
    15  are such that some of those ranges see no traffic, maintaining the raft groups
    16  and other state for them wastes substantial resources for little benefit.
    17  
    18  Coalesced raft heartbeats potentially mitigate _some_ of this waste, in terms of
    19  number of network requests, but after unpacking still require processing and
    20  thus do not eliminate the marginal cost of maintaining inactive ranges.
    21  
    22  Detecting inactive ranges and quiescing them -- shutting down their raft group
    23  and possibly flagging them to skip other bookkeeping and maintenance -- could
    24  offer substantial savings for clusters with very large numbers of ranges.
    25  
    26  Additionally, some operations may be able to use the fact that a range is in a
    27  quiesced state. For example, some possible implementations of bulk ingestion of
    28  *new* ranges could be simplified (by being able to assume the raft state is
    29  frozen).
    30  
    31  # Detailed design
    32  Replicas of a quiescent range do not maintain a raft group.
    33  
    34  The most significant challenge is coordinating the transition from active to
    35  quiesced: a leader choosing to not maintain a quiesced raft group is easily
    36  mistaken for failing to maintain an active one.
    37  
    38  To enter the quiesced state, the leader sends a quiesce command to all
    39  followers. Upon receipt, followers stop expecting heartbeats for the remainder
    40  of the term. A raft group is quiesced only for one term.
    41  
    42  Once followers are no longer expecting heartbeats, the leader can stop sending
    43  them. If, however, it does this before a follower has received and processed the
    44  quiesce command, that follower will assume the leader is lost and call a new
    45  election, waking the group back up. To prevent this, the leader should continue
    46  sending heartbeats for some period after the command is issued -- specifically
    47  either until all followers acknowledge it or until the election timeout has
    48  elapsed (the election timeout is, by definition, the amount of time after which
    49  a node is assumed to be unreachable).
    50  
    51  If a follower is unreachable for more than the election timeout and thus is
    52  unaware the group has quiesced, it will attempt to call an election and wake the
    53  range back up. Quiescing does not make this worse: having been unreachable for
    54  more than the election timeout, it would have called that election regardless,
    55  ending the term that quiesced.
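
On the follower side, the key property is that quiescence is scoped to a single term. A minimal sketch of that behavior, under the assumption that the election timer is driven by ticks, might look like the following. The `follower` type and its fields are hypothetical simplifications and do not mirror upstream raft's internals.

```go
package main

// follower is a hypothetical, heavily simplified view of a replica's raft
// state, kept only to illustrate term-scoped quiescence.
type follower struct {
	term                uint64
	quiesceSeen         bool   // whether any quiesce command has been accepted
	quiescedTerm        uint64 // the term that command applied to
	ticksSinceHeartbeat int
	electionTicks       int // ticks without a heartbeat before campaigning
}

// handleQuiesce accepts a quiesce command only if it is for the current term.
func (f *follower) handleQuiesce(term uint64) {
	if term == f.term {
		f.quiesceSeen = true
		f.quiescedTerm = term
	}
}

// quiesced reports whether the follower is quiesced for its current term.
func (f *follower) quiesced() bool {
	return f.quiesceSeen && f.quiescedTerm == f.term
}

// observeTerm moves to a newer term, which implicitly ends quiescence because
// the recorded quiesced term no longer matches.
func (f *follower) observeTerm(term uint64) {
	if term > f.term {
		f.term = term
		f.ticksSinceHeartbeat = 0
	}
}

// tick advances the election timer and reports whether the follower should
// call an election. A quiesced follower does not expect heartbeats, so its
// timer does not advance.
func (f *follower) tick() bool {
	if f.quiesced() {
		return false
	}
	f.ticksSinceHeartbeat++
	return f.ticksSinceHeartbeat >= f.electionTicks
}
```

A follower that never received the command keeps ticking normally and campaigns once `electionTicks` is reached, which matches the unreachable-follower case described above.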
    56  
    57  Once quiesced, any replica asked to propose a command should restart raft and
    58  trigger an election to be able to acquire the lease.
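
As a rough illustration of waking on a proposal, the sketch below has a quiesced replica leave the quiesced state and start an election before queuing the command. All names here (`replica`, `campaign`, `propose`) are hypothetical, and `campaign` merely stands in for starting a real raft election by bumping the term.

```go
package main

// replica is a hypothetical, simplified replica used to illustrate waking a
// quiesced range when a command needs to be proposed.
type replica struct {
	term      uint64
	quiesced  bool
	campaigns int      // number of elections this replica has started
	pending   [][]byte // commands waiting for a leader to be established
}

// campaign stands in for starting a raft election, which begins a new term.
func (r *replica) campaign() {
	r.term++
	r.campaigns++
}

// propose queues a command. If the replica is quiesced, it first wakes up and
// triggers an election so that a lease can be acquired in the new term.
func (r *replica) propose(cmd []byte) {
	if r.quiesced {
		r.quiesced = false
		r.campaign()
	}
	r.pending = append(r.pending, cmd)
}
```

Only the first proposal after quiescing pays for the election in this sketch; later proposals find the replica already awake.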
    59  
    60  A node that is partitioned or killed when a range quiesces will be unaware of
    61  that change when it returns, and may initiate a new election causing the range
    62  to return to actively maintaining raft until it quiesces again.
    63  
    64  # Drawbacks
    65  The largest drawback is reasoning about and maintaining the additional
    66  complexity that this introduces.
    67  
    68  As mentioned above, this only benefits clusters where access patterns are such
    69  that large numbers of ranges see no activity.
    70  
    71  This almost certainly needs to be implemented at least partially upstream in
    72  raft -- determining when it is safe to stop heartbeating requires inspecting
    73  internal raft state about each follower. Upstream raft may not see
    74  as much value in incorporating this as the majority of users likely have only a
    75  single raft group, so they would see little benefit.
    76  
    77  Additionally, the first request to a range that has quiesced will see a
    78  non-trivial latency penalty, as it will need to start the raft group back up
    79  before it can proceed (though an in-place revival by the leader discussed below
    80  could mitigate this).
    81  
    82  Killed or partitioned nodes will spuriously wake groups that had quiesced in
    83  their absence, but as the range will still have not seen recent activity, it can
    84  re-quiesce shortly thereafter. Combined with the fact that this should be
    85  somewhat rare in normal operation, the expected cost of spurious wakeups should be
    86  small (and no worse than if the ranges didn't quiesce at all).
    87  
    88  # Alternatives
    89  Rather than quiescing for the remainder of the current term, one could potentially
    90  call an election with an extra quiesced flag. A follower that votes for that
    91  would not expect heartbeats from that leader for that term. This seems to
    92  involve strictly more modification to raft, and is harder to reason about as it
    93  spans multiple terms and must thus consider multiple leader scenarios.
    94  
    95  As a follow-up optimization, the former leader of a quiesced range could
    96  optimistically attempt to revive the quiesced group *in-place*, without starting
    97  a new term, by resuming heartbeats and proposing commands. Any other node
    98  wishing to revive the group would still need to initiate an election for a new
    99  term, as would the former leader if it failed to get a majority of ACKs on the
   100  first command it proposed after waking. This can be done purely as an
   101  optimization at a later time.
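
The decision logic for the in-place revival optimization could be sketched as below. This is a hypothetical illustration, not a proposed implementation: the names are invented, and it assumes that the ACK count includes the former leader's own acknowledgement, so a majority means strictly more than half of the group.

```go
package main

// wakeAction describes how a replica tries to wake a quiesced group.
type wakeAction int

const (
	reviveInPlace wakeAction = iota // resume heartbeats in the current term
	startElection                   // campaign for a new term
)

// chooseWake picks the initial wake strategy: only the former leader of the
// quiesced term may attempt an in-place revival.
func chooseWake(isFormerLeader bool) wakeAction {
	if isFormerLeader {
		return reviveInPlace
	}
	return startElection
}

// revivalSucceeded reports whether the first command proposed after waking
// was acknowledged by a majority. acks is assumed to include the leader.
func revivalSucceeded(acks, groupSize int) bool {
	return acks > groupSize/2
}

// nextStep decides what the former leader does after its first proposal:
// continue in place on success, otherwise fall back to a new election.
func nextStep(acks, groupSize int) wakeAction {
	if revivalSucceeded(acks, groupSize) {
		return reviveInPlace
	}
	return startElection
}
```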