- Feature Name: Quiesce Ranges
- Status: obsolete
- Start Date: 2016-06-24
- Authors: David Taylor, Daniel Harrison, Tobias Schottdorf
- RFC PR: #8811
- Cockroach Issue: [#357](https://github.com/cockroachdb/cockroach/issues/357)


# Summary
Replicas of inactive ranges could potentially shut down their raft group to save
CPU, memory, and network traffic.

# Motivation
As clusters grow to potentially very large numbers of ranges, if access patterns
are such that some of those ranges see no traffic, maintaining the raft groups
and other state for them wastes substantial resources for little benefit.

Coalesced raft heartbeats potentially mitigate _some_ of this waste, in terms of
number of network requests, but after unpacking still require processing and
thus do not eliminate the marginal cost of maintaining inactive ranges.

Detecting inactive ranges and quiescing them -- shutting down their raft group
and possibly flagging them to skip other bookkeeping and maintenance -- could
offer substantial savings for clusters with very large numbers of ranges.

Additionally, some operations may be able to use the fact that a range is in a
quiesced state. For example, some possible implementations of bulk ingestion of
*new* ranges could be simplified (by being able to assume the raft state is
frozen).

# Detailed design
Replicas of a quiescent range do not maintain a raft group.

The most significant challenge is coordinating the transition from active to
quiesced: a leader choosing to not maintain a quiesced raft group is easily
mistaken for failing to maintain an active one.

To enter the quiesced state, the leader sends a quiesce command to all
followers. Upon receipt, followers stop expecting heartbeats for the remainder
of the term.
A raft group is quiesced only for one term.

Once followers are no longer expecting heartbeats, the leader can stop sending
them. If, however, it does this before a follower has received and processed the
quiesce command, that follower will assume the leader is lost and call a new
election, waking the group back up. To prevent this, the leader should continue
sending heartbeats for some period after the command is issued -- specifically
either until all followers acknowledge it or until the election timeout has
elapsed (the election timeout is, by definition, the amount of time after which
a node is assumed to be unreachable).

If a follower is unreachable for more than the election timeout and thus is
unaware the group has quiesced, it will attempt to call an election and wake the
range back up. This, however, would happen regardless: having been unreachable
for more than the election timeout, it was going to call an election anyway,
thus ending the term that quiesced.

Once quiesced, any replica asked to propose a command should restart raft and
trigger an election to be able to acquire the lease.

A node that is partitioned or killed when a range quiesces will be unaware of
that change when it returns, and may initiate a new election, causing the range
to return to actively maintaining raft until it quiesces again.

# Drawbacks
The largest drawback is reasoning about and maintaining the additional
complexity that this introduces.

As mentioned above, this only benefits clusters where access patterns are such
that large numbers of ranges see no activity.

This almost certainly needs to be implemented at least partially upstream in
raft -- determining when it is safe to stop heartbeating requires inspecting
follower state that is internal to raft.
Upstream raft may not see as much value in incorporating this, as the majority
of its users likely have only a single raft group and would see little benefit.

Additionally, the first request to a range that has quiesced will see a
non-trivial latency penalty, as it will need to start the raft group back up
before it can proceed (though an in-place revival by the leader, discussed
below, could mitigate this).

Killed or partitioned nodes will spuriously wake groups that had quiesced in
their absence, but as the range will still not have seen recent activity, it can
re-quiesce shortly thereafter. Combined with the fact that this should be
somewhat rare in normal operation, the expected cost of spurious wakeups should
be small (and no worse than if the ranges didn't quiesce at all).

# Alternatives
Rather than quiescing the remainder of the current term, one could potentially
call an election with an extra quiesced flag. A follower that votes for that
would not expect heartbeats from that leader for that term. This seems to
involve strictly more modification to raft, and is harder to reason about as it
spans multiple terms and must thus consider multiple leader scenarios.

As a follow-up optimization, the former leader of a quiesced range could
optimistically attempt to revive the quiesced group *in-place*, without starting
a new term, by resuming heartbeats and proposing commands. Any other node
wishing to revive the group would still need to initiate an election for a new
term, as would the former leader if it failed to get a majority of ACKs on the
first command it proposed after waking. This can be done purely as an
optimization at a later time.
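
To make the handshake from the detailed design concrete, the following is a
minimal sketch of the bookkeeping on each side, not an implementation against
any real raft library. All names (`QuiescingLeader`, `Follower`,
`should_heartbeat`, and so on) are hypothetical, and time is modeled as integer
ticks purely for illustration. It shows the two rules described above: the
leader keeps heartbeating until every follower has acknowledged the quiesce
command or the election timeout has elapsed, and a follower that has processed
the command stops expecting heartbeats for that term only.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class QuiescingLeader:
    """Leader-side state for quiescing one raft group (hypothetical sketch)."""

    followers: frozenset       # IDs of all follower replicas
    election_timeout: int      # measured in ticks (an assumption of this sketch)
    term: int
    acked: set = field(default_factory=set)
    ticks_since_quiesce: Optional[int] = None  # None while the group is active

    def issue_quiesce(self) -> None:
        """Mark the quiesce command as sent; heartbeats continue for now."""
        self.acked.clear()
        self.ticks_since_quiesce = 0

    def on_ack(self, follower: int, term: int) -> None:
        """Record an acknowledgement, ignoring stale terms and unknown IDs."""
        if term == self.term and follower in self.followers:
            self.acked.add(follower)

    def tick(self) -> None:
        """Advance the clock by one tick once quiescing has started."""
        if self.ticks_since_quiesce is not None:
            self.ticks_since_quiesce += 1

    def should_heartbeat(self) -> bool:
        """True while the leader must keep sending heartbeats."""
        if self.ticks_since_quiesce is None:
            return True  # active group: heartbeat as usual
        all_acked = self.acked == set(self.followers)
        timed_out = self.ticks_since_quiesce >= self.election_timeout
        return not (all_acked or timed_out)


@dataclass
class Follower:
    """Follower-side state (hypothetical sketch)."""

    term: int
    election_timeout: int
    quiesced_term: Optional[int] = None
    ticks_since_heartbeat: int = 0

    def on_quiesce(self, term: int) -> None:
        """Stop expecting heartbeats, but only for the current term."""
        if term == self.term:
            self.quiesced_term = term

    def on_heartbeat(self) -> None:
        """Reset the election timer when a heartbeat arrives."""
        self.ticks_since_heartbeat = 0

    def tick(self) -> bool:
        """Advance one tick; return True if an election should be called."""
        if self.quiesced_term == self.term:
            return False  # quiesced for this term: silence is expected
        self.ticks_since_heartbeat += 1
        return self.ticks_since_heartbeat >= self.election_timeout
```

Because `quiesced_term` is compared against the current term, a follower that
moves to a new term expects heartbeats again, which matches the rule that a
group is quiesced for only one term.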