# Raft snapshots and why you see them when you oughtn't

Original author: Tobias Grieger

# Introduction

Each and every Range in CockroachDB is fundamentally a *Raft* group
(with a lot of garnish around it). Raft is a consensus protocol that
provides a distributed state machine being fed an ordered log of
commands, the *Raft log*. Raft is a leader-based protocol that keeps
leadership stable, which means that typically one of the members of
the group is the designated leader and is in charge of distributing
the log to followers and marking log entries as committed.

In typical operation (read: “practically always”) the process that
plays out with each write to a Range is that each of the members of
the Raft group receives new entries (corresponding to some digested
form of the write request) from the leader, which it acknowledges. A
little later, the leader sends an updated *Commit index* instructing the
followers to apply the current (and any preceding) writes to their
state, in the order given by the log.

A *Raft snapshot* is necessary when the leader does not have the log
entries a follower needs and thus needs to send a full initial
state. This happens in two scenarios:

1. Log truncation. It’s not economical to store all of the Raft log in
   perpetuity. For example, imagine a workload that updates a single
   key-value pair over and over (no MVCC). The Raft log’s growth is
   unbounded, yet the size of the state is constant.

   When all followers have applied a command, a prefix of the log up
   to that command can be discarded. But followers can be unavailable
   for extended periods of time; it may not always be wise to wait for
   *all* followers to have caught up, especially if the Raft log has
   already grown very large. If we don’t wait and an affected follower
   wants to catch up, it has to do so through a snapshot.

2. A configuration change adds a new follower to the Raft group. For
   example, a three member group gets turned into a four member
   group. Naively, the new member starts with a clean slate, and would
   need all of the Raft log. The details are a bit in the weeds, but
   even if the log had never been truncated (see 1.) we’d still need a
   snapshot, for our initial state is not equal to the empty state
   (i.e. if a replica applies all of the log on top of an empty state,
   it wouldn’t end up in the correct state).

# Terminology

We have already introduced *Raft* and take it that you know what a *Range*
is. Here is some more terminology and notation which is also helpful
to decipher the logs should you ever have to.

**Replica:** a member of a Range, represented at runtime by a
`*storage.Replica` object (housed by a `*storage.Store`). A Range
typically has an odd number of Replicas, and practically always three
or five.

**Replica ID:** A Range is identified by a Range ID, but Raft doesn’t even
know which Range it is handling (we do all of that routing in our
transport layer). All Raft knows is the members of the group, which
are identified by *Replica IDs*. For each range, the initial Replica IDs
are 1, 2, 3, and each newly added Replica is assigned the next higher
(and never previously used) one [1].
For example, if a Range has Replicas
with IDs 1, 2, 4, then you can deduce that member 3 was at some point
removed from the range and a new replica 4 added (in some unknown
order).

[1] The replica ID counter is stored in the range descriptor.

The notation you would find in the logs for a member with Replica ID 3
of a Range with ID 17 is `r17/3`.

**Raft snapshot:** a snapshot that is requested internally by Raft. This
is different from a snapshot that CockroachDB sends preemptively, see
below.

**Raft snapshot queue:** a Store-level queue that sends the snapshots
requested by Raft. Prior to the introduction of the queue, Raft
snapshots would often bring a cluster to its knees by overloading the
system. The queue makes sure there aren’t ever too many Raft snapshots
in flight at the same time.

**Snapshot reservation:** A Store-level mechanism that sits on the receive
path of a snapshot, to make sure we’re not accepting too many
(preemptive or Raft) snapshots at once (which could overload the
node).

**Replica Change (or Replication Change):** the process of adding or
removing a Replica to/from a Range. An addition usually involves a:

**Preemptive snapshot:** a snapshot we send before adding a replica to the
group. The idea is that we’ll opportunistically place state on the
store that will house the replica once it is added, so that it can
catch up from the Raft log instead of needing a Raft snapshot. This
minimizes the time during which the follower is part of the commit
quorum but unable to participate in said quorum.

A preemptive snapshot is rate limited (at the time of writing to 2 MB/s)
and creates a “placeholder” Replica without an associated Raft group
(and with Replica ID zero) that is typically short-lived (as it either
turns into a full Replica, or gets deleted). See:

**Replica GC:** the process of deleting Replicas that are not members
of their Raft group any more. Determining whether a Replica is GC’able
is surprisingly tricky.

**Quota pool:** A flow control mechanism that prevents “faster” replicas
from leaving a slower (but “functional”, for example on a slower
network connection) Replica behind. Without a quota pool, the Raft
group would end up in a weakened state where losing one of the
up-to-date replicas incurs an unavailability until the straggler has
caught up. Letting a follower lag far behind can also trigger periodic
Raft log truncations that cut it off, which then requires regular Raft
snapshots.

**Quorum:** strictly more than half of the number of Replicas in a
Range. For example, with three Replicas, quorum is two. With four
Replicas, quorum is three. With five Replicas, quorum is three.

**Log truncation:** the process of truncating the Raft log. The Raft
leader periodically computes a truncation index and sends out a Raft
command that instructs replicas to delete the prefix of their Raft
logs up to that index. These operations are triggered by the Raft log
queue.

**Range Split (or just Split):** an operation that splits a Range into
two Ranges by shrinking a Range (from the right) and creating a new
one (with members on the same stores) from the remainder. For example,
r17 with original bounds [a,z) might split into r17 with bounds [a,c)
and r18 with bounds [c,z). The right hand side is initialized with the
data from the (original) left hand side Replica.
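Two of the definitions above are easy to pin down in a toy sketch; the
`quorumSize` and `split` helpers and the `span` type below are invented
for illustration and bear no relation to CockroachDB’s actual
descriptor code:

```go
package main

import "fmt"

// quorumSize is strictly more than half of n, i.e. floor(n/2)+1.
// For 3, 4, and 5 replicas this is 2, 3, and 3.
func quorumSize(n int) int { return n/2 + 1 }

// span is a simplified stand-in for a range descriptor's key bounds.
type span struct {
	start, end string // keys in [start, end)
}

// split shrinks the left-hand side "from the right" and creates the
// right-hand side from the remainder; no user data moves between stores.
func split(lhs span, splitKey string) (span, span) {
	rhs := span{start: splitKey, end: lhs.end}
	lhs.end = splitKey
	return lhs, rhs
}

func main() {
	// The example from the text: r17 [a,z) splits into [a,c) and [c,z).
	lhs, rhs := split(span{start: "a", end: "z"}, "c")
	fmt.Println(lhs, rhs, quorumSize(5)) // {a c} {c z} 3
}
```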
**Range Merge (or just Merge):** the reverse of a Split, but more
complex to implement than a Split for a variety of reasons, the most
relevant to this document being that it needs the Replicas of the two
Ranges to be colocated (i.e. with replicas on the same stores).

We’ll discuss the interplay between the various mechanisms above in
more detail, but this is a good moment to reflect on the title of this
document. Raft snapshots are expensive in that they transfer loads of
data around; it’s typically the right choice to avoid them. When could
they pop up outside of failure situations?

1. We control replication changes and have preemptive snapshots (which
   still move the same amount of data, but can happen at a leisurely
   pace without running a vulnerable Raft group); no Raft snapshots
   are involved from the looks of it.

2. We control Raft log truncation; if followers are online, the quota
   pool makes sure that the followers are always at “comparable” log
   positions, so a (significant) prefix of the log that can be
   truncated without causing a Raft snapshot should always be
   available.

3. A split shouldn’t require a Raft snapshot: before the split, all
   the data is on whichever stores the replicas are on. After the
   split, the data is in the same place, just kept across two ranges -
   a bookkeeping change.

4. Merges also shouldn’t need Raft snapshots. They’re doing replica
   changes first (see above) followed by a bookkeeping change that
   doesn’t move data any more.

It seems that the only “reasonable” situation in which you’d expect
Raft snapshots is if a node goes offline for some time and the Raft
log grows so much that we’d prefer a snapshot for catching up the node
once it comes back.

Pop quiz: how many Raft snapshots would you expect if you started a
fresh 15 node cluster on a 2.1.2-alpha SHA from November 2018, and
imported a significant amount (TBs) of data into it?

*(graph: Raft snapshots queued over the course of such an import)*

I bet ~15,000 wasn’t exactly what you were thinking of, and yet this
was a problem that was long overlooked and just recently addressed by
means of mitigation rather than resolution of root causes. It’s not
just expensive to have that many snapshots: you can see that thousands
of snapshots are requested, and the graph isn’t exactly in a hurry to
dip back towards zero. This is in part because the Raft snapshot queue
had (and has) efficiency concerns, but fundamentally consider that
each queued snapshot might correspond to transferring ~64 MB of data
over the network (closer to ~32 MB in practice). That’s on the order
of hundreds of gigabytes (15,000 × ~32 MB ≈ 480 GB), i.e. it’ll take
time no matter how hard we try.

At the end of this document, you will have an understanding of how
this graph came to be, how it will look today (spoiler: mostly a zero
flatline), and what the root causes are.

# Fantastic Raft snapshots and where to find them

We will now go into the details and discover how we might see Raft
snapshots and how they might proliferate as seen in the above
graph. In doing so, we’ll discover a “meta root cause” which is “poor
interplay between mechanisms that have been mentioned above”. To
motivate this, consider the following conflicts of interest:

- Log truncation wants to keep the log very short (this is also a
  performance concern).
  Snapshots want their starting index to not be
  truncated away (or they’re useless).
- ReplicaGC wants to remove old data as fast as possible. Replication
  changes want their preemptive snapshots kept around until they have
  completed.

It’s a good warm-up exercise to go through why these conflicts would
cause a snapshot.

## Problem 1: Truncation cuts off an in-flight snapshot

Say a replication change sends a preemptive snapshot at log
index 123. Some writes arrive at the range; the Raft log queue checks
the state of the group, sees three up-to-date followers, and issues a
truncation to index 127. The preemptive snapshot arrives and is
applied. The replication change completes and a new Replica is formed
from the preemptive snapshot. It will immediately require a Raft
snapshot because it needs index 124 next, but the leader only has 127
and higher.

**Solution** (already in place): track the index of in-flight snapshots
and don’t truncate away the entries they still need. Similar mechanisms
make sure that we don’t cut off followers that are lagging behind
slightly, unless we really really think it’s the right thing to do (a
rough sketch of this check follows after Problem 3).

## Problem 2: Preemptive snapshots accidentally replicaGC’ed

Again a preemptive snapshot arrives at a Store. While the replication
change hasn’t completed, this preemptive Replica is gc’able and may be
removed if picked up by the replica GC queue. If this happens and the
replication change completes, the newly added follower will have a
clean slate and will need a Raft snapshot to recover, putting the
Range in a vulnerable state until it has done so.

**Solution** (unimplemented): delay the GC if we know that the replica
change is still ongoing. Note that this is more difficult than it
sounds. The “driver” of the replica change is another node, and for
all we know it might’ve aborted the change (in which case the snapshot
needs to be deleted ASAP). Simply delaying GC of preemptive snapshots
via a timeout mechanism is tempting, but can backfire: a “wide”
preemptive snapshot may be kept around erroneously and can block
“smaller” snapshots that are now sent after the Range has split into
smaller parts. We can detect this situation, but perhaps it is simpler
to add a direct RPC from the “driver” to the preemptive snapshot to
keep it alive for the right amount of time.

These two were relatively straightforward. If we throw splits into the
mix, things get a lot worse.

## Problem 3: Splits can exponentiate the effect of Problem 2

Consider the situation in Problem 2, but between the preemptive
snapshot being applied and the replication change succeeding, the
range is split 100 times (splitting small ranges is fast and this
happens during IMPORT/RESTORE). The preemptive snapshot is removed, so
the Replica will receive a new snapshot. This snapshot is likely to
post-date all of the splits, so it will be much “narrower” than
before. In applying it, everything outside of its bounds is deleted,
and that’s the data for the 100 new ranges that we will now need Raft
snapshots for.

**Solution:** This is not an issue if Problem 2 is solved, but see
Problem 5 for a related problem.
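Problems 1 through 3 share a shape: the truncation and GC mechanisms
need to be aware of snapshots and replication changes that are still
in flight. As promised above, here is a rough sketch of the
truncation-side check from Solution 1; the names (`truncationIndex`,
`pendingSnapshotIndex`, and so on) are invented for illustration, and
the real Raft log queue’s heuristics are more involved (for example,
it will cut off followers that have fallen too far behind):

```go
package main

import "fmt"

// truncationIndex sketches the idea in Solution 1: pick a truncation
// index that never cuts off an in-flight snapshot, and (in this
// simplified version) never cuts off a follower either.
func truncationIndex(
	appliedIndex uint64, // what the leader has applied
	followerMatchIndexes []uint64, // highest index known durable on each follower
	pendingSnapshotIndex uint64, // index an in-flight snapshot was sent at, 0 if none
) uint64 {
	idx := appliedIndex
	for _, m := range followerMatchIndexes {
		if m < idx {
			idx = m // keep the entries this follower still needs
		}
	}
	if pendingSnapshotIndex != 0 && pendingSnapshotIndex < idx {
		idx = pendingSnapshotIndex // Problem 1: the snapshot's recipient needs index+1 onwards
	}
	return idx
}

func main() {
	// The Problem 1 scenario: followers are at 127, but a preemptive
	// snapshot was sent at index 123, so we must not truncate past 123.
	fmt.Println(truncationIndex(127, []uint64{127, 127}, 123)) // 123
}
```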
## Problem 4: The split trigger can cause Raft snapshots

This needs a bit of detail about how splits are implemented. A split
is essentially a transaction with a special “split trigger” attached
to its commit. Upon applying the commit, each replica carries out the
logical split into two ranges as described by the split trigger,
culminating in the creation of a new Replica (the right-hand side) and
its initialization with data (inherited from the left-hand side). This
is conceptually easy, but things always get more complicated in a
distributed system. Note that we have multiple replicas that may be
applying the split trigger at different times. What if one replica
took, say, a second longer to start applying it? All replicas
except for the straggler will already have initialized their new
right-hand sides, and those right-hand sides are configured to talk to
the right-hand side on our straggler - which hasn’t executed the split
trigger yet! The result is that upon receiving messages addressed to
this not-yet-existent replica, the straggler creates it as a completely
empty replica. This replica will promptly [2] tell the leader that it
needs a snapshot. In all likelihood, the split trigger is going to
execute very soon and render the snapshot useless, but the damage has
been done.

[2] The replicas that have already processed the split may form a
quorum and start accepting requests. These requests will also be
routed to the node that hasn't processed the split yet. To process
those requests, the slower node needs a snapshot.

**Solution:** what’s implemented today is a mechanism that simply
delays snapshots to “blank” replicas for multiple seconds. This works,
but there are situations in which the “blank” replica really needs the
Raft snapshot (see Problems 3, 5, and 6). The idea is to replace this
with a mechanism that can detect whether there’s an outstanding split
trigger that will obviate the snapshot.

## Problem 5: Replica removal after a split can cause Raft snapshots

This is Problem 4, but with a scenario that really needs the Raft
snapshot. Imagine a range splits into two and immediately afterwards,
a replication change (on the original range, i.e. the left-hand side)
removes one of the followers. This can lead to that follower Replica
being replica GC’ed before it has applied the split trigger, and so
the data removed encompasses what should become the right-hand side
(which is still configured to have a replica on that store). A Raft
snapshot is necessary, and note in particular that the timeout-based
mechanism in Solution 4 has to time out first, accidentally delaying
this snapshot.

**Solution:** See the end of Solution 4 to avoid the delayed snapshot. To
avoid the need for a snapshot in the first place, avoid (or delay)
removing replicas that are pending application of a split trigger, or
do the opposite and delay splits when a replication change is
ongoing. Neither is implemented to date. Both options are problematic
as they can intertwine the split and replicate queues, which already
have hidden dependencies.
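To make the interaction between Solution 4 and Problem 5 concrete,
here is a minimal sketch of the timeout-based delay; the helper name
`shouldDelayRaftSnapshot` and the grace period are invented for
illustration, and the real check is more nuanced:

```go
package main

import (
	"fmt"
	"time"
)

// shouldDelayRaftSnapshot sketches the mitigation from Solution 4: a
// recently created, uninitialized ("blank") replica is probably about
// to be filled in by a pending split trigger, so its Raft snapshot is
// held back for a grace period instead of being queued immediately.
func shouldDelayRaftSnapshot(initialized bool, createdAt, now time.Time, grace time.Duration) bool {
	return !initialized && now.Sub(createdAt) < grace
}

func main() {
	created := time.Now()
	// Problem 4: the blank replica's snapshot is delayed, and usually the
	// split trigger arrives in time and no snapshot is needed at all.
	fmt.Println(shouldDelayRaftSnapshot(false, created, created.Add(1*time.Second), 5*time.Second)) // true
	// Problem 5: the split trigger will never arrive, yet the snapshot the
	// replica genuinely needs is still held back until the grace period ends.
	fmt.Println(shouldDelayRaftSnapshot(false, created, created.Add(10*time.Second), 5*time.Second)) // false
}
```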
## Problem 6: Splitting a range that needs snapshots leads to two such ranges

If a range which has a follower that needs a snapshot (for whatever
reason) is split, the result will be two Raft snapshots required to
catch up the left-hand side and right-hand side of the follower,
respectively. You might argue that these snapshots will be smaller and
add up to the original size and so that’s ok, but this isn’t how it
works, owing to various deficiencies of the Raft snapshot queue.

This is particularly bad since we also know from Problems 4 and 5 that
we sometimes accidentally delay the Raft snapshot for seconds (before
we even queue it), leaving ample time for splits to create more
snapshots. To add insult to injury, all new ranges created in this
state are also false positives for the timeout mechanism in Problem 4.

Note that this is similar to Problem 3 in some sense, but it can’t be
detected by the split.

**Solution:** Splits typically happen on the Raft leader and can thus
peek into the Raft state and detect pending snapshots. A mechanism to
delay the split in such cases was implemented, though it doesn’t apply
to splits carried out through the “split queue” (as that queue does
not react well to being throttled). The main scenario in which this
problem was observed is IMPORT/RESTORE, which carries out rapid splits
and replication changes.

## Commonalities

All of these phenomena are in principle observable by running the
`restore2TB` and `import/tpch/nodes=X` roachtests, though at this point
all that’s left of the problem is a small spike in Raft snapshots at
the beginning of the test (the likely root cause being Problem 3),
which used to be exponentiated by Problem 6 (before it was fixed).

# Summary

Raft snapshots shouldn’t be necessary in any of the situations
described above, and considerable progress has been made towards
making that a reality. The mitigation of Solution 6 is powerful and
guards against the occasional Raft snapshot becoming a problem, but it
is nevertheless worth going the extra mile to add “deep stability” to
this property: a zero rate of Raft snapshots is a good indicator of
cluster health for production and stability testing. Trying to achieve
it has already led to the discovery of a number of surprising
interactions and buglets across the core code base, and not achieving
it will make it hard to end-to-end test this property.