- Feature Name: streaming_snapshots
- Status: in-progress
- Start Date: 2016-07-30
- Authors: bdarnell
- RFC PR: [#8151](https://github.com/cockroachdb/cockroach/pull/8151)
- Cockroach Issue: [#7551](https://github.com/cockroachdb/cockroach/issues/7551)

# Summary

This RFC proposes sending raft snapshots via a new streaming protocol,
separate from regular raft messages. This will provide better control
of concurrent snapshots and reduce peak memory usage.

# Motivation

`etcd/raft` transmits snapshots as a single blob, all of which is held
in memory at once (several times over, due to the layers of encoding).
This forces us to limit range sizes to a small fraction of available
memory so that snapshot handling does not exhaust available memory.

Additionally, `etcd/raft` does not give us much control over when
these snapshots are sent, and despite our attempts to limit concurrent
snapshot use (including throttling in the `Snapshot()` method itself
and reservations in the replication queue), it is likely for multiple
snapshots to be sent around the same time, amplifying memory problems.

Finally, our current raft transport protocol is based on asynchronous
messaging, making it difficult for the sender of a message to know
that it has been processed and a new message can be sent.

The changes proposed in this RFC will
- Allow nodes to signal whether or not they are able to accept a
  snapshot before it is sent
- Inform the sender of a snapshot when it has been applied
  successfully (or failed)
- Allow snapshots to be applied in chunks instead of all at once

# Detailed design

These changes will be implemented in two phases. In the first phase,
we introduce the new network protocol and use it to ensure that both
senders and receivers can limit the number of snapshots they are
processing at once. In the second phase we modify the `applySnapshot`
method to be aware of the streaming protocol and process the snapshots
in smaller chunks.

## Network protocol

We introduce a new streaming RPC in the `MultiRaft` GRPC service:

``` protocol-buffer
message SnapshotRequest {
  message Header {
    optional roachpb.RangeDescriptor range_descriptor = 1 [(gogoproto.nullable) = false];

    // The inner raft message is of type MsgSnap, and its snapshot data contains a UUID.
    optional RaftMessageRequest raft_message_request = 2 [(gogoproto.nullable) = false];

    // The estimated size of the range, to be used in reservation decisions.
    optional int64 range_size = 3 [(gogoproto.nullable) = false];

    // can_decline is set on preemptive snapshots, but not those generated
    // by raft because at that point it is better to queue up the stream
    // than to cancel it.
    optional bool can_decline = 4 [(gogoproto.nullable) = false];
  }

  optional Header header = 1;

  // A RocksDB BatchRepr. Multiple kv_batches may be sent across multiple request messages.
  optional bytes kv_batch = 2 [(gogoproto.customname) = "KVBatch"];

  // These are really raftpb.Entry, but we model them as raw bytes to avoid
  // roundtripping through memory. They are separate from the kv_batch to
  // allow flexibility in log implementations.
  repeated bytes log_entries = 3;

  optional bool final = 4 [(gogoproto.nullable) = false];
}

message SnapshotResponse {
  enum Status {
    ACCEPTED = 1;
    APPLIED = 2;
    ERROR = 3;
    DECLINED = 4;
  }
  optional Status status = 1 [(gogoproto.nullable) = false];
  optional string message = 2 [(gogoproto.nullable) = false];
}

service MultiRaft {
  ...
  rpc RaftSnapshot (stream SnapshotRequest) returns (stream SnapshotResponse) {}
}
```

The protocol is inspired by HTTP's `Expect: 100-continue` mechanism.
The sender creates a `RaftSnapshot` stream and sends a
`SnapshotRequest` containing only a `Header` (no other message
includes a `Header`). The recipient may either accept the snapshot by
sending a response with `status=ACCEPTED`, reject the snapshot
permanently (for example, if it has a conflicting range) by sending a
response with `status=ERROR` and closing the stream, or stall the
snapshot temporarily (for example, if it is currently processing too
many other snapshots) by doing nothing and keeping the stream open.
The recipient may make this decision either using the reservation
system or by a separate store-wide counter of pending snapshots.

When the snapshot has been accepted, the sender sends one or more
additional `SnapshotRequests`, each containing KV data and/or log
entries (no log entries are sent before the last KV batch). The last
request will have the `final` flag set. After receiving a `final`
message, the recipient will apply the snapshot. When it is done, it
sends a second response, with `status=APPLIED` or `status=ERROR` and
closes the stream.

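To make the handshake concrete, the sketch below shows how a sender might
drive the `RaftSnapshot` stream in Go. It assumes the gRPC-generated bindings
for the messages above (`MultiRaftClient`, `SnapshotRequest`,
`SnapshotResponse`) and takes pre-chunked KV batches and encoded log entries
as hypothetical inputs; it illustrates the protocol flow rather than the
actual implementation.

``` go
package storage

import (
	"context"
	"fmt"
)

// sendSnapshot drives the sender's half of the RaftSnapshot handshake: a
// header-only request, a wait for the recipient's decision, the data chunks,
// and a final wait for the APPLIED/ERROR result.
func sendSnapshot(
	ctx context.Context,
	client MultiRaftClient,
	header *SnapshotRequest_Header,
	kvBatches [][]byte, // pre-chunked RocksDB BatchReprs
	logEntries [][]byte, // raw raftpb.Entry bytes, sent after the last KV batch
) error {
	stream, err := client.RaftSnapshot(ctx)
	if err != nil {
		return err
	}
	defer stream.CloseSend()

	// Step 1: send the header alone (analogous to Expect: 100-continue) and
	// wait for the recipient to accept, decline, or reject the snapshot.
	if err := stream.Send(&SnapshotRequest{Header: header}); err != nil {
		return err
	}
	resp, err := stream.Recv()
	if err != nil {
		return err
	}
	switch resp.Status {
	case SnapshotResponse_ACCEPTED:
		// Proceed with the data.
	case SnapshotResponse_DECLINED:
		return fmt.Errorf("snapshot declined: %s", resp.Message)
	default:
		return fmt.Errorf("snapshot rejected: %s", resp.Message)
	}

	// Step 2: stream the KV data, then the log entries; the last request
	// carries the final flag.
	for _, b := range kvBatches {
		if err := stream.Send(&SnapshotRequest{KVBatch: b}); err != nil {
			return err
		}
	}
	if err := stream.Send(&SnapshotRequest{LogEntries: logEntries, Final: true}); err != nil {
		return err
	}

	// Step 3: wait for the recipient to report whether the snapshot applied.
	resp, err = stream.Recv()
	if err != nil {
		return err
	}
	if resp.Status != SnapshotResponse_APPLIED {
		return fmt.Errorf("snapshot failed to apply: %s", resp.Message)
	}
	return nil
}
```

Because the sender transmits no data until it sees `ACCEPTED`, a busy
recipient can apply back-pressure simply by withholding its first response,
without buffering any snapshot data.
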
## Sender implementation

When a snapshot is required, a multi-step interaction takes place
between `etcd/raft` and the `Replica`. First, raft calls
`replica.Snapshot()` to generate and encode the snapshot data (along
with some metadata). Second, the `Ready` struct will include an
outgoing `MsgSnap` containing that data and the recipient's ID. Since
the `Snapshot()` call does not say where the snapshot is to be sent,
some indirection is necessary.

`Replica.Snapshot` will generate a UUID and capture a RocksDB
snapshot. The UUID is returned to raft as the contents of the snapshot
(along with the metadata required by raft). The UUID and RocksDB
snapshot are saved in attributes of the `Replica`. In
`sendRaftMessage`, we inspect all outgoing `MsgSnap` messages. If a
message's UUID doesn't match our saved UUID, we discard the message
(this shouldn't happen). If it matches, we begin to send
`SnapshotRequests` as described above; the `MsgSnap` is held to be
sent in the snapshot's `Header`.

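As a rough illustration of this indirection, consider the following Go
sketch. The `Replica` fields and the `snapshotEngine`, `raftSender`,
`snapshotMetadata`, and `streamSnapshot` helpers are placeholders invented
for the example (as is the choice of UUID library); they stand in for the
real `storage` package types rather than reproducing them.

``` go
package storage

import (
	"bytes"

	"github.com/coreos/etcd/raft/raftpb"
	"github.com/google/uuid" // stand-in for whatever UUID package is used
)

// Minimal placeholder interfaces so the sketch is self-contained.
type engineSnapshot interface{ Close() }
type snapshotEngine interface{ NewSnapshot() engineSnapshot }
type raftSender interface{ SendAsync(raftpb.Message) }

// Replica holds the sender-side snapshot state described above
// (hypothetical field names).
type Replica struct {
	engine    snapshotEngine
	transport raftSender

	outSnapUUID []byte         // identifies the pending outgoing snapshot
	outSnap     engineSnapshot // consistent engine snapshot to stream from
}

// Snapshot is called by etcd/raft. Rather than materializing the range data,
// it captures an engine snapshot and hands raft only a UUID as the snapshot
// "data"; the real data is streamed later from sendRaftMessage.
func (r *Replica) Snapshot() (raftpb.Snapshot, error) {
	id := uuid.New()
	r.outSnapUUID = id[:]
	r.outSnap = r.engine.NewSnapshot()
	return raftpb.Snapshot{
		Data:     r.outSnapUUID,
		Metadata: r.snapshotMetadata(), // index/term/conf state required by raft
	}, nil
}

// sendRaftMessage inspects outgoing messages and diverts MsgSnap onto the
// streaming RaftSnapshot RPC; everything else uses the normal raft transport.
func (r *Replica) sendRaftMessage(msg raftpb.Message) {
	if msg.Type != raftpb.MsgSnap {
		r.transport.SendAsync(msg)
		return
	}
	if !bytes.Equal(msg.Snapshot.Data, r.outSnapUUID) {
		return // stale or unknown snapshot; shouldn't happen
	}
	// The MsgSnap itself travels in the stream's Header; the KV data is read
	// from the captured engine snapshot and sent in chunks.
	go r.streamSnapshot(msg, r.outSnap)
}

func (r *Replica) snapshotMetadata() raftpb.SnapshotMetadata {
	return raftpb.SnapshotMetadata{} // elided
}

func (r *Replica) streamSnapshot(msg raftpb.Message, snap engineSnapshot) {
	// Sends SnapshotRequests as described in the network protocol above.
}
```

The important property is that raft only ever sees the UUID; the bulk of the
data stays behind the captured RocksDB snapshot until `sendRaftMessage`
decides where to stream it.
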
## Recipient implementation

Applying snapshots in a streaming fashion introduces some subtleties
around concurrency, so the initial implementation of streaming
snapshots will continue to apply the snapshot as one unit.

### Phase 1

The recipient will accumulate all `SnapshotRequests` in memory, into a
RocksDB `Batch`, keyed by the UUID from the header's `raftpb.Message`.
It sends the header's `MsgSnap` into raft, and if `raft.Ready` returns
a snapshot to be applied with the given UUID (this is not guaranteed),
the buffered snapshot will be committed.

### Phase 2

In phase 2, chunks of the snapshot are applied as they come in,
instead of in a single RocksDB batch. Because this leaves the replica
in a visibly inconsistent state, it cannot be used for anything else
during this process.

In this mode, the `MsgSnap` is sent to raft at the beginning of the
process instead of at the end. If raft tells us to apply the snapshot,
we destroy our existing data to make room for the snapshot. Once we
have done so, we cannot do anything else with this replica (including
sending any raft messages, especially the `MsgAppResp` that raft asks
us to send when it gives us the snapshot) until we have consumed and
applied the entire stream of snapshot data.

Error handling here is tricky: we've already discarded our old data,
so we can't do anything else until we apply a snapshot. If the stream
is closed without sending the final snapshot packet, we must mark the
replica as corrupt.

# Drawbacks

- More complexity
- Phase 2 introduces new sources of replica corruption errors
- More exposure to raft implementation details

# Alternatives

- Upstream changes to the `raft.Storage` interface (to pass the
  recipient ID to the `Snapshot` method) could simplify things a bit
  on the sender side.

# Unresolved questions

- Can this replace the reservation system, or do we need both? It
  should probably be integrated, but they're not quite the same - for
  preemptive snapshots we want to return `DECLINED` so the replication
  queue can pick another target, but raft-generated snapshots cannot
  be declined (since they cannot be retried elsewhere) and instead
  should be queued up until space is available.
- It should be possible to recover from a failed snapshot by receiving
  a new snapshot without marking the replica as corrupt. What would be
  required to make this work?