- Feature Name: streaming_snapshots
- Status: in-progress
- Start Date: 2016-07-30
- Authors: bdarnell
- RFC PR: [#8151](https://github.com/cockroachdb/cockroach/pull/8151)
- Cockroach Issue: [#7551](https://github.com/cockroachdb/cockroach/issues/7551)

# Summary

This RFC proposes sending raft snapshots via a new streaming protocol,
separate from regular raft messages. This will provide better control
of concurrent snapshots and reduce peak memory usage.

# Motivation

`etcd/raft` transmits snapshots as a single blob, all of which is held
in memory at once (several times over, due to the layers of encoding).
This forces us to limit range sizes to a small fraction of available
memory so that snapshot handling does not exhaust available memory.

Additionally, `etcd/raft` does not give us much control over when
these snapshots are sent, and despite our attempts to limit concurrent
snapshot use (including throttling in the `Snapshot()` method itself
and reservations in the replication queue), it is likely for multiple
snapshots to be sent around the same time, amplifying memory problems.

Finally, our current raft transport protocol is based on asynchronous
messaging, making it difficult for the sender of a message to know
that it has been processed and a new message can be sent.

The changes proposed in this RFC will
- Allow nodes to signal whether or not they are able to accept a
  snapshot before it is sent
- Inform the sender of a snapshot when it has been applied
  successfully (or failed)
- Allow snapshots to be applied in chunks instead of all at once

# Detailed design

These changes will be implemented in two phases. In the first phase,
we introduce the new network protocol and use it to ensure that both
senders and receivers can limit the number of snapshots they are
processing at once. In the second phase we modify the `applySnapshot`
method to be aware of the streaming protocol and process the snapshots
in smaller chunks.

## Network protocol

We introduce a new streaming RPC in the `MultiRaft` GRPC service:

``` protocol-buffer
message SnapshotRequest {
  message Header {
    optional roachpb.RangeDescriptor range_descriptor = 1 [(gogoproto.nullable) = false];

    // The inner raft message is of type MsgSnap, and its snapshot data contains a UUID.
    optional RaftMessageRequest raft_message_request = 2 [(gogoproto.nullable) = false];

    // The estimated size of the range, to be used in reservation decisions.
    optional int64 range_size = 3 [(gogoproto.nullable) = false];

    // can_decline is set on preemptive snapshots, but not those generated
    // by raft because at that point it is better to queue up the stream
    // than to cancel it.
    optional bool can_decline = 4 [(gogoproto.nullable) = false];
  }

  optional Header header = 1;

  // A RocksDB BatchRepr. Multiple kv_batches may be sent across multiple request messages.
  optional bytes kv_batch = 2 [(gogoproto.customname) = "KVBatch"];

  // These are really raftpb.Entry, but we model them as raw bytes to avoid
  // roundtripping through memory. They are separate from the kv_batch to
  // allow flexibility in log implementations.
  repeated bytes log_entries = 3;

  optional bool final = 4 [(gogoproto.nullable) = false];
}

message SnapshotResponse {
  enum Status {
    ACCEPTED = 1;
    APPLIED = 2;
    ERROR = 3;
    DECLINED = 4;
  }
  optional Status status = 1 [(gogoproto.nullable) = false];
  optional string message = 2 [(gogoproto.nullable) = false];
}

service MultiRaft {
  ...
  rpc RaftSnapshot (stream SnapshotRequest) returns (stream SnapshotResponse) {}
}
```

The protocol is inspired by HTTP's `Expect: 100-continue` mechanism.
The sender creates a `RaftSnapshot` stream and sends a
`SnapshotRequest` containing only a `Header` (no other message
includes a `Header`). The recipient may either accept the snapshot by
sending a response with `status=ACCEPTED`, reject the snapshot
permanently (for example, if it has a conflicting range) by sending a
response with `status=ERROR` and closing the stream, or stall the
snapshot temporarily (for example, if it is currently processing too
many other snapshots) by doing nothing and keeping the stream open.
The recipient may make this decision either using the reservation
system or by a separate store-wide counter of pending snapshots.

When the snapshot has been accepted, the sender sends one or more
additional `SnapshotRequests`, each containing KV data and/or log
entries (no log entries are sent before the last KV batch). The last
request will have the `final` flag set. After receiving a `final`
message, the recipient will apply the snapshot. When it is done, it
sends a second response, with `status=APPLIED` or `status=ERROR` and
closes the stream.

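To make the handshake concrete, the sketch below shows how a sender might
drive the `RaftSnapshot` stream in Go. It assumes the gRPC-generated bindings
for the messages above (`MultiRaftClient`, `SnapshotRequest`,
`SnapshotResponse`) and takes pre-chunked KV batches and encoded log entries
as hypothetical inputs; it illustrates the protocol flow rather than the
actual implementation.

``` go
package storage

import (
	"context"
	"fmt"
)

// sendSnapshot drives the sender's half of the RaftSnapshot handshake: a
// header-only request, a wait for the recipient's decision, the data chunks,
// and a final wait for the APPLIED/ERROR result.
func sendSnapshot(
	ctx context.Context,
	client MultiRaftClient,
	header *SnapshotRequest_Header,
	kvBatches [][]byte, // pre-chunked RocksDB BatchReprs
	logEntries [][]byte, // raw raftpb.Entry bytes, sent after the last KV batch
) error {
	stream, err := client.RaftSnapshot(ctx)
	if err != nil {
		return err
	}
	defer stream.CloseSend()

	// Step 1: send the header alone (analogous to Expect: 100-continue) and
	// wait for the recipient to accept, decline, or reject the snapshot.
	if err := stream.Send(&SnapshotRequest{Header: header}); err != nil {
		return err
	}
	resp, err := stream.Recv()
	if err != nil {
		return err
	}
	switch resp.Status {
	case SnapshotResponse_ACCEPTED:
		// Proceed with the data.
	case SnapshotResponse_DECLINED:
		return fmt.Errorf("snapshot declined: %s", resp.Message)
	default:
		return fmt.Errorf("snapshot rejected: %s", resp.Message)
	}

	// Step 2: stream the KV data, then the log entries; the last request
	// carries the final flag.
	for _, b := range kvBatches {
		if err := stream.Send(&SnapshotRequest{KVBatch: b}); err != nil {
			return err
		}
	}
	if err := stream.Send(&SnapshotRequest{LogEntries: logEntries, Final: true}); err != nil {
		return err
	}

	// Step 3: wait for the recipient to report whether the snapshot applied.
	resp, err = stream.Recv()
	if err != nil {
		return err
	}
	if resp.Status != SnapshotResponse_APPLIED {
		return fmt.Errorf("snapshot failed to apply: %s", resp.Message)
	}
	return nil
}
```

Because the sender transmits no data until it sees `ACCEPTED`, a busy
recipient can apply back-pressure simply by withholding its first response,
without buffering any snapshot data.
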
## Sender implementation

When a snapshot is required, a multi-step interaction takes place
between `etcd/raft` and the `Replica`. First, raft calls
`replica.Snapshot()` to generate and encode the snapshot data (along
with some metadata). Second, the `Ready` struct will include an
outgoing `MsgSnap` containing that data and the recipient's ID. Since
the `Snapshot()` call does not say where the snapshot is to be sent,
some indirection is necessary.

`Replica.Snapshot` will generate a UUID and capture a RocksDB
snapshot. The UUID is returned to raft as the contents of the snapshot
(along with the metadata required by raft). The UUID and RocksDB
snapshot are saved in attributes of the `Replica`. In
`sendRaftMessage`, we inspect all outgoing `MsgSnap` messages. If a
message's UUID doesn't match our saved UUID, we discard the message
(this shouldn't happen). If it matches, we begin to send
`SnapshotRequests` as described above; the `MsgSnap` is held to be
sent in the snapshot's `Header`.

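As a rough illustration of this indirection, consider the following Go
sketch. The `Replica` fields and the `snapshotEngine`, `raftSender`,
`snapshotMetadata`, and `streamSnapshot` helpers are placeholders invented
for the example (as is the choice of UUID library); they stand in for the
real `storage` package types rather than reproducing them.

``` go
package storage

import (
	"bytes"

	"github.com/coreos/etcd/raft/raftpb"
	"github.com/google/uuid" // stand-in for whatever UUID package is used
)

// Minimal placeholder interfaces so the sketch is self-contained.
type engineSnapshot interface{ Close() }
type snapshotEngine interface{ NewSnapshot() engineSnapshot }
type raftSender interface{ SendAsync(raftpb.Message) }

// Replica holds the sender-side snapshot state described above
// (hypothetical field names).
type Replica struct {
	engine    snapshotEngine
	transport raftSender

	outSnapUUID []byte         // identifies the pending outgoing snapshot
	outSnap     engineSnapshot // consistent engine snapshot to stream from
}

// Snapshot is called by etcd/raft. Rather than materializing the range data,
// it captures an engine snapshot and hands raft only a UUID as the snapshot
// "data"; the real data is streamed later from sendRaftMessage.
func (r *Replica) Snapshot() (raftpb.Snapshot, error) {
	id := uuid.New()
	r.outSnapUUID = id[:]
	r.outSnap = r.engine.NewSnapshot()
	return raftpb.Snapshot{
		Data:     r.outSnapUUID,
		Metadata: r.snapshotMetadata(), // index/term/conf state required by raft
	}, nil
}

// sendRaftMessage inspects outgoing messages and diverts MsgSnap onto the
// streaming RaftSnapshot RPC; everything else uses the normal raft transport.
func (r *Replica) sendRaftMessage(msg raftpb.Message) {
	if msg.Type != raftpb.MsgSnap {
		r.transport.SendAsync(msg)
		return
	}
	if !bytes.Equal(msg.Snapshot.Data, r.outSnapUUID) {
		return // stale or unknown snapshot; shouldn't happen
	}
	// The MsgSnap itself travels in the stream's Header; the KV data is read
	// from the captured engine snapshot and sent in chunks.
	go r.streamSnapshot(msg, r.outSnap)
}

func (r *Replica) snapshotMetadata() raftpb.SnapshotMetadata {
	return raftpb.SnapshotMetadata{} // elided
}

func (r *Replica) streamSnapshot(msg raftpb.Message, snap engineSnapshot) {
	// Sends SnapshotRequests as described in the network protocol above.
}
```

The important property is that raft only ever sees the UUID; the bulk of the
data stays behind the captured RocksDB snapshot until `sendRaftMessage`
decides where to stream it.
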
## Recipient implementation

Applying snapshots in a streaming fashion introduces some subtleties
around concurrency, so the initial implementation of streaming
snapshots will continue to apply the snapshot as one unit.

### Phase 1

The recipient will accumulate all `SnapshotRequests` in memory, into a
RocksDB `Batch`, keyed by the UUID from the header's `raftpb.Message`.
It sends the header's `MsgSnap` into raft, and if `raft.Ready` returns
a snapshot to be applied with the given UUID (this is not guaranteed),
the buffered snapshot will be committed.

### Phase 2

In phase 2, chunks of the snapshot are applied as they come in,
instead of in a single RocksDB batch. Because this leaves the replica
in a visibly inconsistent state, it cannot be used for anything else
during this process.

In this mode, the `MsgSnap` is sent to raft at the beginning of the
process instead of at the end. If raft tells us to apply the snapshot,
we destroy our existing data to make room for the snapshot. Once we
have done so, we cannot do anything else with this replica (including
sending any raft messages, especially the `MsgAppResp` that raft asks
us to send when it gives us the snapshot) until we have consumed and
applied the entire stream of snapshot data.

Error handling here is tricky: we've already discarded our old data,
so we can't do anything else until we apply a snapshot. If the stream
is closed without sending the final snapshot packet, we must mark the
replica as corrupt.

# Drawbacks

- More complexity
- Phase 2 introduces new sources of replica corruption errors
- More exposure to raft implementation details

# Alternatives

- Upstream changes to the `raft.Storage` interface (to pass the
  recipient ID to the `Snapshot` method) could simplify things a bit
  on the sender side.

# Unresolved questions

- Can this replace the reservation system, or do we need both? It
  should probably be integrated, but they're not quite the same - for
  preemptive snapshots we want to return `DECLINED` so the replication
  queue can pick another target, but raft-generated snapshots cannot
  be declined (since they cannot be retried elsewhere) and instead
  should be queued up until space is available.
- It should be possible to recover from a failed snapshot by receiving
  a new snapshot without marking the replica as corrupt. What would be
  required to make this work?