- Feature Name: streaming_snapshots
- Status: in-progress
- Start Date: 2016-07-30
- Authors: bdarnell
- RFC PR: [#8151](https://github.com/cockroachdb/cockroach/pull/8151)
- Cockroach Issue: [#7551](https://github.com/cockroachdb/cockroach/issues/7551)

# Summary

This RFC proposes sending raft snapshots via a new streaming protocol,
separate from regular raft messages. This will provide better control
of concurrent snapshots and reduce peak memory usage.

# Motivation

`etcd/raft` transmits snapshots as a single blob, all of which is held
in memory at once (several times over, due to the layers of encoding).
This forces us to limit range sizes to a small fraction of available
memory so that snapshot handling cannot exhaust it.

Additionally, `etcd/raft` does not give us much control over when
these snapshots are sent, and despite our attempts to limit concurrent
snapshot use (including throttling in the `Snapshot()` method itself
and reservations in the replication queue), multiple snapshots are
likely to be sent around the same time, amplifying the memory problems.

Finally, our current raft transport protocol is based on asynchronous
messaging, making it difficult for the sender of a message to know
when it has been processed and a new message can be sent.

The changes proposed in this RFC will:
- Allow nodes to signal whether or not they are able to accept a
  snapshot before it is sent
- Inform the sender of a snapshot when it has been applied
  successfully (or has failed)
- Allow snapshots to be applied in chunks instead of all at once

# Detailed design

These changes will be implemented in two phases. In the first phase,
we introduce the new network protocol and use it to ensure that both
senders and receivers can limit the number of snapshots they are
processing at once. In the second phase we modify the `applySnapshot`
method to be aware of the streaming protocol and process the snapshots
in smaller chunks.

## Network protocol

We introduce a new streaming RPC in the `MultiRaft` GRPC service:

``` protocol-buffer
message SnapshotRequest {
  message Header {
    optional roachpb.RangeDescriptor range_descriptor = 1 [(gogoproto.nullable) = false];

    // The inner raft message is of type MsgSnap, and its snapshot data contains a UUID.
    optional RaftMessageRequest raft_message_request = 2 [(gogoproto.nullable) = false];

    // The estimated size of the range, to be used in reservation decisions.
    optional int64 range_size = 3 [(gogoproto.nullable) = false];

    // can_decline is set on preemptive snapshots, but not those generated
    // by raft because at that point it is better to queue up the stream
    // than to cancel it.
    optional bool can_decline = 4 [(gogoproto.nullable) = false];
  }

  optional Header header = 1;

  // A RocksDB BatchRepr. Multiple kv_batches may be sent across multiple request messages.
  optional bytes kv_batch = 2 [(gogoproto.customname) = "KVBatch"];

  // These are really raftpb.Entry, but we model them as raw bytes to avoid
  // roundtripping through memory. They are separate from the kv_batch to
  // allow flexibility in log implementations.
  repeated bytes log_entries = 3;

  optional bool final = 4 [(gogoproto.nullable) = false];
}

message SnapshotResponse {
  enum Status {
    ACCEPTED = 1;
    APPLIED = 2;
    ERROR = 3;
    DECLINED = 4;
  }
  optional Status status = 1 [(gogoproto.nullable) = false];
  optional string message = 2 [(gogoproto.nullable) = false];
}

service MultiRaft {
  ...
  rpc RaftSnapshot (stream SnapshotRequest) returns (stream SnapshotResponse) {}
}
```

The protocol is inspired by HTTP's `Expect: 100-continue` mechanism.
The sender creates a `RaftSnapshot` stream and sends a
`SnapshotRequest` containing only a `Header` (no other message
includes a `Header`). The recipient may either accept the snapshot by
sending a response with `status=ACCEPTED`, reject the snapshot
permanently (for example, if it has a conflicting range) by sending a
response with `status=ERROR` and closing the stream, or stall the
snapshot temporarily (for example, if it is currently processing too
many other snapshots) by doing nothing and keeping the stream open.
The recipient may make this decision either through the reservation
system or through a separate store-wide counter of pending snapshots.

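A minimal sketch of the recipient's half of this handshake, written
against the Go bindings that would be generated from the proto above;
the function signature, the semaphore, and the overlap check are
illustrative stand-ins for the reservation system or store-wide
counter, not a final design.

``` go
// handleSnapshotHeader sketches the recipient's decision after reading the
// header-only SnapshotRequest. MultiRaft_RaftSnapshotServer, SnapshotRequest,
// and SnapshotResponse are the gRPC-generated types; snapshotSem and
// checkOverlap are hypothetical.
func handleSnapshotHeader(
	header *SnapshotRequest_Header,
	stream MultiRaft_RaftSnapshotServer,
	snapshotSem chan struct{}, // store-wide limit on in-flight snapshots
	checkOverlap func(roachpb.RangeDescriptor) error,
) error {
	// Permanent rejection, e.g. a conflicting range already exists on this store.
	if err := checkOverlap(header.RangeDescriptor); err != nil {
		return stream.Send(&SnapshotResponse{
			Status:  SnapshotResponse_ERROR,
			Message: err.Error(),
		})
	}
	// Temporary stall: keep the stream open without responding until a
	// snapshot slot frees up. A preemptive snapshot with can_decline set
	// could instead be answered with SnapshotResponse_DECLINED here.
	select {
	case snapshotSem <- struct{}{}: // released once the snapshot is applied or the stream fails
	case <-stream.Context().Done():
		return stream.Context().Err()
	}
	return stream.Send(&SnapshotResponse{Status: SnapshotResponse_ACCEPTED})
}
```
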
When the snapshot has been accepted, the sender sends one or more
additional `SnapshotRequests`, each containing KV data and/or log
entries (no log entries are sent before the last KV batch). The last
request will have the `final` flag set. After receiving a `final`
message, the recipient will apply the snapshot. When it is done, it
sends a second response with `status=APPLIED` or `status=ERROR`, and
closes the stream.

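And a corresponding sketch of the sender's side of the exchange,
again against the generated client stream. The `sendSnapshot` helper
and its arguments are illustrative; how the KV batches and log
entries are produced is out of scope here, so they are simply passed
in.

``` go
// sendSnapshot sketches the sender's use of the RaftSnapshot stream: header
// first, wait for the 100-continue style reply, then the data, then wait for
// the result of applying the snapshot.
func sendSnapshot(
	stream MultiRaft_RaftSnapshotClient,
	header *SnapshotRequest_Header,
	kvBatches [][]byte, // RocksDB BatchReprs covering the range's data
	logEntries [][]byte, // encoded raftpb.Entry, sent after the last KV batch
) error {
	// Header-only request; nothing else is sent until the recipient accepts.
	if err := stream.Send(&SnapshotRequest{Header: header}); err != nil {
		return err
	}
	resp, err := stream.Recv()
	if err != nil {
		return err
	}
	switch resp.Status {
	case SnapshotResponse_ACCEPTED:
		// Fall through and stream the snapshot data.
	case SnapshotResponse_DECLINED:
		return fmt.Errorf("snapshot declined: %s", resp.Message)
	default:
		return fmt.Errorf("snapshot rejected: %s", resp.Message)
	}
	for _, b := range kvBatches {
		if err := stream.Send(&SnapshotRequest{KVBatch: b}); err != nil {
			return err
		}
	}
	// Log entries ride in the last request, which carries the final flag.
	if err := stream.Send(&SnapshotRequest{LogEntries: logEntries, Final: true}); err != nil {
		return err
	}
	if err := stream.CloseSend(); err != nil {
		return err
	}
	// Second response: the recipient has applied the snapshot (or failed to).
	if resp, err = stream.Recv(); err != nil {
		return err
	}
	if resp.Status != SnapshotResponse_APPLIED {
		return fmt.Errorf("snapshot failed to apply: %s", resp.Message)
	}
	return nil
}
```
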
## Sender implementation

When a snapshot is required, a multi-step interaction takes place
between `etcd/raft` and the `Replica`. First, raft calls
`replica.Snapshot()` to generate and encode the snapshot data (along
with some metadata). Second, the `Ready` struct will include an
outgoing `MsgSnap` containing that data and the recipient's ID. Since
the `Snapshot()` call does not say where the snapshot is to be sent,
some indirection is necessary.

`Replica.Snapshot` will generate a UUID and capture a RocksDB
snapshot. The UUID is returned to raft as the contents of the snapshot
(along with the metadata required by raft). The UUID and RocksDB
snapshot are saved in attributes of the `Replica`. In
`sendRaftMessage`, we inspect all outgoing `MsgSnap` messages. If a
message's UUID doesn't match our saved UUID, we discard the message
(this shouldn't happen). If it matches, we begin sending
`SnapshotRequests` as described above; the `MsgSnap` itself is held
back and sent in the snapshot's `Header`.

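A rough sketch of that indirection; the `Replica` fields
(`outgoingSnapshotID`, `outgoingEngineSnapshot`), the engine,
transport, and UUID helpers are all named here purely for
illustration.

``` go
// Snapshot is called by etcd/raft. Rather than materializing the range data,
// it captures an engine snapshot and hands raft only a UUID as the payload.
func (r *Replica) Snapshot() (raftpb.Snapshot, error) {
	snapUUID := uuid.MakeV4()
	r.mu.Lock()
	r.mu.outgoingSnapshotID = snapUUID
	r.mu.outgoingEngineSnapshot = r.store.Engine().NewSnapshot()
	r.mu.Unlock()
	return raftpb.Snapshot{
		Data:     snapUUID.GetBytes(),
		Metadata: r.snapshotMetadata(), // applied index, term, conf state
	}, nil
}

// sendRaftMessage diverts MsgSnap onto the streaming RPC; everything else
// continues to use the existing asynchronous raft transport.
func (r *Replica) sendRaftMessage(msg raftpb.Message) {
	if msg.Type != raftpb.MsgSnap {
		r.store.transport.Send(msg)
		return
	}
	r.mu.Lock()
	id, engineSnap := r.mu.outgoingSnapshotID, r.mu.outgoingEngineSnapshot
	r.mu.Unlock()
	if !bytes.Equal(msg.Snapshot.Data, id.GetBytes()) {
		return // stale MsgSnap for a snapshot we no longer hold (shouldn't happen)
	}
	// Stream the snapshot data; the MsgSnap itself travels in the Header.
	go r.streamSnapshot(engineSnap, msg)
}
```
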
## Recipient implementation

Applying snapshots in a streaming fashion introduces some subtleties
around concurrency, so the initial implementation of streaming
snapshots will continue to apply the snapshot as one unit.

### Phase 1

The recipient will accumulate all `SnapshotRequests` in memory, into a
RocksDB `Batch` keyed by the UUID from the header's `raftpb.Message`.
It sends the header's `MsgSnap` into raft, and if `raft.Ready` returns
a snapshot to be applied with the given UUID (this is not guaranteed),
the buffered snapshot will be committed.

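A sketch of this phase 1 receive path; the engine and batch methods,
the `pendingSnapshots` map, and the raft handoff are illustrative
names rather than existing APIs.

``` go
// receiveSnapshot (phase 1) buffers the entire inbound stream into a single
// engine batch, then hands the header's MsgSnap to raft. The batch is only
// committed if raft's Ready later returns a snapshot carrying the same UUID.
func (s *Store) receiveSnapshot(
	header *SnapshotRequest_Header, stream MultiRaft_RaftSnapshotServer,
) error {
	batch := s.engine.NewBatch()
	var logEntries [][]byte
	for {
		req, err := stream.Recv()
		if err != nil {
			batch.Close()
			return err // the stream broke before `final`; nothing was applied
		}
		if len(req.KVBatch) > 0 {
			if err := batch.ApplyBatchRepr(req.KVBatch); err != nil {
				batch.Close()
				return err
			}
		}
		logEntries = append(logEntries, req.LogEntries...)
		if req.Final {
			break
		}
	}
	snapUUID, err := uuid.FromBytes(header.RaftMessageRequest.Message.Snapshot.Data)
	if err != nil {
		batch.Close()
		return err
	}
	// Stash the buffered snapshot; applySnapshot commits the batch only if the
	// UUID raft hands back in Ready matches, and discards it otherwise.
	s.mu.Lock()
	s.mu.pendingSnapshots[snapUUID] = pendingSnapshot{batch: batch, logEntries: logEntries}
	s.mu.Unlock()
	return s.handleRaftMessage(&header.RaftMessageRequest)
}
```
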
### Phase 2

In phase 2, chunks of the snapshot are applied as they come in,
instead of being buffered into a single RocksDB batch. Because this
leaves the replica in a visibly inconsistent state, it cannot be used
for anything else during this process.

In this mode, the `MsgSnap` is sent to raft at the beginning of the
process instead of the end. If raft tells us to apply the snapshot, we
destroy our existing data to make room for the snapshot. Once we have
done so, we cannot do anything else with this replica (including
sending any raft messages, especially the `MsgAppResp` that raft asks
us to send when it gives us the snapshot) until we have consumed and
applied the entire stream of snapshot data.

Error handling here is tricky: we've already discarded our old data,
so we can't do anything else until we apply a snapshot. If the stream
is closed without sending the final snapshot packet, we must mark the
replica as corrupt.

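A sketch of that phase 2 ordering, again with illustrative helper
names; the essential point is that the old range data is destroyed
before the stream has been fully consumed, so a broken stream can
only be handled by marking the replica corrupt.

``` go
// applyStreamingSnapshot (phase 2) runs after the header's MsgSnap has been
// stepped into raft and raft has told us to apply the snapshot. From the
// moment the old data is destroyed, the replica can do nothing else (not even
// send the MsgAppResp raft requested) until the stream is fully applied.
func (r *Replica) applyStreamingSnapshot(stream MultiRaft_RaftSnapshotServer) error {
	if err := r.destroyDataForSnapshot(); err != nil {
		return err
	}
	for {
		req, err := stream.Recv()
		if err != nil {
			// The old data is gone and the new data never fully arrived:
			// there is nothing consistent left to serve, so mark the replica
			// corrupt (recovery via a fresh snapshot is an open question).
			r.maybeSetCorrupt(err)
			return err
		}
		if len(req.KVBatch) > 0 {
			if err := r.applyKVBatch(req.KVBatch); err != nil {
				r.maybeSetCorrupt(err)
				return err
			}
		}
		if err := r.appendLogEntries(req.LogEntries); err != nil {
			r.maybeSetCorrupt(err)
			return err
		}
		if req.Final {
			break
		}
	}
	// Only now is it safe to respond to raft (including the held MsgAppResp)
	// and resume normal operation on this replica.
	return r.finishSnapshotApplication()
}
```
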
# Drawbacks

- More complexity
- Phase 2 introduces new sources of replica corruption errors
- More exposure to raft implementation details

# Alternatives

- Upstream changes to the `raft.Storage` interface (to pass the
  recipient ID to the `Snapshot` method) could simplify things a bit
  on the sender side; a sketch of what that might look like follows
  below.

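For concreteness, the relevant part of today's `raft.Storage`
interface and the kind of hypothetical signature change this
alternative refers to:

``` go
// Today, etcd/raft's Storage interface does not say who the snapshot is for,
// which is why the UUID indirection above is needed.
type Storage interface {
	// ... other methods elided ...
	Snapshot() (raftpb.Snapshot, error)
}

// A hypothetical upstream change would pass the recipient's replica ID, e.g.:
//
//	Snapshot(to uint64) (raftpb.Snapshot, error)
//
// letting Replica.Snapshot know its destination without saving state on the side.
```
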
# Unresolved questions

- Can this replace the reservation system, or do we need both? It
  should probably be integrated, but they're not quite the same - for
  preemptive snapshots we want to return `DECLINED` so the replication
  queue can pick another target, but raft-generated snapshots cannot
  be declined (since they cannot be retried elsewhere) and instead
  should be queued up until space is available.
- It should be possible to recover from a failed snapshot by receiving
  a new snapshot without marking the replica as corrupt. What would be
  required to make this work?