
     1  - Feature Name: Raft Proposal Sideloading for SSTable ingestion
     2  - Status: completed
     3  - Start Date: 2017-05-24
     4  - Authors: Dan Harrison and Tobias Schottdorf, original suggestion Ben Darnell
     5  - RFC PR: #16159
     6  - Cockroach Issue: #16263
     7  
     8  # Summary and non-goals
     9  
Allow (small) SSTable ingestions to be proposed without putting the SSTable in
    11  the Raft log. An SSTable ingestion entails giving a file to RocksDB for direct
    12  inclusion in its underlying LSM. For such proposals, only metadata is stored in
    13  the Raft log, and the actual payload is written directly to the file system.
    14  This happens transparently on each node, i.e. the proposal is sent over the
    15  wire like a regular Raft proposal, with the payload inlined.
    16  
    17  It is explicitly a non-goal to deal with (overly) large proposals, as none of
    18  the Raft interfaces expect proposals of nontrivial size. It is not expected that
    19  sideloadable proposals will exceed a few MB in size.
    20  
Specifying the creation of the SSTable and the details of its ingestion has its
own subtleties and is left to a sister PR, to appear shortly.
    23  
    24  # Motivation
    25  
    26  `RESTORE` needs a fast mechanism to ingest data into a Range. It wants to rely
    27  on Raft for simplicity, but the naive approach of sending a (chunked)
    28  `WriteBatch` has downsides:
    29  
    30  1. all of the data is written to the Raft log, where it incurs a large write
    31     amplification, and
    32  1. applying a large `WriteBatch` to RocksDB incurs another sizable write
    33     amplification factor: it is far better to link SSTables directly into the
    34     LSM.
    35  
    36  Instead, `RESTORE` sends small (initially ~2MB) SSTables which are to be linked
    37  directly into RocksDB. This eliminates the latter point above. However, it does
    38  not address the former, and this is what this RFC is about: achieving the
    39  optimal write amplification of 1 or 2 (the latter amounting to dodging some
    40  technical difficulties discussed at the end).
    41  
    42  Note also that housing the Raft log outside of RocksDB only addresses the former
point if it results in an SSTable being stored inside the RocksDB directory,
    44  which is likely to be a non-goal of that project.
    45  
# Detailed design
    47  
    48  The mechanism proposed here is the following:
    49  
    50  1. `storagebase.RaftCommand` gains fields `SideloadedHash []byte` (on
    51     `ReplicatedEvalResult`) and `SideloadedData []byte` (next to `WriteBatch`)
    52     which are unused for regular Raft proposals.
1. When a proposal is to be sideloaded, a regular proposal is generated, but
   with the sideloaded data and its hash populated. In the concrete case of
   `SSTable` this means that `evalAddSSTable` creates an essentially empty
   proposal (perhaps accounting for an MVCC stats delta).
1. On its way into Raft, the `SideloadedData` bytes are written to disk and
   stripped from the proposal. On disk, the payload is stored in the `Store`'s
   directory, accessible via its log index, using the following scheme (for an
   explanation of why this scheme was chosen, see later sections; a sketch of
   the resulting layout follows this list):
    62  
    63     ```
    64     <storeDir>/sideloaded_proposals/<rangeID>.<replicaID>/<logIndex>.<term>
    65     ```
    66  
    67     where the file contains the raw sideloaded data, e.g. an `SSTable` that will
    68     populate `(storagebase.RaftCommand).SideloadedData`.
    69  1. Whenever the content of a sideloaded proposal is required, a `raftpb.Entry`
    70     is reconstituted from the corresponding file, while verifying the checksum in
    71     `SideloadedHash`. A failure of either operation is treated as fatal (a
    72     `ReplicaCorruptionError`). Note that restoring an entry is expensive: load
    73     the payload from disk, unmarshal the `RaftCommand`, compute the checksum and
    74     compare it to that stored in `RaftCommand`, put the payload into the
    75     `RaftCommand`, marshal the `RaftCommand`. The Raft entries cache should help
    76     mitigate this cost, though we should watch the overhead closely to check
    77     whether we can perhaps eagerly populate the cache when proposing.
    78  1. When such an entry is sent to the follower over the network
    79     (`sendRaftMessage`), the original proposal is sent (via the mechanism above).
    80     Note that this could happen via `append` or via the snapshotting mechanism
    81     (`sendSnapshot` or rather its child `iterateEntries`), which sends a part of
    82     the Raft log along with the snapshot.
1. Apply-time side effects have access to the payload and can make use of it
   under the contract that they do not alter the on-disk file. For the SSTable
   use case, when a sideloaded entry is applied to the state machine, the
   corresponding file is copied into the RocksDB directory (hard-linking
   instead being an optimization discussed later) and, from there, ingested by
   RocksDB.
    88  
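To make the on-disk layout concrete, here is a minimal sketch (in Go) of a
helper that derives the storage location from the scheme above. The function
and parameter names are illustrative only and do not correspond to the actual
implementation.

```go
package sideload

import (
	"fmt"
	"path/filepath"
)

// sideloadedFilename (hypothetical) maps a sideloaded payload to its on-disk
// location, following the layout
// <storeDir>/sideloaded_proposals/<rangeID>.<replicaID>/<logIndex>.<term>.
func sideloadedFilename(storeDir string, rangeID, replicaID, index, term uint64) string {
	dir := filepath.Join(storeDir, "sideloaded_proposals",
		fmt.Sprintf("%d.%d", rangeID, replicaID))
	return filepath.Join(dir, fmt.Sprintf("%d.%d", index, term))
}
```

Disambiguating by `term` in the file name and by `replicaID` in the directory
name is what enables the lazy cleanup strategies described in the sections
below.
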
    89  ## Determining whether a `raftpb.Entry` is sideloaded
    90  
    91  The principal difficulty is sniffing sideloadable proposals from a
    92  `(raftpb.Entry).Data` byte slice. This is necessary since sideloading happens at
    93  that fairly low level, and unmarshalling every single message squarely in the
    94  hot write path is clearly not an option. `(raftpb.Entry).Data` equals
    95  `encodeRaftCommand(p.idKey, data)`, where `data` is a marshaled
    96  `storagebase.RaftCommand`, resulting in
    97  
    98  ```
    99  (raftpb.Entry).Data = append(raftCommandEncodingVersion, idKey, data)
   100  ```
   101  
   102  We can't hide information in `data` as we would have to unmarshal it too often,
   103  and so we make the idiomatic choice: Sideloaded Raft proposals are sent using a
   104  new `raftCommandEncodingVersion`. See the next section for migrations.
   105  
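As an illustration, sniffing then reduces to inspecting the leading version
byte. The sketch below assumes a hypothetical `raftVersionSideloaded` constant
next to the existing encoding version; the names and values are placeholders,
not the real constants.

```go
package sideload

// Version bytes prepended by encodeRaftCommand; names and values here are
// placeholders for the real constants.
const (
	raftVersionStandard   byte = 1 // assumed: the pre-existing raftCommandEncodingVersion
	raftVersionSideloaded byte = 2 // assumed: the new version introduced by this RFC
)

// isSideloaded reports whether the entry's Data was encoded with the
// sideloaded command version, looking only at the leading version byte and
// without unmarshalling the RaftCommand.
func isSideloaded(data []byte) bool {
	return len(data) > 0 && data[0] == raftVersionSideloaded
}
```
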
Armed with that, `(*Replica).append` and the other methods above can ensure that
   107  there is no externally visible change to the way Raft proposals are handled, and
   108  we can drop sideloaded proposals directly on disk, where they are accessible via
   109  their log index.
   110  
   111  TODO: use cmdID instead? Or can we do something else entirely?
   112  
   113  ## Migration story
   114  
   115  The feature will initially only be used for SSTable ingestion, which remains
   116  behind a feature flag. An upgrade is carried out as follows:
   117  
   118  1. rolling cluster restart to new version
   119  1. enable the feature flag
   120  1. restore now uses SSTable ingestion
   121  
A downgrade is slightly less straightforward but should work out fine in practice:
   123  
   124  1. disable the feature flag
   125  1. wait until it's clear that no SSTable ingestions are active anywhere (i.e.
   126     any such proposals have been applied by all replicas affected by them)
   127  1. either wait until all ingestions have been purged from the log (hard to
   128     control) or be prepared to manually remove the sideload directory after the
   129     downgrade (to avoid stale files that are never cleaned up).
   130  1. rolling cluster restart to old version.
1. in the unlikely event of a node crashing, upgrade that node again and go
   back to the first step
   133  
   134  Due to the feature flag, rolling updates are possible as usual.
   135  
   136  ## Details on file creation and removal
   137  
   138  There are three procedures that need to mutate sideloaded files. These are, in
   139  increasing difficulty, replica GC, log truncation, and `append()`. The main
   140  objectives here are making sure that we move the bulk of disk I/O outside of
   141  critical locks, and that all files are eventually cleaned up.
   142  
   143  ### Replica GC
   144  
   145  When a Replica is garbage collected, we know that if it is ever recreated, then
   146  it will be at a higher `ReplicaID`. For that reason, the sideloaded proposals
   147  are disambiguated by `ReplicaID`; we can postpone cleanup of these files until
   148  we don't hold any locks. During node startup, we check for sideloaded proposals
   149  that do not correspond to a Replica of their Store and remove them.
   150  
   151  Concretely,
   152  
   153  - after GC'ing the Replica, Replica GC wholesale removes the directory
   154    `<storeDir>/sideloaded_proposals/<rangeID>.<replicaID>` after releasing all
   155    locks.
- on node startup, after all Replicas have been instantiated but before
  receiving traffic, delete those directories which do not correspond to a live
  replica (see the sketch after this list). Assert that there is no directory
  for a replicaID larger than what we have instantiated.
   160  
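A rough sketch of the startup scan, with hypothetical names; `live` stands in
for whatever mapping from rangeID to the store's current replicaID the real
code would consult.

```go
package sideload

import (
	"fmt"
	"os"
	"path/filepath"
)

// cleanupAbandonedDirs removes sideloaded directories that do not correspond
// to a live replica of this store. The RFC additionally asserts that no
// directory carries a replicaID higher than the one we have instantiated.
func cleanupAbandonedDirs(storeDir string, live map[uint64]uint64) error {
	base := filepath.Join(storeDir, "sideloaded_proposals")
	dirs, err := os.ReadDir(base)
	if err != nil {
		if os.IsNotExist(err) {
			return nil // nothing was ever sideloaded on this store
		}
		return err
	}
	for _, d := range dirs {
		var rangeID, replicaID uint64
		// Directories are named <rangeID>.<replicaID>; skip anything else.
		if _, err := fmt.Sscanf(d.Name(), "%d.%d", &rangeID, &replicaID); err != nil {
			continue
		}
		if liveID, ok := live[rangeID]; !ok || liveID != replicaID {
			if err := os.RemoveAll(filepath.Join(base, d.Name())); err != nil {
				return err
			}
		}
	}
	return nil
}
```
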
   161  ### Log truncation
   162  
   163  Similarly to Replica GC, once the log is truncated to some new first index `i`,
   164  we know that no older sideloaded data is ever going to be loaded again, and we
   165  can lazily and without any locks unlink all files for which `index < i`
   166  (regardless of the term).
   167  
In case of a crash before this cleanup, these files will be deleted either with
the next truncation or when the replica is garbage collected.
   170  
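A minimal sketch of this lazy purge, again with illustrative names.

```go
package sideload

import (
	"fmt"
	"os"
	"path/filepath"
)

// purgeTruncated unlinks every sideloaded payload whose log index precedes the
// new first index, regardless of term. It is safe to run this lazily and
// without holding locks; a crash simply leaves the work for the next pass.
func purgeTruncated(replicaDir string, firstIndex uint64) error {
	files, err := os.ReadDir(replicaDir)
	if err != nil {
		if os.IsNotExist(err) {
			return nil
		}
		return err
	}
	for _, f := range files {
		var index, term uint64
		// Files are named <logIndex>.<term>; skip anything else.
		if _, err := fmt.Sscanf(f.Name(), "%d.%d", &index, &term); err != nil {
			continue
		}
		if index < firstIndex {
			if err := os.Remove(filepath.Join(replicaDir, f.Name())); err != nil {
				return err
			}
		}
	}
	return nil
}
```
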
   171  ### append()
   172  
   173  This is the interesting one. `append()` is called by Raft when it wants us to
   174  store new entries into the log.
   175  
Once we commit the corresponding RocksDB batch, any sideloaded payloads must be
on disk as well, or an ill-timed crash would lead to log entries which are
acknowledged but have no payload associated with them, a situation that is
essentially impossible to recover from (strictly speaking it is recoverable,
since we crash before sending a message out to the lease holder and could
remove the entry from the log again at node startup, but that is a bad idea).
So we have to make sure that all sideloaded payloads are written to disk before
the batch passed to `append()` is committed, and that obsolete payloads are
removed *after*.
   184  
Initially, we will write the files directly in `append()`, which means they are
all on disk when the batch commits, but they are written while holding the
`raftMu` and `replicaMu` locks, which is not ideal. However, it should be
relatively easy to optimize this by eagerly creating the files much earlier,
outside of the locks; this is made possible since we disambiguate by term, and
once a payload for a given index and term has been written, it will not be
changed in that term.
   191  
   192  An interesting subcase arises when Raft wants us to **replace** our tail of the
   193  log, which would then have a term bump associated with it. In particular, we may
   194  need to replace a sideloaded entry with a different higher-term sideloaded
   195  entry. Thanks to disambiguation by term, both payloads can exist side by side,
and we can first write the higher-term payload and then remove the replaced one.
   197  
We must tolerate the existence of a file with a higher term than we know of,
though this situation should be short-lived, as our Replica should learn about
that higher term shortly.
   201  
In summary, what we do is (a sketch of the payload write follows this list):
   203  
   204  - write the new payloads as early as possible; initially in `append()` but
   205    theoretically before acquiring any locks.
   206  - tolerate existing files - they are identical (as a `(term,index)` can be
   207    assigned only one log entry). Note how this encourages the write-early
   208    optimization.
   209  - remove outdated payloads as late as possible; initially after `append()`'s
   210    batch commits, later after releasing any locks.
   211  
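The sketch below illustrates the payload write mentioned in the summary, under
the assumption of a simple write-file helper; names, error handling, and
durability details are illustrative.

```go
package sideload

import (
	"fmt"
	"os"
	"path/filepath"
)

// writePayload stores a sideloaded payload before the corresponding append()
// batch commits. An already existing file for the same (index, term) is left
// alone: a given (term, index) can carry only one log entry, so the contents
// are necessarily identical. A real implementation would also sync the file
// and its directory; that is elided here.
func writePayload(replicaDir string, index, term uint64, payload []byte) error {
	if err := os.MkdirAll(replicaDir, 0755); err != nil {
		return err
	}
	filename := filepath.Join(replicaDir, fmt.Sprintf("%d.%d", index, term))
	if _, err := os.Stat(filename); err == nil {
		return nil // already written, possibly by an early, lock-free write
	}
	return os.WriteFile(filename, payload, 0644)
}
```
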
   212  ## Details on reconstituting a `raftpb.Entry`
   213  
Whenever a `raftpb.Entry` is required for transmission to a follower, we have to
reconstitute it (i.e. inline the payload). This is straightforward, though a bit
expensive; a sketch follows the list below.
   217  
- check the `raftCommandEncodingVersion` (which can be sniffed from the `Data`
  slice); if it's not a sideloaded entry, do nothing. Otherwise:
   220  - decode the command, unmarshal the contained `cmd storagebase.RaftCommand`
- load the on-disk payload into `cmd.SideloadedData` (term, replicaID and log
  index are known at this point) and compare its hash with
  `cmd.SideloadedHash`, failing on mismatch.
   224  - marshal and encode the command into a new `raftpb.Entry`, using the new raft
   225    command version.
   226  
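The sketch below illustrates the load-and-verify step with stand-in types; the
checksum polynomial, the hash encoding, and the error type are assumptions, not
the actual code.

```go
package sideload

import (
	"bytes"
	"encoding/binary"
	"errors"
	"hash/crc32"
	"os"
)

// raftCommand stands in for storagebase.RaftCommand, reduced to the two fields
// this sketch needs.
type raftCommand struct {
	SideloadedData []byte
	SideloadedHash []byte
}

// inlinePayload loads the on-disk payload, verifies it against the stored
// hash, and inlines it into the command so the entry can be re-marshaled for
// transmission. The RFC treats a mismatch as a ReplicaCorruptionError.
func inlinePayload(cmd *raftCommand, payloadPath string) error {
	payload, err := os.ReadFile(payloadPath)
	if err != nil {
		return err
	}
	var sum [4]byte
	binary.BigEndian.PutUint32(sum[:], crc32.ChecksumIEEE(payload))
	if !bytes.Equal(sum[:], cmd.SideloadedHash) {
		return errors.New("sideloaded payload does not match SideloadedHash")
	}
	cmd.SideloadedData = payload
	return nil
}
```
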
   227  ## Hash function
   228  
   229  Speed matters in this application, and RocksDB internally uses CRC32. For that
   230  reason, CRC32 is deemed sufficient here.
   231  
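For reference, Go's standard library provides this directly; whether the
implementation ends up using the IEEE polynomial or Castagnoli (CRC32-C, the
one RocksDB uses for its block checksums) is an implementation detail, so the
choice below is illustrative.

```go
package sideload

import "hash/crc32"

// castagnoliTable selects the CRC32-C polynomial, mirroring what RocksDB uses
// for its block checksums; the RFC only requires "CRC32", so this specific
// polynomial is an illustrative choice.
var castagnoliTable = crc32.MakeTable(crc32.Castagnoli)

// payloadChecksum computes the checksum that would be stored alongside the
// sideloaded proposal (e.g. in SideloadedHash).
func payloadChecksum(payload []byte) uint32 {
	return crc32.Checksum(payload, castagnoliTable)
}
```
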
   232  # Drawbacks
   233  
   234  There is some overlap with the project to move the Raft log out of RocksDB.
   235  However, the end goal of that is not necessarily compatible with this RFC, and
   236  the changes proposed here are agnostic of changes in the Raft log backend.
   237  
   238  # Unresolved questions
   239  
   240  ## Optimization for 1x write amplification: hard-link into RocksDB directory
   241  
   242  If we are aiming high and want 1x write amplification, then we do not want to
   243  copy the file to RocksDB; we want to hard-link it there. However, RocksDB may
   244  *alter* the SSTable. One case in which it does this is that it may set an
   245  internal sequence number used for RocksDB transactions; this happens when there
   246  are open RocksDB snapshots, for example.
   247  
Knowing that this can happen, we can avoid it: the SSTable is always created
with a zero sequence number, and we can ignore any updated on-disk sequence
number when reading the file back from disk to treat it as a log entry.
   251  
   252  However, we need to be very sure that RocksDB will not perform other
   253  modifications to the file that could trip the consistency checker.
   254  
   255  We'll exclude this from the initial implementation, though it seems
   256  straightforward enough to add later without migration headaches.
   257  
   258  ### Complications of using hard links
   259  
The optimization above uses hard linking for SSTable ingestion to achieve
minimal write amplification. Hard links may not be supported across the board
(think: virtualized environments), and we crucially rely on the fact that a
file is only deleted once all hard links referencing it have been. In these
situations, an extra copy can be made instead; this will be exposed as either a
hidden knob or a cluster setting.
   266  
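A sketch of the hand-off with a copy fallback, assuming a knob-driven boolean;
names are illustrative.

```go
package sideload

import (
	"io"
	"os"
)

// linkOrCopy hands a sideloaded SSTable to RocksDB: a hard link keeps write
// amplification at 1x, while the fallback copy covers filesystems (or
// deployments) where hard links are unavailable or undesirable.
func linkOrCopy(sideloadedPath, ingestPath string, allowHardLink bool) error {
	if allowHardLink {
		if err := os.Link(sideloadedPath, ingestPath); err == nil {
			return nil
		}
		// Fall through to a plain copy if the link fails.
	}
	src, err := os.Open(sideloadedPath)
	if err != nil {
		return err
	}
	defer src.Close()
	dst, err := os.Create(ingestPath)
	if err != nil {
		return err
	}
	defer dst.Close()
	if _, err := io.Copy(dst, src); err != nil {
		return err
	}
	return dst.Sync()
}
```
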
   267  ## Usefulness of generalization
   268  
   269  Sideloading could be useful for bulk INSERTs (non-RESTORE data loading) and
   270  DeleteRange, or more generally any proposal that's large enough to profit from
   271  reduced write amplification compared to what we have today. However, moving the
Raft log out of RocksDB likely already addresses that suitably.
   273  