- Feature Name: Dedicated storage engine for Raft
- Status: postponed
- Start Date: 2017-05-25
- Authors: Irfan Sharif
- RFC PR: [#16361](https://github.com/cockroachdb/cockroach/pull/16361)
- Cockroach Issue(s):
  [#7807](https://github.com/cockroachdb/cockroach/issues/7807),
  [#15245](https://github.com/cockroachdb/cockroach/issues/15245)

# Summary

At the time of writing, each
[`Replica`](https://github.com/cockroachdb/cockroach/blob/ea3b2c499/pkg/storage/replica.go#L214)
is backed by a single instance of RocksDB
([`Store.engine`](https://github.com/cockroachdb/cockroach/blob/ea3b2c499/pkg/storage/store.go#L391))
which is used to store all modifications to the underlying state machine in
_addition_ to storing all consensus state. This RFC proposes separating the
two, outlines the motivations for doing so, and describes the alternatives
considered.

# Motivation

Raft's RPCs typically require the recipient to persist information to stable
storage before responding. This 'persistent state' comprises the latest term
the server has seen, the candidate voted for in the current term (if any),
and the raft log entries themselves<sup>[1]</sup>. Modifications to any of the
above are [synchronously
updated](https://github.com/cockroachdb/cockroach/pull/15366) on stable storage
before responding to RPCs.

In our usage of RocksDB, data is only persisted when explicitly issuing a write
with [`sync =
true`](https://github.com/cockroachdb/cockroach/blob/ea3b2c499/pkg/storage/engine/db.cc#L1828).
Internally this also persists previously unsynchronized writes<sup>[2]</sup>.

Let's consider a sequential write-only workload on a single node cluster. The
internals of the Raft/RocksDB+Storage interface can be simplified to the
following (a rough code sketch of this flow follows the list):
  - Convert the write command into a Raft proposal and
    [submit](https://github.com/cockroachdb/cockroach/blob/ea3b2c499/pkg/storage/replica.go#L2811)
    the proposal to the underlying raft group
  - 'Downstream' of raft we
    [persist](https://github.com/cockroachdb/cockroach/blob/ea3b2c499/pkg/storage/replica.go#L3120)
    the newly generated log entry corresponding to the command
  - We record the modifications to the underlying state machine but [_do
    not_](https://github.com/cockroachdb/cockroach/blob/ea3b2c499/pkg/storage/replica.go#L4208)
    persist this synchronously
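
Concretely, the following sketch approximates the flow above. Everything here
(the `syncEngine` interface, its `commit` method, and the helper) is an
illustrative stand-in and not the actual `pkg/storage` API:

```go
// syncEngine is a stand-in for the storage engine; sync=true forces the batch
// (and any previously unsynchronized writes on the same engine) to stable
// storage.
type syncEngine interface {
    commit(batch []byte, sync bool) error
}

// handleWrite sketches the per-command work described in the list above.
func handleWrite(eng syncEngine, cmd []byte) error {
    // 1. The command has already been converted into a raft proposal and
    //    submitted to the replica's raft group (elided here).

    // 2. The newly generated raft log entry must be durable before we respond,
    //    so this commit is synchronized.
    if err := eng.commit(cmd /* raft log entry */, true /* sync */); err != nil {
        return err
    }

    // 3. The state machine modification is recorded but not synchronized; it
    //    only reaches disk when a later synchronized commit flushes it.
    return eng.commit(cmd /* state machine write */, false /* sync */)
}
```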

One can see that for the `n+1-th` write, upon persisting the corresponding raft
log entry, we also end up persisting the state machine modifications from the
`n-th` write. It is worth mentioning here that asynchronous writes are often
more than a thousand times as fast as synchronous writes<sup>[3]</sup>. Given
our current usage of the same RocksDB instance for both the underlying state
machine _and_ the consensus state, we effectively forego (for this particular
workload at least) the performance gain to be had in not persisting state
machine modifications. For `n` writes we have `n` unsynchronized and `n`
synchronized writes, where for `n-1` of them we also flush `n-1` earlier
unsynchronized writes to disk.

By having a dedicated storage engine for Raft's persistent state we can address
this specific sub-optimality. By isolating the two workloads (synchronized and
unsynchronized writes) into separately running storage engines, synchronized
writes no longer flush previously unsynchronized ones to disk: for `n` writes
we still have `n` unsynchronized and `n` synchronized writes, but each
synchronized write carries a smaller payload than in the shared-engine case
above.

# Benchmarks

As a sanity check we ran some initial benchmarks that give us a rough idea of
the performance gain to expect from this change. We benchmarked the sequential
write-only workload described in the section above and did so at the
`pkg/storage{,/engine}` layers. What follows is redacted, simplified code of
the original benchmarks and the results demonstrating the speedups.

To simulate our current implementation (the `n+1-th` synchronized write
persists the `n-th` unsynchronized write) we alternate between synchronized and
unsynchronized writes `b.N` times in `BenchmarkBatchCommitSharedEngine`.

```go
// pkg/storage/engine/bench_rocksdb_test.go

func BenchmarkBatchCommitSharedEngine(b *testing.B) {
    for _, valueSize := range []int{1 << 10, 1 << 12, ..., 1 << 20} {
        b.Run(fmt.Sprintf("vs=%d", valueSize), func(b *testing.B) {
            // ...
            for i := 0; i < b.N; i++ {
                {
                    // ...
                    batch := eng.NewWriteOnlyBatch()
                    MVCCBlindPut(batch, key, value)

                    // Representative of persisting a raft log entry.
                    batch.Commit(true)
                }
                {
                    // ...
                    batch := eng.NewWriteOnlyBatch()
                    MVCCBlindPut(batch, key, value)

                    // Representative of an unsynchronized state machine write.
                    batch.Commit(false)
                }
            }
        })
    }
}
```

To simulate the proposed workload (`n` synchronized and `n` unsynchronized
writes, independent of one another) we simply issue `b.N` synchronized and
`b.N` unsynchronized writes to two separate RocksDB instances in
`BenchmarkBatchCommitDedicatedEngines`.

```go
func BenchmarkBatchCommitDedicatedEngines(b *testing.B) {
    for _, valueSize := range []int{1 << 10, 1 << 12, ..., 1 << 20} {
        b.Run(fmt.Sprintf("vs=%d", valueSize), func(b *testing.B) {
            // ...
            for i := 0; i < b.N; i++ {
                {
                    // ...
                    batch := engA.NewWriteOnlyBatch()
                    MVCCBlindPut(batch, key, value)

                    // Representative of persisting a raft log entry.
                    batch.Commit(true)
                }
                {
                    // ...
                    batch := engB.NewWriteOnlyBatch()
                    MVCCBlindPut(batch, key, value)

                    // Representative of an unsynchronized state machine write.
                    batch.Commit(false)
                }
            }
        })
    }
}
```

```sh
~ benchstat perf-shared-engine.txt perf-dedicated-engine.txt
  name                      old time/op    new time/op     delta
  BatchCommit/vs=1024-4       75.4µs ± 4%     70.2µs ± 2%   -6.87%  (p=0.000 n=19+17)
  BatchCommit/vs=4096-4        117µs ± 5%      106µs ± 7%   -9.76%  (p=0.000 n=20+20)
  BatchCommit/vs=16384-4       325µs ± 7%      209µs ± 5%  -35.55%  (p=0.000 n=20+18)
  BatchCommit/vs=65536-4      1.05ms ±10%     1.08ms ±20%     ~     (p=0.718 n=20+20)
  BatchCommit/vs=262144-4     3.52ms ± 6%     2.81ms ± 7%  -20.30%  (p=0.000 n=17+18)
  BatchCommit/vs=1048576-4    11.2ms ±18%      7.8ms ± 5%  -30.56%  (p=0.000 n=20+20)

  name                      old speed      new speed       delta
  BatchCommit/vs=1024-4     13.6MB/s ± 4%   14.6MB/s ± 2%   +7.34%  (p=0.000 n=19+17)
  BatchCommit/vs=4096-4     34.9MB/s ± 5%   38.7MB/s ± 7%  +10.88%  (p=0.000 n=20+20)
  BatchCommit/vs=16384-4    50.5MB/s ± 8%   78.4MB/s ± 5%  +55.04%  (p=0.000 n=20+18)
  BatchCommit/vs=65536-4    62.6MB/s ± 9%   61.1MB/s ±17%     ~     (p=0.718 n=20+20)
  BatchCommit/vs=262144-4   74.5MB/s ± 5%   93.5MB/s ± 7%  +25.43%  (p=0.000 n=17+18)
  BatchCommit/vs=1048576-4  94.8MB/s ±16%  135.2MB/s ± 5%  +42.57%  (p=0.000 n=20+20)
```

NOTE: 64 KiB workloads don't exhibit the same performance increase as the
other workloads. This is unexpected and needs to be investigated further; see
[drawbacks](#drawbacks) for more discussion.

Similarly, we ran the equivalent benchmarks at the `pkg/storage` layer:

```go
// pkg/storage/replica_raftstorage_test.go

func BenchmarkReplicaRaftStorageSameEngine(b *testing.B) {
    for _, valueSize := range []int{1 << 10, 1 << 12, ..., 1 << 20} {
        b.Run(fmt.Sprintf("vs=%d", valueSize), func(b *testing.B) {
            // ...
            rep := tc.store.GetReplica(rangeID)
            rep.redirectOnOrAcquireLease()

            defer settings.TestingSetBool(&syncRaftLog, true)()

            for i := 0; i < b.N; i++ {
                // ...
                client.SendWrappedWith(rep, putArgs(key, value))
            }
        })
    }
}
```

To simulate the proposed workload (`n` synchronized and `n` unsynchronized
writes, independent of one another) we issue `b.N/2` unsynchronized commands
followed by `b.N/2` synchronized ones in
`BenchmarkReplicaRaftStorageDedicatedEngine`. To see why this is equivalent,
consider a sequence of alternating/interleaved synchronized and unsynchronized
writes where synchronized writes do not persist the previous unsynchronized
writes. If `S` is the time taken for a synchronized write and `U` is the time
taken for an unsynchronized one,
`S + U + S + U + ... + S + U == S + S + ... + S + U + U + ... + U`.

```go
// NOTE: syncApplyCmd is set to true to synchronize command applications (state
// machine changes) to persistent storage. Changes to pkg/storage/replica.go
// not shown here.
func BenchmarkReplicaRaftStorageDedicatedEngine(b *testing.B) {
    for _, valueSize := range []int{1 << 10, 1 << 12, ..., 1 << 20} {
        b.Run(fmt.Sprintf("vs=%d", valueSize), func(b *testing.B) {
            // ...
            rep := tc.store.GetReplica(rangeID)
            rep.redirectOnOrAcquireLease()

            defer settings.TestingSetBool(&syncRaftLog, false)()
            defer settings.TestingSetBool(&syncApplyCmd, false)()

            for i := 0; i < b.N/2; i++ {
                // ...
                client.SendWrappedWith(rep, putArgs(key, value))
            }

            defer settings.TestingSetBool(&syncRaftLog, true)()
            defer settings.TestingSetBool(&syncApplyCmd, true)()

            for i := b.N/2; i < b.N; i++ {
                // ...
                client.SendWrappedWith(rep, putArgs(key, value))
            }
        })
    }
}
```

```sh
~ benchstat perf-storage-alternating.txt perf-storage-sequential.txt
  name                             old time/op  new time/op  delta
  ReplicaRaftStorage/vs=1024-4      297µs ± 9%   268µs ± 3%   -9.73%  (p=0.000 n=10+10)
  ReplicaRaftStorage/vs=4096-4      511µs ±10%   402µs ± 1%  -21.29%  (p=0.000 n=9+10)
  ReplicaRaftStorage/vs=16384-4    2.16ms ± 2%  1.39ms ± 4%  -35.70%  (p=0.000 n=10+10)
  ReplicaRaftStorage/vs=65536-4    3.60ms ± 3%  3.49ms ± 4%   -3.17%  (p=0.003 n=10+9)
  ReplicaRaftStorage/vs=262144-4   10.3ms ± 7%  10.2ms ± 3%     ~     (p=0.393 n=10+10)
  ReplicaRaftStorage/vs=1048576-4  40.3ms ± 7%  40.8ms ± 3%     ~     (p=0.481 n=10+10)
```

# Detailed design

We propose introducing a second RocksDB instance to store all raft consensus
data. This RocksDB instance will be specific to a given store (similar to our
existing setup) and will be addressable via a new member variable on `type
Store`, namely `Store.raftEngine` (the existing `Store.engine` will stay as
is). This instance will consequently manage the raft log entries for all the
replicas that belong to that store, and its data will live in a `raft`
subdirectory under our existing RocksDB storage directory.
At the time of writing the keys that would need to be written to the new engine
are the log keys and `HardState`<sup>[4](#column-families)</sup>.
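
A minimal sketch of how the second instance could be wired up follows; the
helper, the `rocksDB` stand-in type, and `openRocksDB` are hypothetical
placeholders for the real constructors in `pkg/storage/engine`, and the
options are the "reasonable defaults" discussed under unresolved questions:

```go
// rocksDB is a stand-in for the engine handle, and openRocksDB for the real
// constructor in pkg/storage/engine (both hypothetical).
type rocksDB struct{ dir string }

func openRocksDB(dir string) (*rocksDB, error) { return &rocksDB{dir: dir}, nil }

// newRaftEngine opens the dedicated raft instance in a "raft" subdirectory of
// the store's existing data directory, using default (to-be-tuned) options.
func newRaftEngine(storeDir string) (*rocksDB, error) {
    raftDir := filepath.Join(storeDir, "raft")
    if err := os.MkdirAll(raftDir, 0755); err != nil {
        return nil, err
    }
    return openRocksDB(raftDir)
}
```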

## Implementation strategy

We will phase this in bottom-up by first instantiating the new RocksDB instance
with reasonable default configurations (see [unresolved
questions](#unresolved-questions) below) at the `pkg/storage` layer (as opposed
to using the user level store specifications provided via the `--stores` flag
in the `cockroach start` command). At the points where raft data is written out
to our existing RocksDB instance, we will additionally write it out to our new
instance. Following this, at any point where raft data is read, we will _also_
read from the new instance and compare the results. At that point we can wean
off all Raft specific reads/writes and log truncations from the old instance
and have them serviced by the new one.
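
The transitional dual-write/read-and-compare step might look roughly like the
sketch below; the `engine` interface and its methods are illustrative
stand-ins, not the real `pkg/storage/engine` API:

```go
// engine is a minimal stand-in for the storage engine (hypothetical).
type engine interface {
    put(key, value []byte, sync bool) error
    get(key []byte) ([]byte, error)
}

// putRaftData writes raft state to both instances during the transition.
func putRaftData(oldEng, raftEng engine, key, value []byte, sync bool) error {
    if err := oldEng.put(key, value, sync); err != nil {
        return err
    }
    return raftEng.put(key, value, sync)
}

// getRaftData reads from both instances and asserts that they agree; once
// we're confident, raft reads/writes/truncations are serviced exclusively by
// the new instance.
func getRaftData(oldEng, raftEng engine, key []byte) ([]byte, error) {
    oldVal, err := oldEng.get(key)
    if err != nil {
        return nil, err
    }
    newVal, err := raftEng.get(key)
    if err != nil {
        return nil, err
    }
    if !bytes.Equal(oldVal, newVal) {
        return nil, fmt.Errorf("raft engine divergence at key %q", key)
    }
    return newVal, nil
}
```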

Until the [migration story](#migration-strategy) is hashed out, it's worthwhile
structuring this transition behind an environment variable that determines
_which_ instance all Raft specific reads, writes and log truncations are
serviced from (see the sketch below). This same mechanism could also be used to
collect 'before and after' performance numbers (at the very least as a sanity
check). We expect individual writes going through the system to speed up (on
average, every synchronized raft log entry write currently also flushes out a
previously unsynchronized state machine transition). We expect read
performance to stay relatively unchanged.
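
The gate itself could be as simple as the following; the environment variable
name and the `store` fields shown here are hypothetical (the `engine` stand-in
is the one from the previous sketch):

```go
// store mirrors the fields proposed in this RFC; names are illustrative.
type store struct {
    engine     engine // existing instance: state machine data
    raftEngine engine // proposed instance: raft log entries + HardState
}

// raftEng returns the engine servicing raft reads/writes/truncations, gated
// behind a (hypothetical) environment variable during the transition.
func (s *store) raftEng() engine {
    if os.Getenv("COCKROACH_DEDICATED_RAFT_ENGINE") == "true" {
        return s.raftEngine
    }
    return s.engine
}
```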

**NB**: There's a subtle edge case to be wary of with respect to raft log
truncations: before truncating the raft log we need to ensure that the
application of the truncated entries has actually been persisted, i.e. for
`Put(k,v)` the primary RocksDB engine must have synced `(k,v)` before
truncating that `Put` operation from the Raft log.
Given we expect the `ReplicaState` to be stored in the first engine, consider
the case where we've truncated a set of log entries but the corresponding
`TruncatedState`, stored on the first engine, is _not_ yet synchronized to
disk. If the node crashes at this point it will fail to load the
`TruncatedState` and will have no way to bridge the gap between the last
persisted `ReplicaState` and the oldest entry in the truncated Raft log.</br>
To this end, whenever we truncate we need to _first_ sync the primary RocksDB
instance. Given RocksDB periodically flushes in-memory writes to disk, if we
can detect that the application of the entries to be truncated has _already_
been persisted, we can avoid this step. See [future work](#future-work) for a
possible extension to this.
We should note that this will be the only time we _explicitly_ sync the primary
instance to disk; the performance blips that arise due to this will be
interesting to study.
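
The ordering constraint might look roughly like the sketch below; the method
names and stand-in types are illustrative, not the actual `Replica` or engine
APIs:

```go
// syncedEngine and span are illustrative stand-ins (hypothetical).
type syncedEngine interface {
    sync() error
    deleteRange(start, end []byte) error
}

type span struct{ start, end []byte }

// truncateRaftLog sketches the sync-before-truncate rule described above.
func truncateRaftLog(primary, raftEng syncedEngine, appliedPersisted bool, truncated span) error {
    // If we cannot prove that the application of the entries being truncated
    // has already reached disk (e.g. via a background RocksDB flush), we must
    // explicitly sync the primary instance first. Otherwise a crash could
    // leave a gap between the last persisted ReplicaState and the oldest
    // surviving raft log entry.
    if !appliedPersisted {
        if err := primary.sync(); err != nil {
            return err
        }
    }
    // Only now is it safe to delete the truncated entries from the raft
    // engine (the TruncatedState itself lives on the primary engine).
    return raftEng.deleteRange(truncated.start, truncated.end)
}
```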

Following this, the `storage.NewStore` API will be amended to take two storage
engines instead of one (the new engine is to be used as dedicated raft
storage). This API change propagates through any testing code that
bootstraps/initializes a node in some way, shape, or form. At the time of
writing the tests affected are spread across `pkg/{kv,storage,server,ts}`.
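
Roughly, the amended constructor would grow one parameter, along the lines of
the sketch below (reusing the `store`/`engine` stand-ins from the earlier
sketches; the real signature takes additional parameters, elided here):

```go
// newStore now takes both engines; this extra parameter is what propagates
// through every test helper that bootstraps or initializes a Store.
func newStore(eng, raftEng engine) *store {
    return &store{engine: eng, raftEngine: raftEng}
}
```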

## Migration strategy

A general migration process, in addition to solving the problem for existing
clusters, could be used to move the raft RocksDB instance from one location to
another (such as to the non-volatile memory discussed below).

How do we actually start using this RocksDB instance? One thought is that
only new nodes would use a separate RocksDB instance for the Raft log.
An offline migration alternative that would work for existing clusters and
rolling restarts (sketched below) could be the following:
- We detect that we're at the new version with the changes that move the
  consensus state to a separate RocksDB instance, and that we have existing
  consensus data stored in the old instance
- We copy over _all_ consensus data (for all replicas on that given store) from
  the old instance to the new one and delete it from the old
- Once the node is up and running, all Raft specific reads/writes are directed
  to the new instance

We should note that we don't have precedent for an offline, store-level
migration at this time.
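
A rough sketch of that offline, store-level migration step follows; the
`migrationEngine` interface, the `raftKeySpans` helper, and the key/value
types are illustrative stand-ins for the real `pkg/storage/engine` and `keys`
packages:

```go
// Illustrative stand-ins (hypothetical).
type kv struct{ key, value []byte }
type keySpan struct{ start, end []byte }
type migrationEngine interface {
    scan(start, end []byte) ([]kv, error)
    put(key, value []byte) error
    deleteRange(start, end []byte) error
    sync() error
}

// raftKeySpans would enumerate the raft log and HardState key spans for every
// replica on the store (hypothetical).
func raftKeySpans() []keySpan { return nil }

// migrateRaftState copies all consensus data from the old engine into the
// dedicated raft engine and then removes it from the old one. It runs before
// the node starts serving traffic.
func migrateRaftState(oldEng, raftEng migrationEngine) error {
    for _, sp := range raftKeySpans() {
        kvs, err := oldEng.scan(sp.start, sp.end)
        if err != nil {
            return err
        }
        for _, entry := range kvs {
            if err := raftEng.put(entry.key, entry.value); err != nil {
                return err
            }
        }
    }
    // Make the copy durable before deleting anything from the old instance,
    // so a crash mid-migration cannot lose consensus data.
    if err := raftEng.sync(); err != nil {
        return err
    }
    for _, sp := range raftKeySpans() {
        if err := oldEng.deleteRange(sp.start, sp.end); err != nil {
            return err
        }
    }
    return nil
}
```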

An online approach that could enable live migrations moving consensus state
from one RocksDB instance to another would be the following:
- for a given replica we begin writing out all consensus state to _both_ RocksDB
  instances: the instance the consensus state is being migrated to and the
  instance it's being migrated from (at this point we still exclusively read
  from the old instance)
- at the next log truncation point we set a flag such that
    - all subsequent Raft specific reads are directed to the new instance
    - all subsequent Raft specific writes are _only_ directed to the new instance
- we delete the existing consensus state (pertaining to the given replica) from
  the old instance. This already happens (to some degree) in normal operation
  given we're truncating the log

The implementation of the latter strategy is out of scope for this RFC; the
offline store-level migration alternative should suffice.

## TODO
- Investigate 'Support for Multiple Embedded Databases in the same
  process'<sup>[5]</sup>

# Drawbacks

None are immediately obvious. We'll have to pay close attention to how the
separate RocksDB instances interact with one another; the performance
implications can be non-obvious and subtle given the sharing of hardware
resources (disk and/or OS buffers).
To demonstrate this, consider the following two versions of a benchmark. The
only difference is that the first uses a single loop body issuing `b.N`
interleaved synced and unsynced writes, while the second uses two loop bodies
issuing `b.N` synced writes and then `b.N` unsynced writes:

```go
func BenchmarkBatchCommitInterleaved(b *testing.B) {
    for _, valueSize := range []int{1 << 10, 1 << 12, ..., 1 << 20} {
        b.Run(fmt.Sprintf("vs=%d", valueSize), func(b *testing.B) {
            // ...
            for i := 0; i < b.N; i++ {
                // ...
                batchA := engA.NewWriteOnlyBatch()
                MVCCBlindPut(batchA, key, value)
                batchA.Commit(true)

                // ...
                batchB := engB.NewWriteOnlyBatch()
                MVCCBlindPut(batchB, key, value)
                batchB.Commit(false)
            }
        })
    }
}

func BenchmarkBatchCommitSequential(b *testing.B) {
    for _, valueSize := range []int{1 << 10, 1 << 12, ..., 1 << 20} {
        b.Run(fmt.Sprintf("vs=%d", valueSize), func(b *testing.B) {
            // ...
            for i := 0; i < b.N; i++ {
                // ...
                batchA := engA.NewWriteOnlyBatch()
                MVCCBlindPut(batchA, key, value)
                batchA.Commit(true)
            }

            for i := 0; i < b.N; i++ {
                // ...
                batchB := engB.NewWriteOnlyBatch()
                MVCCBlindPut(batchB, key, value)
                batchB.Commit(false)
            }
        })
    }
}
```

Here are the performance differences, especially stark for 64 KiB workloads:
```sh
~ benchstat perf-interleaved.txt perf-sequential.txt
  name                      old time/op    new time/op     delta
  BatchCommit/vs=1024-4       70.1µs ± 2%     68.6µs ± 5%   -2.15%  (p=0.021 n=8+10)
  BatchCommit/vs=4096-4        102µs ± 1%       97µs ± 7%   -4.10%  (p=0.013 n=9+10)
  BatchCommit/vs=16384-4       207µs ± 5%      188µs ± 4%   -9.46%  (p=0.000 n=9+9)
  BatchCommit/vs=65536-4      1.07ms ±12%     0.62ms ± 9%  -41.90%  (p=0.000 n=8+9)
  BatchCommit/vs=262144-4     2.90ms ± 8%     2.70ms ± 4%   -6.68%  (p=0.000 n=9+10)
  BatchCommit/vs=1048576-4    8.06ms ± 9%     7.90ms ± 5%     ~     (p=0.631 n=10+10)

  name                      old speed      new speed       delta
  BatchCommit/vs=1024-4     14.6MB/s ± 2%   14.9MB/s ± 4%   +2.22%  (p=0.021 n=8+10)
  BatchCommit/vs=4096-4     40.3MB/s ± 1%   42.1MB/s ± 7%   +4.37%  (p=0.013 n=9+10)
  BatchCommit/vs=16384-4    78.6MB/s ± 5%   87.4MB/s ± 3%  +11.09%  (p=0.000 n=10+9)
  BatchCommit/vs=65536-4    61.6MB/s ±13%  105.5MB/s ± 8%  +71.32%  (p=0.000 n=8+9)
  BatchCommit/vs=262144-4   90.6MB/s ± 7%   97.1MB/s ± 4%   +7.13%  (p=0.000 n=9+10)
  BatchCommit/vs=1048576-4   130MB/s ± 8%    133MB/s ± 5%     ~     (p=0.631 n=10+10)
```

Clearly the separately running instances are not as isolated as expected.

# Alternatives

An alternative considered was rolling our own WAL implementation optimized
for the Raft log usage patterns. Possible reasons for doing so:
- A native implementation in Go would avoid the CGo overhead we incur crossing
  the Go/C++ boundary
- SanDisk published a paper<sup>[6]</sup> (a shorter slideshow can be found
  [here](https://www.usenix.org/sites/default/files/conference/protected-files/inflow14_slides_yang.pdf))
  discussing the downsides of layering log systems on one another. Summary:
  - Increased write pressure - each layer/log has its own metadata
  - Fragmented logs - the 'upper level' log writes sequentially but the 'lower
    level' log gets mixed workloads, most likely random, destroying
    sequentiality
  - Unaligned segment sizes - garbage collection in 'upper level' log segments
    can result in data invalidation across multiple 'lower level' log segments

Considering how any given store could have thousands of replicas, an approach
where each replica maintains its own separate file for its WAL was a
non-starter. What we would really need is something that resembles a
multi-access, shared WAL (by multi-access here we mean there are multiple
logical append points in the log and each accessor is able to operate on only
its own logical section).

Consider what would be the most common operations:
- Accessing a given replica's raft log sequentially
- Prefix truncation of a given replica's raft log

A good first approximation would be allocating contiguous chunks of disk space
in sequence, each chunk assigned to a given accessor (a rough sketch follows
below). Should an accessor run out of allocated space, it seeks the next
available chunk and adds metadata linking the two (think linked lists). Though
this would enable fast sequential log access, log prefix truncations are
slightly trickier: do we truncate at chunk-sized boundaries, or truncate at
user-specified points and thus cause fragmentation?
Perusing open source implementations of WALs and some literature on the
subject, multi-access WALs tend to support _at most_ 10 accessors, let
alone thousands. Retrofitting this for our use case (a single store can have
1000s of replicas), we'd have to opt for a 'sharded store' approach where
appropriately sized sets of replicas share an instance of a multi-access WAL.
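
For illustration only (this is the alternative we are _not_ pursuing), the
chunked, multi-access WAL could be modeled roughly along these lines; all
names here are hypothetical:

```go
// chunk is a fixed-size region of the shared log file owned by one accessor.
type chunk struct {
    offset int64 // position of this chunk within the shared file
    length int64 // fixed chunk size
    next   int64 // offset of the accessor's next chunk, -1 if none (linked list)
}

// accessorLog is one replica's logical section of the shared WAL.
type accessorLog struct {
    chunks []chunk // this replica's chain of chunks, in log order
    head   int     // index of the chunk holding the oldest retained entry
}

// truncatePrefix drops whole chunks whose entries all precede the truncation
// point; anything finer-grained either wastes space in a partially truncated
// chunk or requires rewriting it, which is the fragmentation question above.
func (l *accessorLog) truncatePrefix(uptoChunk int) {
    if uptoChunk > l.head {
        l.head = uptoChunk
    }
}
```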

Taking into account all of the above, it was deemed that the implementation
overhead and the additional introduced complexity (higher level organization
with sharded stores) were not worth what could _possibly_ be a tiny performance
increase. We suspect a tuned RocksDB instance would be hard to beat unless we
GC aggressively, not to mention it's battle-tested. The internal knowledge base
for tuning and working with RocksDB is available at CockroachDB, so this
reduces future implementation risk as well.</br>
NOTE: At this time we have not explored potentially using
[dgraph-io/badger](https://github.com/dgraph-io/badger) instead.

Some Raft WAL implementations explored were the
[etcd/wal](https://github.com/coreos/etcd/tree/master/wal) implementation and
[hashicorp/raft](https://github.com/hashicorp/raft)'s LMDB
[implementation](https://github.com/hashicorp/raft-mdb). As stated above, the
complexity comes from managing logs for 1000s of replicas on the same
store.

# Unresolved questions

- RocksDB parameters/tuning for the Raft specific instance.
- We currently share a block cache across the multiple running RocksDB
  instances across stores in a node. Would a similar structure be beneficial
  here? Do we use the same block cache or have another dedicated one as well?

# Future work

Intel has demonstrated impressive performance increases by putting the raft log
in non-volatile memory instead of on disk (for etcd/raft)<sup>[7]</sup>. Given
we're proposing a separate storage engine for the Raft log, in the presence of
a more suitable hardware medium it should be easy enough to configure the Raft
specific RocksDB instance/multi-access WAL implementation to run on it. Even
without specialized hardware it might be desirable to configure the Raft and
regular RocksDB instances to use different disks.

RocksDB periodically flushes in-memory writes to disk; if we can detect which
writes have been persisted and use _that_ information to truncate the
corresponding raft log entries, we can avoid the (costly) explicit syncing of
the primary RocksDB instance. This is out of scope for this RFC.

As an aside, [@tschottdorf](https://github.com/tschottdorf):
> we should refactor the way `evalTruncateLog` works. It currently
> takes writes all the way through the proposer-evaluated KV machinery, and at
> least from the graphs it looks that that's enough traffic to impair Raft
> throughput alone. We could lower the actual ranged clear below Raft (after all,
> no migration concerns there). We would be relaxing, somewhat, the stats which
> are now authoritative and would then only become "real" once the Raft log had
> actually been purged all the way up to the TruncatedState. I think there's no
> problem with that.

# Footnotes
\[1\]: https://raft.github.io/raft.pdf </br>
\[2\]: https://github.com/facebook/rocksdb/wiki/RocksDB-FAQ </br>
\[3\]: https://github.com/facebook/rocksdb/wiki/Basic-Operations#asynchronous-writes </br>
<a name="column-families">\[4\]</a>: via
[@bdarnell](https://github.com/bdarnell) &
[@tschottdorf](https://github.com/tschottdorf):
  > We may want to consider using two column families for this, since the log
  > keys are _usually_ (log tail can be replaced after leadership change)
  > write-once and short-lived, while the hard state is overwritten frequently
  > but never goes away completely.

\[5\]: https://github.com/facebook/rocksdb/wiki/RocksDB-Basics#support-for-multiple-embedded-databases-in-the-same-process </br>
\[6\]: https://www.usenix.org/system/files/conference/inflow14/inflow14-yang.pdf </br>
\[7\]: http://thenewstack.io/intel-gives-the-etcd-key-value-store-a-needed-boost/ </br>


[1]: https://raft.github.io/raft.pdf
[2]: https://github.com/facebook/rocksdb/wiki/RocksDB-FAQ
[3]: https://github.com/facebook/rocksdb/wiki/Basic-Operations#asynchronous-writes
[5]: https://github.com/facebook/rocksdb/wiki/RocksDB-Basics#support-for-multiple-embedded-databases-in-the-same-process
[6]: https://www.usenix.org/system/files/conference/inflow14/inflow14-yang.pdf
[7]: http://thenewstack.io/intel-gives-the-etcd-key-value-store-a-needed-boost/